microsoft / SandDance

Visually explore, understand, and present your data.
https://microsoft.github.io/SandDance
MIT License
6.4k stars, 525 forks

Optimize performance for large datasets #281

Open drewkerwin opened 4 years ago

drewkerwin commented 4 years ago

I am trying to analyze a 500MB CSV file, and rendering takes minutes on a strong Windows machine. Sometimes the plot will render the first time in a few minutes, but never again (if I change the x-axis, for example). Is it possible to launch this tool without analyzing/rendering by default? That way I can choose my options and then render once. Also, is anyone working on improving the performance of this tool?

danmarshall commented 4 years ago

Hi @drewkerwin, thanks for the feedback. When creating a chart, we build a dependency graph of all the variables used to specify the layout. We keep this graph in memory to make interaction more responsive: when a user changes a slider, for example, we don't recompute the entire layout, just the parts that depend on the slider value. [image] As you've noticed, this optimization becomes a liability for large datasets, because it consumes more resources. In these cases, we would need to degrade interactivity and recompute the entire layout.
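
To illustrate the idea of recomputing only the affected parts, here is a minimal sketch of a dependency graph in Python. It is not SandDance's actual code; the class `DependencyGraph` and the variable names (`data`, `bin_width`, `scale`, `bins`, `positions`, `total`) are all hypothetical. It also shows the cost: every intermediate value stays in memory, which is what becomes expensive for large datasets.

```python
class DependencyGraph:
    """Toy dependency graph: changing an input recomputes only its dependents."""

    def __init__(self):
        self.values = {}      # name -> current value (kept in memory)
        self.formulas = {}    # derived name -> (input names, function)
        self.dependents = {}  # name -> derived names that read it
        self.recomputed = []  # log of derived values recomputed so far

    def set_input(self, name, value):
        self.values[name] = value
        self._propagate(name)

    def define(self, name, inputs, fn):
        # Inputs must already exist; this sketch does no cycle detection.
        self.formulas[name] = (inputs, fn)
        for dep in inputs:
            self.dependents.setdefault(dep, []).append(name)
        self._compute(name)

    def _compute(self, name):
        inputs, fn = self.formulas[name]
        self.values[name] = fn(*(self.values[i] for i in inputs))
        self.recomputed.append(name)

    def _propagate(self, name):
        # Naive depth-first propagation; a node reachable by two paths
        # would be recomputed twice, which is acceptable for a sketch.
        for child in self.dependents.get(name, []):
            self._compute(child)
            self._propagate(child)


graph = DependencyGraph()
graph.set_input("data", [1, 5, 9, 12])
graph.set_input("bin_width", 5)   # stands in for a slider value
graph.set_input("scale", 2)
graph.define("bins", ["data", "bin_width"], lambda d, w: [x // w for x in d])
graph.define("positions", ["bins", "scale"], lambda b, s: [x * s for x in b])
graph.define("total", ["data"], lambda d: sum(d))

graph.recomputed.clear()
graph.set_input("bin_width", 10)  # "move the slider"
print(graph.recomputed)           # only bins and positions are recomputed
```

Here `total` is untouched when `bin_width` changes because it does not depend on it. Recomputing the entire layout instead would mean discarding `values` and evaluating every formula again, trading responsiveness for lower memory use.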

drewkerwin commented 4 years ago

Thank you @danmarshall. Also note that a little pre-processing in Python to reduce the size of the CSV helps a great deal, e.g. extracting only the relevant columns into a modified CSV.
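
A minimal sketch of that kind of pre-processing, using only Python's standard `csv` module so the file is streamed row by row rather than loaded into memory. The function name `extract_columns`, the file names, and the column names in the usage note are hypothetical.

```python
import csv


def extract_columns(src_path, dst_path, columns):
    """Copy only `columns` from src_path into dst_path; return the row count."""
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        # Fail early if a requested column is not in the header row.
        missing = [c for c in columns if c not in (reader.fieldnames or [])]
        if missing:
            raise KeyError(f"columns not found in source CSV: {missing}")
        # extrasaction="ignore" drops every field not listed in `columns`.
        writer = csv.DictWriter(dst, fieldnames=columns, extrasaction="ignore")
        writer.writeheader()
        count = 0
        for row in reader:
            writer.writerow(row)
            count += 1
    return count
```

For example, `extract_columns("full.csv", "reduced.csv", ["date", "region", "sales"])` would write a smaller file containing just those three columns, which can then be loaded into SandDance.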