sfu-db / dataprep

Open-source low-code data preparation library in Python. Collect, clean, and visualize your data in Python with a few lines of code.
http://dataprep.ai
MIT License

Optimizing the performance of EDA module #257

Open dovahcrow opened 4 years ago

dovahcrow commented 4 years ago

Is your feature request related to a problem? Please describe. Currently, the plot and plot_correlation functions are not performant because some functions are not using dask properly. In detail, the most severe pitfalls I found are holding the GIL inside the functions, which forbids parallelism, and computing the dask graph prematurely multiple times (e.g. len(df) will trigger a computation). Note that dask WON'T reuse computation results across computes, so we'd better call compute once at the very end.

Please take a look at commit d76697791a1148387ab24bc319fc4f0b081736fc for the optimization of plot_missing.

Describe the solution you'd like Make plot and plot_missing each build a single computation graph as far as possible.

Describe alternatives you've considered No.

Additional context Using the 2nd gen scheduler (dask.distributed) will help a lot, since it carries a dashboard for performance debugging.

Please do use Client(LocalCluster(processes=False, n_workers=1, threads_per_worker=<num of cpu cores on your machine - 1>)) to start the client. By default the client is in multiprocess mode and will copy data across workers!
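The setup above can be sketched as follows (a minimal example, assuming dask.distributed is installed; the thread count is a placeholder you should tune to your machine):

```python
import os

from dask.distributed import Client, LocalCluster

# Single-process, multi-threaded local cluster: with processes=False
# no data is copied between workers.
cluster = LocalCluster(
    processes=False,
    n_workers=1,
    threads_per_worker=max((os.cpu_count() or 2) - 1, 1),
)
client = Client(cluster)
# client.dashboard_link points at the performance-debugging dashboard.
```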

dovahcrow commented 4 years ago

@brandonlockhart @jinglinpeng Maybe you guys can discuss and take one function each? I'm available for discussion about the solution anytime.

brandonlockhart commented 4 years ago

Is it the case that we want to do a computation using Dask if and only if it is a computation over the entire dataset? For example, in plot(df), for a bar chart we need the counts of the unique values (use Dask for this); then, after executing the computation graph for plot(df), we should compute the percentages for the bar charts using pandas (since there are only ngroups computations). As another example, for the qq-plot in plot(df, x), compute the actual quantiles, mean, and std using Dask; then, after all Dask computations for plot(df, x), compute the 100 normal quantiles with pandas/numpy (otherwise, we would need a compute() in the middle of plot(df, x) to get the mean/std for the normal parameters). @dovahcrow