Open dovahcrow opened 4 years ago
@brandonlockhart @jinglinpeng Maybe you guys can discuss and take one function each? I'm available for discussion about the solution anytime.
Is it the case that we want to do a computation with Dask if and only if it is a computation over the entire dataset? For example, in `plot(df)`, for a bar chart we need the count of unique values (use Dask for this); then, after executing the computation graph for `plot(df)`, we can compute the percentages for the bar charts using pandas (since there are only `ngroups` computations). As another example, for the qq-plot in `plot(df, x)`, compute the actual quantiles, mean, and std using Dask; then, after all Dask computations for `plot(df, x)`, compute the 100 normal quantiles with pandas (otherwise, we would need a `compute()` in the middle of `plot(df, x)` to get the mean/std for the normal parameters). @dovahcrow
**Is your feature request related to a problem? Please describe.**
Currently, the `plot` and `plot_correlation` functions are not performant, because some functions are not using Dask properly. The most severe pitfalls I found are holding the GIL inside functions, which forbids parallelism, and computing the Dask graph prematurely multiple times (e.g. `len(df)` will trigger a computation). Note that Dask WON'T reuse computation results across `compute` calls, so we'd better call `compute` once at the very end. Please take a look at commit d76697791a1148387ab24bc319fc4f0b081736fc for the optimization on `plot_missing`.

**Describe the solution you'd like**
Make `plot` and `plot_missing` a single computation graph as far as possible.
**Describe alternatives you've considered**
No.
**Additional context**
Using the 2nd-gen scheduler (`dask.distributed`) will help a lot, since it carries a dashboard for performance debugging. Please do use

```
Client(LocalCluster(processes=False, n_workers=1, threads_per_worker=<num of cpu cores on your machine - 1>))
```

to start the client. By default the client is in multiprocess mode and will copy data across workers!
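A runnable version of the recommended setup, assuming `dask.distributed` is installed (the thread count below is a stand-in for "number of CPU cores minus 1", derived from `os.cpu_count()`):

```python
import os
from dask.distributed import Client, LocalCluster

# Threaded, single-worker cluster as recommended above: processes=False
# keeps all workers in one process, so no data is copied between workers.
n_threads = max(1, (os.cpu_count() or 2) - 1)
cluster = LocalCluster(processes=False, n_workers=1,
                       threads_per_worker=n_threads)
client = Client(cluster)

# Any Dask work submitted now runs on this threaded cluster and shows
# up on the scheduler's performance dashboard.
result = client.submit(sum, [1, 2, 3]).result()

client.close()
cluster.close()
```

With `processes=False`, workers share memory in a single process, avoiding the inter-worker data copies mentioned above; the trade-off is that GIL-holding code will not parallelize, which is why removing GIL-holding sections matters.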