lincc-frameworks / nested-dask

Connector project to enable Dask on Nested-Pandas
https://nested-dask.readthedocs.io/en/latest/
MIT License

PSA: Improved graph construction time for calls including `from_delayed` with `dask-expr>=1.1.13` #52

Open hendrikmakait opened 1 week ago

hendrikmakait commented 1 week ago

During today's Dask monthly meeting, people mentioned exceedingly long graph construction times related to delayed objects. I see that you are using `from_delayed` here. Please note that `dask-expr>=1.1.13` includes a fix that improves graph construction time for large collections of delayed objects in `dd.from_delayed` by orders of magnitude (https://github.com/dask/dask-expr/pull/1132).

I don't know your specific problem, so I'm not sure if this will fix it, but I wanted to point it out nonetheless in case it helps.

hombit commented 1 week ago

@hendrikmakait Thank you! I can confirm that the scheduling time (the time between when I hit `.compute()` and when tasks start to stream) is significantly smaller after updating dask-expr from 1.1.11 to 1.1.13. For a simple pipeline with 5k partitions it went from 156 s to 12 s, and for 40k partitions it went from >3 hours to 12 minutes.

hendrikmakait commented 1 week ago

That's great to hear!

Please let us know if you notice any other performance bottlenecks. 12 minutes may still be somewhat long, depending on the total number of tasks.

hombit commented 1 week ago

@hendrikmakait 12 minutes is a significant amount of time; the "run time" was ~90 minutes on 60 three-thread workers. You can find more details in this notebook: https://github.com/lincc-frameworks/notebooks_lf/blob/main/ztf_periodogram/SIMPLIFIED-ztf_periodogram_PSC.ipynb

It is also worth mentioning that the manager node used ~90 GB of memory. The task graph had ~0.5M tasks.

When I ran a more complex notebook, the "lazy cells" (where I plan Dask computations before `.compute()`) took a dozen minutes. That is quite long, and we expect our users to implement analyses an order of magnitude more complicated.

hendrikmakait commented 1 week ago

Is there a public dataset that one could access to reproduce this? If not, I recommend profiling your code with py-spy. If you can provide a profile in speedscope format, we could do a sanity check. Also, 90 GiB of memory on the scheduler seems excessive, in particular with "only" 500k tasks.
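For reference, py-spy can emit speedscope-format profiles directly (viewable at https://www.speedscope.app); a sketch, assuming `pip install py-spy`, where `build_graph.py` is a hypothetical stand-in for the graph-construction script:

```shell
# Profile a script from launch, writing a speedscope-format file.
py-spy record --format speedscope -o graph-build.speedscope.json -- python build_graph.py

# Or attach to an already-running process by PID (12345 is a placeholder).
py-spy record --format speedscope -o attach.speedscope.json --pid 12345
```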

hombit commented 6 days ago

@hendrikmakait The data is public; I made a notebook that uses the HTTPS data storage. The dataset is pretty large, but the graph-building part doesn't really touch any data files, so the manager-node issues are reproducible without fetching any actual data. The notebook: https://github.com/lincc-frameworks/notebooks_lf/blob/main/ztf_periodogram/ztf-periodogram-data.lsdb.io.ipynb

It took 90 GB of RAM and 7 minutes to schedule the graph. (Am I using the right terminology here? What I mean is the time between when I hit `.compute()` and when tasks start to stream.)

I haven't tried py-spy yet, thanks for the recommendation!

hombit commented 5 days ago

@hendrikmakait I ran the code (up to the `len(x.dask)` call) and profiled it with py-spy and memray. I don't know much about Dask internals, so it's hard to tell what exactly is happening there, but I believe memray indicates that the graph size is actually ~80 GB.

Here you can find the code and outputs for both profilers: https://github.com/lincc-frameworks/notebooks_lf/tree/main/ztf_periodogram/profile-dask-graph

I'd really appreciate it if you could take a look. I believe you would have much better insight into this!

**Update 2024-09-12:** I tried to turn off optimizations with `dask.config.set(array_optimize=None, dataframe_optimize=None, delayed_optimize=None)`, but it didn't change anything.
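The same knobs can also be applied locally as a context manager rather than globally, so the rest of the session keeps the default optimizers; a minimal sketch using the keys from the comment above:

```python
import dask

# Disable the per-collection optimization hooks only within this block.
with dask.config.set(array_optimize=None,
                     dataframe_optimize=None,
                     delayed_optimize=None):
    # Inside the context the override is visible in the config.
    assert dask.config.get("delayed_optimize") is None
```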

hombit commented 5 hours ago

@hendrikmakait Just a gentle reminder: could you please take a look at these profiler logs? Should I convert this into a Dask/dask-expr issue?

hendrikmakait commented 2 hours ago

Thanks for creating the profiles! I've had a brief look at them and there's nothing that would point toward an obvious issue at first glance. I'm not sure if I'll have any time soon to dig deeper into this. It looks like I'd have to understand the graph structures you create in much more detail to point out issues.