phofl closed this pull request 1 month ago
> Is this something that you all would consider merging?
Yes. This would extend what was done in https://github.com/pydata/xarray/pull/8412 to address #8409
I think I initially chose this approach to make the graph absolutely embarrassingly parallel.
Can you add a test to dask or confirm that dask has a test that makes sure the ordering/scheduling is "nice" for this kind of case (widely duplicated dependency)?
> Can you add a test to dask or confirm that dask has a test that makes sure the ordering/scheduling is "nice" for this kind of case (widely duplicated dependency)?
This is the same pattern as we have with the open_dataset tasks and in the new dask array take implementation; we already have existing tests covering these two patterns.
> This is the same pattern as we have with the open_dataset tasks
Ya this pattern used to suck, so I avoided it. But good to know it's fixed. Thanks for the PR!
> Ya this pattern used to suck, so I avoided it. But good to know it's fixed. Thanks for the PR!
TLDR is: If your task does not have a callable, it will be ignored during ordering
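For reference, a minimal sketch of the distinction being referred to, assuming dask's low-level dict-graph conventions (my own illustration, not the actual ordering code): a graph entry is only a task if its value is a tuple whose first element is callable; a plain value written under a key is treated as data.

```python
# Hedged illustration: in a dask graph dict, only tuples with a callable
# in the first position are tasks; plain values are data entries and, per
# the comment above, do not drive ordering decisions.
from dask.core import istask

index_like = [0, 1, 2]              # stand-in for a large constant (e.g. a PandasIndex)

dsk = {
    "idx": index_like,              # no callable -> a data entry
    ("out", 0): (len, "idx"),       # callable first -> a real task, referencing "idx" by key
}

print(istask(dsk["idx"]))           # False
print(istask(dsk[("out", 0)]))      # True
```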
When looking at the `map_blocks` variant of the benchmark here, I noticed that we had a 30 MiB graph for the medium variant. 10 MiB of that was just the repeated embedding of the PandasIndexes as an argument for `map_blocks`. Writing them directly to the graph de-duplicates the value, so the object exists only once instead of many, many times; we can then reference the key in the function arguments. The tokenize call adds some overhead, so there is a drawback to this.
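To make the de-duplication idea concrete, here is a small hedged sketch (not xarray's actual implementation; `func`, the key names, and the sizes are invented for illustration): embedding a large constant in every task repeats it once per task when tasks are serialized individually, whereas writing it to the graph once under a tokenized key and referencing that key stores it a single time.

```python
# Illustrative sketch of the de-duplication, assuming a low-level dict graph.
import pickle
from dask.base import tokenize

index = list(range(100_000))        # stand-in for a PandasIndex
nblocks = 50

def func(block_id, idx):            # hypothetical per-block function
    return block_id, len(idx)

# Embedded: the index object appears as an argument of every task.
embedded = {("out", i): (func, i, index) for i in range(nblocks)}

# Referenced: the index is written to the graph once under a tokenized key
# (the tokenize call is the overhead mentioned above); tasks reference the key.
index_key = "index-" + tokenize(index)
referenced = {index_key: index}
referenced.update({("out", i): (func, i, index_key) for i in range(nblocks)})

def graph_nbytes(dsk):
    # Serialize each graph value on its own, roughly mimicking per-task handling,
    # so cross-task sharing of the same Python object does not hide the duplication.
    return sum(len(pickle.dumps(v)) for v in dsk.values())

# The embedded graph carries the index nblocks times, the referenced graph once.
print(graph_nbytes(embedded), graph_nbytes(referenced))
```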
Happy to open an issue if required
Is this something that you all would consider merging?
cc @dcherian