phofl closed this pull request 1 month ago
> Is this something that you all would consider merging?
Yes. This would extend what was done in https://github.com/pydata/xarray/pull/8412 to address #8409
I think I initially chose this approach to make the graph absolutely embarrassingly parallel.
Can you add a test to dask or confirm that dask has a test that makes sure the ordering/scheduling is "nice" for this kind of case (widely duplicated dependency)?
> Can you add a test to dask or confirm that dask has a test that makes sure the ordering/scheduling is "nice" for this kind of case (widely duplicated dependency)?
This is the same pattern as we have with the open_dataset tasks and in the new dask array take implementation; we already have existing tests covering these two patterns.
> This is the same pattern as we have with the open_dataset tasks
Ya this pattern used to suck, so I avoided it. But good to know it's fixed. Thanks for the PR!
> Ya this pattern used to suck, so I avoided it. But good to know it's fixed. Thanks for the PR!
TLDR is: If your task does not have a callable, it will be ignored during ordering
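For reference, a minimal sketch of the distinction being referred to, assuming dask's low-level dict-graph conventions (my own illustration, not the actual ordering code): a graph entry is only a task if its value is a tuple whose first element is callable; a plain value written under a key is treated as data.

```python
# Hedged illustration: in a dask graph dict, only tuples with a callable
# in the first position are tasks; plain values are data entries and, per
# the comment above, do not drive ordering decisions.
from dask.core import istask

index_like = [0, 1, 2]              # stand-in for a large constant (e.g. a PandasIndex)

dsk = {
    "idx": index_like,              # no callable -> a data entry
    ("out", 0): (len, "idx"),       # callable first -> a real task, referencing "idx" by key
}

print(istask(dsk["idx"]))           # False
print(istask(dsk[("out", 0)]))      # True
```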
When looking at the `map_blocks` variant of the benchmark here, I noticed that we had a 30 MiB graph for the medium variant. 10 MiB of that was just the repeated embedding of the PandasIndexes as an argument for `map_blocks`. Writing them directly to the graph de-duplicates the value, so the object exists only once instead of many, many times; we can then reference the key in the function arguments. The tokenize call adds some overhead, so there is a drawback to this.
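To make the de-duplication idea concrete, here is a small hedged sketch (not xarray's actual implementation; `func`, the key names, and the sizes are invented for illustration): embedding a large constant in every task repeats it once per task when tasks are serialized individually, whereas writing it to the graph once under a tokenized key and referencing that key stores it a single time.

```python
# Illustrative sketch of the de-duplication, assuming a low-level dict graph.
import pickle
from dask.base import tokenize

index = list(range(100_000))        # stand-in for a PandasIndex
nblocks = 50

def func(block_id, idx):            # hypothetical per-block function
    return block_id, len(idx)

# Embedded: the index object appears as an argument of every task.
embedded = {("out", i): (func, i, index) for i in range(nblocks)}

# Referenced: the index is written to the graph once under a tokenized key
# (the tokenize call is the overhead mentioned above); tasks reference the key.
index_key = "index-" + tokenize(index)
referenced = {index_key: index}
referenced.update({("out", i): (func, i, index_key) for i in range(nblocks)})

def graph_nbytes(dsk):
    # Serialize each graph value on its own, roughly mimicking per-task handling,
    # so cross-task sharing of the same Python object does not hide the duplication.
    return sum(len(pickle.dumps(v)) for v in dsk.values())

# The embedded graph carries the index nblocks times, the referenced graph once.
print(graph_nbytes(embedded), graph_nbytes(referenced))
```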
Happy to open an issue if required
Is this something that you all would consider merging?
cc @dcherian