pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0
3.49k stars 1.04k forks source link

Add copy option in DataTree.from_dict #9193

Open Illviljan opened 3 days ago

Illviljan commented 3 days ago

Significantly speed up from_dict by removing unwanted copying.

import xarray as xr
from xarray.core.datatree import DataTree

run1 = DataTree.from_dict({"run1": xr.Dataset({"a": 1})})
d = {"run1": run1}
%timeit DataTree.from_dict(d, copy=True)  # 1.36 ms ± 41.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit DataTree.from_dict(d, copy=False) # 178 µs ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Illviljan commented 3 days ago
FAILED xarray/tests/test_treenode.py::TestSetNodes::test_set_child_node - TypeError: _set() got an unexpected keyword argument 'copy'
FAILED xarray/tests/test_treenode.py::TestSetNodes::test_set_grandchild - TypeError: _set() got an unexpected keyword argument 'copy'
FAILED xarray/tests/test_treenode.py::TestSetNodes::test_create_intermediate_child - TypeError: _set() got an unexpected keyword argument 'copy'
FAILED xarray/tests/test_treenode.py::TestSetNodes::test_overwrite_child - TypeError: _set() got an unexpected keyword argument 'copy'
FAILED xarray/tests/test_treenode.py::TestPruning::test_del_child - TypeError: _set() got an unexpected keyword argument 'copy'

Is treenode used anywhere else besides as a mixin-class for DataTree? DataTree overrides TreeNode._set anyway.

Illviljan commented 3 days ago
FAILED xarray/tests/test_datatree.py::TestSetItem::test_grafted_subtree_retains_name - AssertionError: assert 'new_subtree_name' == 'original_subtree_name'

  - original_subtree_name
  ? ----- ^^
  + new_subtree_name
  ?  ^^

This test is a mutability test and a typo caused this error. This test used root["new_subtree_name"] = subtree # noqa which uses _set_item under the hood and should use copy=True by default.

shoyer commented 2 days ago

Elsewhere in Xarray (e.g., on Dataset.copy()), the copy keyword argument refers specifically to whether copying array data is copied, and metadata fields are always copied. This seems potentially inconsistent with that?

Illviljan commented 2 days ago

Maybe _fastpath is better?

shoyer commented 2 days ago

Maybe _fastpath is better?

👍 for a private constructor, if we want to use this internally inside Xarray.

Illviljan commented 2 days ago

Well it wouldn't be for internal xarray usage for me, it would be for a backend/mfdatatree function. But it is a typical way we use to speed things up and implying checks are being bypassed, do we have any other public examples?

My real problem is pretty similar to the intro example but I have 100+ datatrees instead, such a simple operation takes 1.5 minutes because of all the copies, it goes down to around 19ms with this pr.

shoyer commented 1 day ago

My real problem is pretty similar to the intro example but I have 100+ datatrees instead, such a simple operation takes 1.5 minutes because of all the copies, it goes down to around 19ms with this pr.

Yikes, that is pretty bad!

Do you have a slightly more realistic benchmark you could share, and/or add to the asv tests?

I am sure we can do much better for performance in from_dict, but there are a few related issues here that I would like get right first, including a major refactor to DataTree internals (https://github.com/pydata/xarray/pull/9063) and changing the DataTree constructor (https://github.com/pydata/xarray/issues/9196).

Once we have the right behavior, it makes sense to start working on performance. I am optimistic that we can figure out how to improve performance without needing to expand the public API.