pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Refactor Dataset internals to store data variables and coordinate variables as separate dicts #9203

Open TomNicholas opened 3 days ago

TomNicholas commented 3 days ago

There's a lot of other discussion in #9063, but I wanted to pull out this suggestion for independent discussion:

  • Our list of internal attributes on a DataTree node is still not just those on a Dataset plus the inherited coordinates

This is indeed a bit of a con, but in my mind the right fix is probably to adjust the Dataset data model to use two dictionaries, data_variables and coord_variables, rather than the current solution of a single dict of variables plus a set of coord_names. Using a separate dictionary for coord_variables would also be more aligned with how DataArray is implemented. The internal Dataset data model is a holdover from the very early days of Xarray, before we had a notion of coordinate variables that are not indexes.

Originally posted by @shoyer in https://github.com/pydata/xarray/issues/9063#issuecomment-2198775511
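
To make the two layouts in the quoted comment concrete, here is a minimal sketch in plain Python. The class names `CurrentLayout` and `ProposedLayout` are hypothetical and exist only for illustration; they are not xarray classes, and plain lists stand in for Variable objects. Only the field names `variables`, `coord_names`, `data_variables` and `coord_variables` come from the discussion above.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class CurrentLayout:
    """Hypothetical stand-in: one dict of variables plus a set of coordinate names."""

    variables: dict[str, Any] = field(default_factory=dict)
    coord_names: set[str] = field(default_factory=set)

    @property
    def data_vars(self) -> dict[str, Any]:
        # Data variables must be derived by filtering out the coordinate names.
        return {k: v for k, v in self.variables.items() if k not in self.coord_names}

    @property
    def coords(self) -> dict[str, Any]:
        # Coordinates are derived the same way, by filtering on coord_names.
        return {k: v for k, v in self.variables.items() if k in self.coord_names}


@dataclass
class ProposedLayout:
    """Hypothetical stand-in: two separate dicts, no filtering needed."""

    data_variables: dict[str, Any] = field(default_factory=dict)
    coord_variables: dict[str, Any] = field(default_factory=dict)

    @property
    def variables(self) -> dict[str, Any]:
        # The combined mapping becomes the derived quantity instead.
        return {**self.coord_variables, **self.data_variables}


# Plain lists stand in for Variable objects (an assumption of this sketch).
current = CurrentLayout(
    variables={"x": [0, 1, 2], "temp": [10.0, 11.0, 12.0]},
    coord_names={"x"},
)
proposed = ProposedLayout(
    data_variables={"temp": [10.0, 11.0, 12.0]},
    coord_variables={"x": [0, 1, 2]},
)
```

Both objects describe the same contents. The difference is which mapping is stored and which is computed: the first layout stores the union and filters on every access, while the second stores the two parts and merges them only when the full set of variables is requested.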

TomNicholas commented 2 days ago

In fact, could the coord_variables dict just be an actual xr.Coordinates object?

(idea from https://github.com/pydata/xarray/issues/9204#issue-2384855788)

shoyer commented 2 days ago

In fact, could the coord_variables dict just be an actual xr.Coordinates object?

I think Coordinates should remain a thin wrapper object. It needs access to both coordinate Variable objects and associated indexes.
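
One way to picture the "thin wrapper" idea is the sketch below. It is an illustration under stated assumptions, not xarray's implementation: `DatasetSketch` and `CoordinatesView` are hypothetical names, and plain lists and strings stand in for Variable and index objects. The point it tries to show is that the wrapper stores no data of its own and reads both the coordinate variables and the indexes from its parent.

```python
from collections.abc import Mapping


class CoordinatesView(Mapping):
    """Hypothetical thin wrapper: holds only a reference to its parent."""

    def __init__(self, parent):
        self._parent = parent

    def __getitem__(self, name):
        # Coordinate variables are looked up on the parent, not copied.
        return self._parent._coord_variables[name]

    def __iter__(self):
        return iter(self._parent._coord_variables)

    def __len__(self):
        return len(self._parent._coord_variables)

    @property
    def indexes(self):
        # The wrapper also needs the indexes, which live in a separate dict.
        return dict(self._parent._indexes)


class DatasetSketch:
    """Hypothetical container with separate dicts for data, coords and indexes."""

    def __init__(self, data_variables, coord_variables, indexes):
        self._data_variables = dict(data_variables)
        self._coord_variables = dict(coord_variables)
        self._indexes = dict(indexes)

    @property
    def coords(self):
        # A new, cheap view is created on each access.
        return CoordinatesView(self)


ds = DatasetSketch(
    data_variables={"temp": [10.0, 11.0, 12.0]},
    coord_variables={"x": [0, 1, 2]},
    indexes={"x": "index-for-x"},  # placeholder string, not a real index object
)
coords = ds.coords
```

Because the view needs two pieces of parent state (the coordinate variables and the indexes), it cannot simply be the `coord_variables` dict itself, which is the concern raised in the comment above.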