Feature Request: Hierarchical storage and processing in xarray

emilbiju commented 4 years ago

I am using xarray for processing geospatial data and have encountered two major challenges with existing data structures in xarray:

Data arrays stored in an xarray Dataset cannot be grouped into hierarchical levels/logical subsets to reflect the internal organisation of the data. This makes it difficult to identify and process a subset of the data variables that pertain to a specific problem.
When two data arrays having a shared dimension but different coordinate values along the dimension are merged into a Dataset, the union of coordinate values from the 2 data arrays becomes the new coordinate set corresponding to that dimension. Consequently, when the value of a variable in the dataset corresponding to a coordinate value is unknown, nan is used as a substitute which results in memory wastage.
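
To make the second point concrete, here is a minimal sketch in plain Python of how aligning on the union of coordinates pads each array with nan. Dicts stand in for data arrays, and the variable names and values are invented for illustration; this is not xarray's actual merge code.

```python
import math

# Two hypothetical variables defined on different 'time' coordinates.
temperature = {0: 280.1, 1: 281.4, 2: 279.9}
pressure = {10: 1012.0, 11: 1009.5}

def outer_join(*arrays):
    """Align arrays on the union of their coordinates, padding with nan."""
    coords = sorted(set().union(*arrays))
    aligned = [[arr.get(c, math.nan) for c in coords] for arr in arrays]
    return coords, aligned

coords, (temp_aligned, pres_aligned) = outer_join(temperature, pressure)

# Count how many stored slots are padding rather than data.
n_padding = sum(math.isnan(v) for v in temp_aligned + pres_aligned)
n_stored = len(temp_aligned) + len(pres_aligned)
```

In this toy case, 5 real values end up occupying 10 slots once both arrays share the union of coordinates.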

I would like to suggest a tree-based data structure for xarray in which the leaves store individual data arrays and the other nodes store the hierarchical information. Since data arrays are stored independently, each dimension only needs to be associated with coordinate values that are valid for that data array.

To meet these requirements, I have implemented a data structure that also supports the below capabilities:

Standard xarray methods can be applied to the tree at all hierarchical levels, i.e., when a function is called at a hierarchical level, it is mapped over all data arrays that occur at the leaves under the corresponding node. For example, say I have a tree object (let's call it dt) with child nodes: weather, satellite image and population. Each of these nodes has data arrays/subtrees under it.

The mean over time of all data variables associated with weather can be obtained using dt.weather.mean('time') which applies the function to sea_surface_temperature, dew_point_temperature, wind_speed and pressure.

It can be encoded into the netCDF format, like xarray Datasets.
It supports item assignment at all hierarchical levels.
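
The mapping behaviour described above could be sketched roughly as follows. This is not the implementation referred to in this comment (which is not shown in the thread); the Node class and map_over_leaves are hypothetical names, and plain lists stand in for data arrays.

```python
from statistics import mean

class Node:
    """Hypothetical tree node: children are either Nodes or leaf arrays."""

    def __init__(self, **children):
        self.children = dict(children)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so this gives
        # attribute-style access to children (dt.weather).
        try:
            return self.__dict__["children"][name]
        except KeyError:
            raise AttributeError(name) from None

    def map_over_leaves(self, func):
        """Return a new tree with func applied to every leaf below this node."""
        return Node(**{
            name: child.map_over_leaves(func) if isinstance(child, Node) else func(child)
            for name, child in self.children.items()
        })

# Lists stand in for time series along a 'time' dimension.
dt = Node(
    weather=Node(wind_speed=[3.0, 5.0], pressure=[1000.0, 1010.0]),
    population=[100, 120],
)
weather_mean = dt.weather.map_over_leaves(mean)
```

Calling the function on dt.weather only touches the leaves under that node; population is left alone.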

I would like to know of the possibility of introducing such a data structure in xarray and the challenges involved in the same.

jhamman commented 4 years ago

@emilbiju - thanks for opening an issue here. You may want to take a look at the conversation in #1092.

emilbiju commented 4 years ago

Thanks @jhamman for sharing the link. Here are my thoughts on the same:

For use-cases similar to the one I have mentioned, I think it would be more meaningful to allow the tree structure (calling it Datatree further) to exist as a separate data structure instead of residing within the Dataset. From what I understand, the xarray Dataset would enforce all its component variables to share the same coordinate set for a given dimension name. This would again result in memory wastage with nan values when the value corresponding to a coordinate is unknown.

Besides, xarray only allows attribute access for getting (and not setting) values, but a separate data structure can allow attribute access for setting values as well. For example, the data structure that I have implemented would allow something like dt.weather = dt.weather.mean('time') to alter all the data arrays under the weather node.

I am currently using attribute-based access for accessing child nodes/data arrays in the Datatree as it appears to reflect the tree structure better, but as @shoyer has pointed out, tuple-based access might be easier to use programmatically.

Instead of using netCDF4 groups for encoding the Datatree, I am currently following a simple 3-step process:

Combine all the data arrays at the leaves of a Datatree object into a dataset.
Add an additional data array to the dataset that would contain an ancestor matrix (or any other array-like representation) that can encode the hierarchical structure with a coordinate set containing names of the tree nodes.
Use the xarray.Dataset.to_netcdf method to store it in a netCDF file.

Therefore, within the netCDF file, it would exist just as a Dataset. A specially implemented Datatree.open_datatree method can open the dataset, detect this additional array and recreate the tree structure to instantiate the object. I would like to know if using netCDF4 groups instead provides any advantages over this approach?
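
As an illustration of the ancestor-matrix idea, here is a small sketch in plain Python. The tree, the function names and the matrix layout are assumptions made for this example; the actual encoding used in the implementation described here is not shown in the thread.

```python
# Hypothetical tree given as child -> parent (None marks the root).
parents = {
    "root": None,
    "weather": "root",
    "population": "root",
    "wind_speed": "weather",
    "pressure": "weather",
}
names = list(parents)

def ancestor_matrix(parents, names):
    """matrix[i][j] is 1 when names[j] is a strict ancestor of names[i]."""
    index = {name: i for i, name in enumerate(names)}
    matrix = [[0] * len(names) for _ in names]
    for name in names:
        node = parents[name]
        while node is not None:
            matrix[index[name]][index[node]] = 1
            node = parents[node]
    return matrix

def parents_from_matrix(matrix, names):
    """Recover child -> parent: the parent is the deepest ancestor of a node."""
    depth = [sum(row) for row in matrix]  # number of ancestors of each node
    recovered = {}
    for i, name in enumerate(names):
        ancestors = [j for j, flag in enumerate(matrix[i]) if flag]
        recovered[name] = names[max(ancestors, key=depth.__getitem__)] if ancestors else None
    return recovered

matrix = ancestor_matrix(parents, names)
recovered = parents_from_matrix(matrix, names)
```

Because the ancestors of any node form a single chain, the matrix is enough to rebuild the tree when the file is reopened.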

dcherian commented 4 years ago

Thanks for writing this up @emilbiju. These are very interesting ideas.

The nice thing about using NetCDF groups (or HDF5?) is that it is a standard and your data files are readable using other software.
So far, xarray has been reluctant to add "groups" or this kind of hierarchical organization because of all the additional complexity involved (#1092)
That said, there is definitely interest in a package that provides a high-level object composed of multiple xarray datasets (again #1092). So I encourage you to post your code online so others can try it out and iterate.

a. For example, our friends over at Arviz have an InferenceData structure composed of multiple Datasets that is represented on-disk using NetCDF groups: https://arviz-devs.github.io/arviz/notebooks/XarrayforArviZ.html

shoyer commented 4 years ago

I would be open to exploring adding a hierarchical data structure into xarray (on an experimental basis, to start), but it would need someone with serious interest and time to make it happen. Certainly there are plenty of use cases across various fields.

shoyer commented 4 years ago

The data model you sketch out here looks very similar to what we discussed in #1092. I agree that the semantics are well defined.

The main question in my mind is whether it would make more sense to make an entirely new data structure (e.g., xarray.TreeDataset) or add in a new feature like groups to the existing xarray.Dataset.

Probably a new data structure would be easier at this point, because it would keep Dataset simpler and wouldn't break existing code that works on xarray.Dataset.

jhamman commented 3 years ago

@joshmoore - based on https://github.com/pangeo-forge/pangeo-forge/pull/27#issuecomment-755397835, you may be interested in this issue. One way to do multiscale datasets in Xarray would be to use hierarchical groups (one group per scale).

davidbrochart commented 3 years ago

a. For example, our friends over at Arviz have an InferenceData structure composed of multiple Datasets that is represented on-disk using NetCDF groups: https://arviz-devs.github.io/arviz/notebooks/XarrayforArviZ.html

Just a note that this link has moved to: https://arviz-devs.github.io/arviz/getting_started/XarrayforArviZ.html

joshmoore commented 3 years ago

Thanks for the link, @jhamman. The most immediate issue I ran into when trying to use xarray with OME-Zarr data does seem similar. A rough representation of one multiscale image is:

image_pyramid:
  |_ zyx_array_high_res
  |_ zyx_array_mid_res
  |_ zyx_array_low_res

but of course the x, y and z dimensions are of different sizes in each volume.

thewtex commented 3 years ago

@jhamman @joshmoore a prototype to bring together XArray and OME-Zarr/NGFF with multiple groups: https://github.com/OpenImaging/miqa/blob/master/server/scripts/compress_encode.py

rabernat commented 3 years ago

On today's Xarray dev call, we discussed pursuing another CZI grant to support this feature in Xarray. The image pyramid use case would provide a strong link to the bioimaging community. @alexamici and the B-open folks seem enthusiastic.

I had to leave the meeting early, so I didn't hear the end of the conversation. But did we decide who might serve as PI for such a proposal?

dcherian commented 3 years ago

But did we decide who might serve as PI for such a proposal?

No.

@emilbiju are you interested in open-sourcing your work?

benbovy commented 3 years ago

FWIW, a while ago I wrote a mock-up (and probably outdated) DatasetNode class:

https://gist.github.com/benbovy/92e7c76220af1aaa4b3a0b65374e233a (nbviewer link)

tacaswell commented 3 years ago

This is related to some very recent work we have been doing at NSLS-II, primarily led by @danielballan.

OriolAbril commented 3 years ago

Not really sure if there is anything we can do from ArviZ to help with that. If there is, let us know and we'll do our best. cc @percygautam

aurghs commented 3 years ago

@alexamici and I can write the technical part of the proposal.

joshmoore commented 3 years ago

Happy to provide assistance on the image pyramid (i.e. "multiscale") use case.

rabernat commented 3 years ago

So we have:

Numerous promising prototypes to draw from
A technical team who can write the proposal and execute the proposed work (@aurghs & @alexamici of B-open)
Numerous supporting use cases from the bioimaging (@joshmoore), condensed matter (@tacaswell), and bayesian modeling (ArviZ; @OriolAbril) domains

We are just missing a PI, someone who is willing to put their name on top of the proposal and click submit. I have gone on record as committed to not leading any new proposals this year. And in any case, this is a good opportunity for someone else from the @pydata/xarray core dev team to try on a leadership role.

danielballan commented 3 years ago

I volunteer to contribute writing to this from the condensed matter / synchrotron user facility perspective.

dcherian commented 3 years ago

I can shoulder part of the load and help is definitely needed. LOI is due on Tuesday. I'll take a stab this evening and post a link.

OriolAbril commented 3 years ago

Here are some biomedical papers that are using ArviZ and therefore xarray, even if most don't cite xarray and some don't cite ArviZ either. Topics are quite diverse: covid, psychology, biomolecules, oncology...

Some ArviZ recent biomedical citations

* Arroyuelo, A., Vila, J., & Martin, O. A. (2020). Exploring the quality of protein structural models from a Bayesian perspective. bioRxiv.
* Axen, S. D. (2020). Representing Ensembles of Molecules (Doctoral dissertation, UCSF).
* Brauner, J. M., Mindermann, S., Sharma, M., Johnston, D., Salvatier, J., Gavenčiak, T., ... & Kulveit, J. (2021). Inferring the effectiveness of government interventions against COVID-19. Science, 371(6531).
* Busch-Moreno, S., Tuomainen, J., & Vinson, D. (2020). Trait Anxiety Effects on Late Phase Threatening Speech Processing: Evidence from EEG. bioRxiv.
* Busch-Moreno, S., Tuomainen, J., & Vinson, D. (2021). Semantic and prosodic threat processing in trait anxiety: is repetitive thinking influencing responses?. Cognition and Emotion, 35(1), 50-70.
* Dehning, J., Zierenberg, J., Spitzner, F. P., Wibral, M., Neto, J. P., Wilczek, M., & Priesemann, V. (2020). Inferring change points in the spread of COVID-19 reveals the effectiveness of interventions. Science, 369(6500).
* Heilbron, E., Martìn, O., & Fumagalli, E. (2020). Efectos protectores de los alimentos andinos contra el daño producido por el alcohol a nivel del epitelio intestinal, una aproximación estadística. Ciencia, Docencia y Tecnología, 31(61 nov-mar).
* Legrand, N., Nikolova, N., Correa, C., Brændholt, M., Stuckert, A., Kildahl, N., ... & Allen, M. (2021). The heart rate discrimination task: a psychophysical method to estimate the accuracy and precision of interoceptive beliefs. bioRxiv.
* Wang, Y. (2020, September). Data Analysis of Psychological Measurement of Intelligent Internet-assisted Sports Training based on Bio-Sensors. In 2020 International Conference on Smart Electronics and Communication (ICOSEC) (pp. 474-477). IEEE.
* Wasserman, A., Shrager, J., & Shapiro, M. A Multilevel Bayesian Model for Precision Oncology.
* Weindel, G., Anders, R., Alario, F. X., & Burle, B. (2020). Assessing model-based inferences in decision making with single-trial response time decomposition. Journal of Experimental Psychology: General.
* Yamagata, Y. (2020). Simultaneous estimation of the effective reproducing number and the detection rate of COVID-19. arXiv e-prints, arXiv-2005.

shoyer commented 3 years ago

I'm excited to see this coming together! I would be happy to advise as well...

Side note: at some point, this would probably be worth adding to Xarray's official roadmap.

aurghs commented 3 years ago

We could also provide a use-case in remote sensing: it would be really useful in the interferometric processing for managing Sentinel-1 IW and EW SLC data, which has multiple tiles (bursts) partially overlapping in one direction (azimuth).

TomNicholas commented 3 years ago

This sounds like an interesting project - I'm also about to be able to work on xarray much more directly (thanks @rabernat ).

Should I add this as another xarray project board alongside explicit indexes and so on?

I wonder if this could find another domain use case in plasmapy as part of the overall plasma object @StanczakDominik? At the very least this would allow you to store all the various equilibrium and diagnostics information that goes in an EFIT file.

StanczakDominik commented 3 years ago

Whoa, that sounds awesome! Thanks for the heads up :) Definitely could be quite handy, looking forward to seeing how this develops. @rocco8773 this should be interesting for you as well :)

thewtex commented 3 years ago

For scientific imaging, i.e. biomicroscopy, medical imaging, where xarray compatibility is being considered in the NGFF, it would be helpful to avoid unnecessary divergence by ensuring the proposed hierarchical storage is compatible. This would mean:

Each scale / group can be independently treated as an xarray.Dataset.
They are organized in such a way that the collection of scales can be referenced as it is now, i.e. as a collection of paths,

{
  "multiscales": [
    {
      "datasets": [
          {"path": "0"},
          {"path": "1"},
          {"path": "2"},
          {"path": "3"},
          {"path": "4"}
        ],
      "version": "0.1"
    }
  ]
}
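
A minimal sketch of how a reader might walk this kind of metadata to find the group for each scale, using only the standard json module. The scale_paths helper is a hypothetical name, and opening each path as an xarray.Dataset is left out because it depends on the storage backend.

```python
import json

# Metadata in the "multiscales" shape; each path names one scale's group.
metadata = json.loads("""
{
  "multiscales": [
    {
      "datasets": [{"path": "0"}, {"path": "1"}, {"path": "2"}],
      "version": "0.1"
    }
  ]
}
""")

def scale_paths(metadata):
    """Return the group path of every scale level, in the order listed."""
    return [entry["path"] for entry in metadata["multiscales"][0]["datasets"]]

paths = scale_paths(metadata)
```

Each returned path would then be treated as an independent group, which matches the requirement that every scale can be opened on its own.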

joshmoore commented 3 years ago

Picking up on @dcherian's https://github.com/pydata/xarray/issues/4118#issuecomment-806954634 and @rabernat's https://github.com/ome/ngff/issues/48#issuecomment-833456889, Zarr was also accepted to the second round and certainly references this issue in case we want to sync up. (Apologies if I missed where that discussion moved.)

nbercher commented 3 years ago

A simple comment/question:

In xarray.Dataset, why not just use the Unix-path notation into a "flat" dict model?

Actually, netCDF4 implements this Unix-like path access to groups and variables: /path/to/group/variable.

All of the hierarchical stuff (e.g., getting a sub-Dataset from a random group) and conventions (e.g., dimensions scoping rule) would then be driven by the parsing of strings only. It's all about symbolic names (like in a file system, right?) and there would no longer be any hierarchical data in memory.
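
A rough sketch of what such a "flat" model might look like, assuming a plain dict keyed by Unix-style paths. The store contents and the sub_store helper are hypothetical, and strings stand in for arrays.

```python
# Hypothetical flat store: full Unix-style path -> variable.
store = {
    "/weather/wind_speed": "wind_speed data",
    "/weather/temperature/sea_surface": "sst data",
    "/population": "population data",
}

def sub_store(store, group):
    """Return the variables below `group`, keyed by their path relative to it."""
    prefix = group.rstrip("/") + "/"
    return {
        path[len(prefix):]: value
        for path, value in store.items()
        if path.startswith(prefix)
    }

weather = sub_store(store, "/weather")
```

Selecting a group is then just string prefix matching; no nested objects are kept in memory.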

My question is then: Are there some tricky points for xarray.Dataset not to go this simple way?

Some related remarks:

About the attribute access to variables: I don't really know why this exists at all, since it mixes unrelated namespaces: (1) the class internals and (2) the user's variables. Mixing namespaces seems very bad to me: it makes some variable names forbidden in order to avoid any collision between the two namespaces, and it usually implies unnecessarily complex code with corner cases to deal with.
About netCDF4 being a self-described format: xarray API has open_dataset(filepath), but this function is unable to read the whole file in memory without getting help from a priori file content description, i.e., the names of the groups if you follow me. Considering xarray for simple tasks like geographical-selection-cropping, it seems to ignore the self-describing nature of netCDF4 format. As far as I can understand the situation, a "flat" model could be a good way to go.

dcherian commented 3 years ago

cc @d-v-b and https://github.com/JaneliaSciComp/xarray-multiscale

TomNicholas commented 3 years ago

Flagging another possible use case, this time in Magnetic Confinement Fusion: representing the IMAS data model.

IMAS is currently closed-source (being part of the ITER project), but there is a big push to make it open-source and the standard data model for tokamak plasma data.

I'm not very familiar with IMAS (@smithsp and @orso82 are more so), but it is hierarchical. There is some more information in appendix A3 of this paper, which talks about "taking advantage of the homogeneity of grid sizes that is commonly found across arrays of structures", which sounds very closely related to the DataTree proposal.

This might allow the xarray.DataTree to do more of the heavy-lifting within OMAS (which already uses xarray, and is intended to be compatible with IMAS).

shoyer commented 3 years ago

@martinitus raises a really interesting point about tags vs hierarchical structures over in https://github.com/pydata/xarray/issues/1092#issuecomment-868324949

However, one point I didn't see in the discussion is the following:

Hierarchical structures often force a user to come up with some arbitrary order of hierarchy levels. The classical example is document filing: do you put your health insurance documents under /insurance/health/2021, 2021/health/insurance,....?

One solution to that is a tagging of documents instead of putting them into a hierarchy. This would give the full flexibility to retrieve any flat DataSet out of a TaggedDataSet by specifying the set of tags that the individual DataArrays must be listed under.

I think using tags is a really interesting alternative to hierarchies. I don't have a clear sense of the overall tradeoffs, though.

TomNicholas commented 3 years ago

I think using tags is a really interesting alternative to hierarchies. I don't have a clear sense of the overall tradeoffs, though.

That is interesting. I think there is an argument for using a hierarchical model to map onto the full netCDF data model with groups, but perhaps methods to select elements via tags could be included too, for the best of both?

TomNicholas commented 3 years ago

@shoyer if you used tags wouldn't you lose the ability to round-trip a netCDF file with groups? When you read in the groups from the file you would be throwing information away by going from a hierarchy A/B to simply tags A&B, and there wouldn't be a way to restore that before calling .to_netcdf() would there?
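
The information loss can be shown with a tiny sketch. The path_to_tags function is a hypothetical name for the hierarchy-to-tags conversion being discussed.

```python
def path_to_tags(path):
    """Collapse a group path such as 'A/B' into an unordered set of tags."""
    return frozenset(part for part in path.split("/") if part)

# Two different groups in a hypothetical file map to the same tag set,
# so the original nesting order cannot be recovered from the tags alone.
tags_ab = path_to_tags("A/B")
tags_ba = path_to_tags("B/A")
```

Since both paths give the same set, there is no inverse function that restores the original hierarchy before writing the file back out.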

shoyer commented 3 years ago

if you used tags wouldn't you lose the ability to round-trip a netCDF file with groups?

That sounds right to me -- a downside of tags is that they can't be (uniquely) expressed in a hierarchical arrangement like those found in HDF5/netCDF4 files.

But if this is a better way to organize data in memory, we could consider how to make an adapter layer for on disk storage.

TomNicholas commented 3 years ago

Some other thoughts about tags:

1) Does the definition of tags include variable names of DataArrays? I think it should.

2) As @martinitus mentioned, a DataTree containing only leaves with only 1 tag each is effectively a Dataset. I wonder if Dataset could be refactored to be a special case of a more general DataTree, possibly as a subclass?

3) Selecting via tags would need to allow a distinction between "get me all leaves with these exact tags" and "get me all leaves whose tags include these ones". Maybe dt.choose_only(tags) and dt.choose_all(tags)?

4) The latter type of tag-based access would make plotting different leaves against one another easier too - given a multi-resolution (or multi-model) datatree like this:

dt
|-- high_res
|    |-- temperature
|    |-- CO2
|-- medium_res
|    |-- temperature
|    |-- CO2
|-- low_res
|    |-- temperature
|    |-- CO2

then assuming that the definition of tags included the DataArray variable names, then

dt.choose_all('temperature').plot.line(x='time')

would select all leaves with a temperature tag, check that the temperature DataArrays had the same dimensions (but no need for any time coordinates to share size or values), and then plot them against one another on the same axes. This would be so useful - I would say this use case is 90% of the reason users iterate over dictionaries of datasets currently.

5) With a tag-based system you can create cycles of tags, like A&B, B&C, C&A, which you can't really do with hierarchical trees. I don't think that actually causes any problems though...
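
The distinction in point 3 between exact and inclusive tag matching could look something like this sketch. The names choose_only and choose_all come from the suggestion above; the data layout (a dict keyed by frozensets of tags, with strings standing in for DataArrays) is an assumption made purely for illustration.

```python
# Hypothetical leaves: each tag set maps to its data.
leaves = {
    frozenset({"high_res", "temperature"}): "T_high",
    frozenset({"high_res", "CO2"}): "CO2_high",
    frozenset({"low_res", "temperature"}): "T_low",
}

def choose_only(leaves, *tags):
    """Leaves whose tag set is exactly `tags`."""
    wanted = frozenset(tags)
    return [data for leaf_tags, data in leaves.items() if leaf_tags == wanted]

def choose_all(leaves, *tags):
    """Leaves whose tag set includes every one of `tags`."""
    wanted = frozenset(tags)
    return [data for leaf_tags, data in leaves.items() if wanted <= leaf_tags]

all_temperature = choose_all(leaves, "temperature")
exact_match = choose_only(leaves, "high_res", "temperature")
```

Here choose_all with a single 'temperature' tag gathers the temperature leaf from every resolution, which is the selection the plotting example in point 4 relies on.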

tacaswell commented 3 years ago

That sounds right to me -- a downside of tags is that they can't be (uniquely) expressed in a hierarchical arrangement like those found in HDF5/netCDF4 files.

HDF5 allows for internal links, so datasets and groups can appear in multiple places in the tree. You can even make cycles where groups are in themselves (or their children). The NeXus format (the xray/neutron one) makes heavy use of this to let data appear both where it "makes sense" from a science point of view and where it makes sense from an instrumentation point of view.

I think it is reasonable to expect that netcdf -> xarray -> netcdf always works; however, I think it is unreasonable to ask that xarray -> netcdf -> xarray will always work. I think it is OK if xarray can express more complex relationships and structures than you can in netcdf (or hdf5 or any existing at-rest format). In an extreme case, consider an interface to a database that returns xarrays 😈 .

martinitus commented 3 years ago

As a user who (so far) does not use any netCDF or HDF5 features of xarray I obviously would not like to have an otherwise potentially useful feature blocked by restrictions imposed by netCDF or HDF5 ;-).

That said - I think @tacaswell's comment about round trips is very reasonable and such invariants should be maintained! It would be extremely confusing for users if netcdf -> xarray -> netcdf is not a "no-op". The same obviously holds true for any other storage format. As a user I would generally expect something like the following:

a1 = xarray.load("foo.myformat")
xarray.save(a1, "bar.myformat")
a2 = xarray.load("bar.myformat")
assert a1 == a2, "Why should they not be exactly equal?!?"

TomNicholas commented 3 years ago

I think that xarray's current use of both dict-like access and attribute-like access for variables makes representing a general netCDF file in a single DataTree incompatible with the nice syntax that @emilbiju originally suggested.

Consider a tree with a node structure for a hypothetical DataTree object dt that looks something like

DataTree("root")
|-- DatasetNode("weather")
|   |-- DatasetNode("temperature")
|   |   |-- DataArrayNode("sea_surface_temperature")
|   |   |-- DataArrayNode("dew_point_temperature")
|   |-- DataArrayNode("wind_speed")
|-- DataArrayNode("population")

We ideally want to be able to seamlessly access both subtrees and individual variables via chains of keys, e.g. weather_subtree = dt['weather'], and wind_speed_da = dt['weather']['wind_speed']. (We want that so that each subtree behaves as much like an xarray.Dataset as possible, with respect to mapping functions over all its child nodes and so on.)

This particular example is fine, and would correspond to a netCDF file with groups "root", "root/weather", and "root/weather/temperature", plus the four stored DataArray variables.

However, if one of the variables has the same name as one of the groups (which I think is permitted in the netCDF format), then there is no easy way to access all the elements whilst retaining the nice syntax. For example consider

DataTree("root")
|-- DatasetNode("A")
|   |-- DatasetNode("B")
|   |   |-- DataArrayNode("foo")
|   |   |-- DataArrayNode("bar")
|   |-- DataArrayNode("B")
|-- DataArrayNode("C")

Now we have a key collision between the group named "B" and the DataArray named "B", i.e. dt['A']['B'] is ambiguous.

We can't just forbid this type of tree because then there would be netCDF files that we couldn't represent as a DataTree, so we would not have the property netCDF -> xarray.DataTree -> netCDF in general.

We can't use different types of access (e.g. subtree = dt.A.B for the subtree and da = dt.A['B'] for the variable), because we've already given up the .B namespace to also point to the variable (i.e. same location as ['B']). If we break that convention it's going to be very confusing for users who are expecting the root of the DataTree to behave like xarray.Dataset currently does.

(We could divide access through __call__ like ds['A']('B') but that wouldn't be very pythonic).

The only way I can see around this is to hide a node's data variables behind a .ds property (i.e. da = dt['A'].ds['B']), or get groups via a dedicated method (i.e. subtree = dt.get_child('A')), but those are so much more ugly and less intuitive that it feels like a shame to have to do that.
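
For concreteness, the collision and the .ds workaround might be sketched like this. TreeNode is a hypothetical class written for this example, not the DataTree prototype, and plain dicts and integers stand in for Datasets and DataArrays.

```python
class TreeNode:
    """Hypothetical node keeping child groups and data variables in separate namespaces."""

    def __init__(self, children=None, variables=None):
        self.children = dict(children or {})
        self.ds = dict(variables or {})  # stand-in for the node's Dataset

    def __getitem__(self, key):
        in_children, in_vars = key in self.children, key in self.ds
        if in_children and in_vars:
            # The ambiguous case: a group and a variable share one name.
            raise KeyError(f"{key!r} is both a group and a variable; use .children or .ds")
        if in_children:
            return self.children[key]
        return self.ds[key]

# The second example tree: group "B" and variable "B" both live under "A".
a = TreeNode(
    children={"B": TreeNode(variables={"foo": 1, "bar": 2})},
    variables={"B": 3},
)
dt = TreeNode(children={"A": a}, variables={"C": 4})

variable_b = dt["A"].ds["B"]          # explicit access to the variable
group_b = dt["A"].children["B"]       # explicit access to the group
```

Plain dt['A']['B'] raises here, while the two explicit forms always resolve, at the cost of the less pleasant syntax.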

It sounds like @emilbiju avoided this by not satisfying netCDF -> xarray.DataTree -> netCDF:

(Instead of using netCDF4 groups for encoding the Datatree ... within the netCDF file, it would exist just as a Dataset)

so I'm wondering if anyone else has other suggestions or thoughts?

shoyer commented 3 years ago

However, if one of the variables has the same name as one of the groups (which I think is permitted in the netCDF format), then there is no easy way to access all the elements whilst retaining the nice syntax.

NetCDF does not allow variables and groups with the same name, e.g.,

import netCDF4

nc = netCDF4.Dataset('testing.nc', 'w')
nc.createVariable('foo', float)
nc.createGroup('foo')
# RuntimeError: NetCDF: String match to name in use

I'm pretty sure this is also prohibited for all HDF5 files, just like how you can't have a directory and file with the same name on most filesystems.

TomNicholas commented 3 years ago

Oh excellent, thanks for the clarification Stephan!

TomNicholas commented 3 years ago

So I had a crack at making a full DataTree class - you can find it in this repo.

It's based on @benbovy's DatasetNode example - the basic idea is that each tree node wraps a single Dataset. The differences are that this effort:

- Uses a NodeMixin from anytree for the tree structure,
- Implements path-like and tag-like getting and setting,
- Has functions for mapping user-supplied functions over every node in the tree,
- Automatically dispatches xarray.Dataset's API over every node in the tree (such as .isel or __add__),
- Has a bunch of tests,
- Has a printable representation that currently looks like this:

Some limitations of the approach I used are:

- Each dataset in the tree is entirely separate, so doing something like dt.sel(time=50) would require each Dataset in that subtree to have its own coordinate called 'time'. (That's normally useful though because then 'time' can be a different resolution on each ds),
- While you can access nodes via tags, the underlying implementation is in terms of paths, so ('folder1', 'folder2') points to a different node than ('folder2', 'folder1'),
- There's no support for symbolic nodes yet, and I'm unsure if this design can allow for loops or not.

You can create a DataTree object in 3 ways:

1) Load from a netCDF file that has groups via open_datatree(),
2) Using the init method of DataTree, which accepts a nested dictionary of Datasets,
3) Manually create individual nodes with DataNode() and specify their relationships to each other, either by setting .parent and .children attributes, or through __get/setitem__ access, e.g. dt['path/to/node'] = DataNode('node_name', data=xr.Dataset()).
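
To illustrate the second and third routes in general terms, here is a sketch of building a tree from path-like keys and of linking nodes by parent. PathNode and tree_from_paths are hypothetical names and this is not the code from the repo; strings stand in for Datasets.

```python
class PathNode:
    """Hypothetical tree node holding optional data and links to parent/children."""

    def __init__(self, name, data=None, parent=None):
        self.name = name
        self.data = data
        self.parent = None
        self.children = {}
        if parent is not None:
            parent.add_child(self)

    def add_child(self, child):
        """Attach child below this node, keeping both links consistent."""
        child.parent = self
        self.children[child.name] = child

def tree_from_paths(datasets, root_name="root"):
    """Build a tree from {'path/to/node': data}, creating empty nodes as needed."""
    root = PathNode(root_name)
    for path, data in datasets.items():
        node = root
        for part in path.strip("/").split("/"):
            if part not in node.children:
                PathNode(part, parent=node)
            node = node.children[part]
        node.data = data
    return root

tree = tree_from_paths({"weather/wind_speed": "ws", "population": "pop"})

# Manual construction of one more node by naming its parent.
extra = PathNode("pressure", data="p", parent=tree.children["weather"])
```

Intermediate nodes such as weather are created without data, so any node can later be used as the root of its own subtree.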

It's about 70% working, but some things I could do with some help with are: 1) ~Fundamental design questions about the class structure, such as whether DataTree should be a subclass of Dataset?~ 2) ~Getting arithmetic and ufuncs to act properly on the whole tree~, 3) ~Saving a tree to a single netCDF file~, (thanks Joe!) 4) ~Setting up CI and all that jazz~, (thanks Joe again!) 5) ~Setting up basic docs.~

There will definitely be many bugs, but any thoughts or input appreciated!

jhamman commented 3 years ago

Thanks @TomNicholas! I've just been starting to look into this. I'm going to give it a spin and would be happy to help with your numbers 3 and 4.

TomNicholas commented 3 years ago

Thanks @jhamman - expect things to break as I keep realizing certain methods have to be defined differently from in Dataset for things to work.

Help with 3 would be especially appreciated, as at the moment whilst I can open and alter a file with groups, I can't save my resulting tree :sweat_smile:

benbovy commented 3 years ago

Great work @TomNicholas!

For rich/html reprs, I think that we could take much inspiration from some of the dask reprs shown in this blog post.

I haven't looked at your repository in detail yet, but I have one general question about the design: what is the rationale of having two separate classes DataTree and DataNode? Could those classes be merged somehow? To me it seems that any node could be the root of a (sub)tree. h5py has a separate File object that does double duty as the HDF5 root group, and serves as entry point into the file, but in the case of Xarray maybe this is less relevant since Xarray objects are abstract data containers? Zarr (abstract data store) has no such separate class and uses a regular zarr.hierarchy.Group as the root.

TomNicholas commented 3 years ago

Thanks @benbovy !

For rich/html reprs, I think that we could take much inspiration from some of the dask reprs shown in this blog post.

I don't know much about HTML, but graphs where you can mouseover nodes to see node information sound awesome!

what is the rationale of having two separate classes DataTree and DataNode?

They aren't separate: DataNode is merely a (perhaps badly-named) pointer to a second init method for the same DataTree class.

The idea was that creating a single node of a tree by specifying only its (name, dataset, parent, children) attributes was conceptually different to "I have loads of datasets, and I want to arrange them all into one big tree using path-like addresses", so I made two different init methods on DataTree to cover that. The idea was from the xarray.Dataset._construct_direct() classmethod, which creates a new instance of a Dataset by directly setting attributes like (variables, coord_names, dims, attrs). That is an internal classmethod though, and isn't externally exposed like DataNode().

We could just merge the two signatures into one __init__ method though, or use a less confusing name (I just didn't want DataTree._init_single_node(name, data, parent) everywhere in my tests.) Also internally it's nice to have a separate ._init_single_node() method because that's (a) closer to the super().__init__() defined by TreeNode, and (b) doesn't require calling the fairly complex getting and setting methods.

Could those classes be merged somehow?

They were originally separate (I had DataTree and DatasetNode, where the former was a subclass of the latter), but then I merged them together in "Condense DatasetNode and DataTree into a single DataTree class" (#11).

Zarr (abstract data store) has no such separate class and uses a regular zarr.hierarchy.Group as the root.

Good to know that other nested structures took a similar approach. I think that as we want to be able to save and load any subtree even after changing parents etc. then we ideally don't want to treat any one node as special.

TomNicholas commented 2 years ago

We would like some opinions from the community on two different possible models for a tree-like structure in xarray.

A tree contains many groups, but the question is what constraints should be imposed on the contents of those groups.

Option (1) - Each group is a Dataset
- Means that within each group the same restrictions apply as currently do within a single dataset, i.e. each dimension name is only associated with a single length, so there is effectively a common set of dimensions which variables can depend on.
- Can't represent all files, in particular can't represent a filetype where groups are allowed to have variables with inconsistent length dimensions (e.g. Zarr stores allow this as all arrays are independent.)
- Model maps more directly onto netCDF (though still not exactly, because netCDF has dimensions as separate objects)
- This means that sometimes you might need to put variables in adjacent groups in the same level of the tree, when you might rather want them together in the same group.
- Enforcing consistency between variables guarantees certain operations are always well-defined (in particular selection via an integer index like in .isel).
- Guarantees that all valid operations on a Dataset are also valid operations on a single group of a DataTree - so API can be essentially identical to Dataset.
- Metadata (i.e. .attrs) are arguably most useful when set at this level
- Mental model is a (nested) dict of Datasets
- Prototype is DataTree
Option (2) - Variables within groups are unconstrained
- Means that within a single group each Variable can have any dimensions, of any length. There is no requirement that two variables which both depend on a dimension called "x" have to have the same length, one variable can have .sizes['x']=10 and the other have .sizes['x']=20.
- The main advantage of this is that it can represent a wider set of files (including all Zarr stores and a wider set of GRIB files)
- Model maps more directly onto HDF5
- Doesn't enforce the (arguably fairly arbitrary) constraint that if variables have a dimension of the same name, that dimension must also be the same length
- Without consistency, selection becomes ill-defined, but many other operations are fine (e.g. taking .mean())
- Mental model is a (nested) dict of dicts of DataArrays
- Prototype is xarray-DataGroups
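
The difference between the two options can be sketched without xarray at all. In the snippet below a "group" is just a dict mapping a variable name to a `(dims, shape)` pair. That representation, and the helper names `is_valid_dataset_group` and `can_isel`, are assumptions made for illustration and are not part of either prototype.

```python
def is_valid_dataset_group(group):
    """True if every dimension name maps to a single length (Option 1 rule)."""
    sizes = {}
    for dims, shape in group.values():
        for dim, size in zip(dims, shape):
            if sizes.setdefault(dim, size) != size:
                return False
    return True


def can_isel(group, dim, index):
    """For each variable that has `dim`, is the non-negative `index` in range?"""
    return {
        name: index < shape[dims.index(dim)]
        for name, (dims, shape) in group.items()
        if dim in dims
    }


# Two variables that both depend on "x", with different lengths.
group = {"a": (("x",), (10,)), "b": (("x",), (20,))}

option_1_accepts = is_valid_dataset_group(group)  # False: "x" has two lengths
selection = can_isel(group, "x", 15)              # valid for "b" only
```

Under Option (2) the group above is allowed, but an integer selection such as `x=15` is in range for one variable and out of range for the other, which is the sense in which selection becomes ill-defined.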

This is by no means the only question, and we have various choices to make within these options.

The questions for the potential users here are:

Do you have use cases which one of these designs could handle but the other couldn't?
How important to you is being able to support all valid files of these certain formats?
Which of these designs is clearer/more intuitive/more appealing to you?

(@alexamici , @shoyer, @jhamman, @aurghs please edit this comment to add anything I've missed)

mraspaud commented 2 years ago

Thanks for launching this discussion @TomNicholas! I'm a core dev of pytroll/satpy, which handles earth-observing satellite data. I got interested in DataTree because we have data from the same instruments available at multiple resolutions, hence not fitting into a single Dataset. For us, Option 1 probably feels better. Even when having data at multiple resolutions, it is still a limited number of resolutions, and hence splitting them into groups is the natural way to go, I would say. We do not use the features you mention in Zarr or GRIB, as a majority of the satellite data we use is provided in netCDF nowadays. Don't hesitate to ask if you want to know more or if something is unclear; we are really interested in these developments, so if we can help that way...

alexamici commented 2 years ago

@TomNicholas (cc @mraspaud)

Do you have use cases which one of these designs could handle but the other couldn't?

The two main classes of on-disk formats that I know of which cannot always be represented in the "group is a Dataset" approach are:

in netCDF following the CF conventions for groups, it is legal for an array to refer to a dimension or a coordinate in a different group, and so arrays in the same group may have dimensions with the same name but different size / coordinate values (this was the original motivation to explore the DataGroup approach)
the current spec for the Next-generation file formats (NGFF) for bio-imaging has all scales of the same 5D data in the same group. (cc @joshmoore)

I don't have an example at hand, but my impression is that satellite products that use HDF5 file format also place arrays with inconsistent dimensions / coordinates in the same group.
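
A rough sketch of how a reference to a dimension in another group could lead to this situation. This is a simplified reading of group scoping (absolute paths, plus an upward search through ancestor groups) and not the full CF algorithm. The nested-dict layout, the group names and the function `resolve_dim` are all hypothetical.

```python
def resolve_dim(root, group_path, ref):
    """Return the size of dimension `ref` as seen from the group at `group_path`."""

    def get_group(parts):
        group = root
        for part in parts:
            group = group["groups"][part]
        return group

    if ref.startswith("/"):
        # Absolute reference: the last component is the dimension name.
        *group_parts, name = [p for p in ref.split("/") if p]
        return get_group(group_parts)["dims"][name]

    # Relative reference: look in the local group, then in each ancestor.
    parts = [p for p in group_path.split("/") if p]
    for depth in range(len(parts), -1, -1):
        dims = get_group(parts[:depth])["dims"]
        if ref in dims:
            return dims[ref]
    raise KeyError(ref)


root = {
    "dims": {"lat": 20},
    "groups": {
        "obs": {"dims": {"lat": 10}, "groups": {}},
        "model": {"dims": {}, "groups": {}},
    },
}

local_lat = resolve_dim(root, "/obs", "lat")       # defined in /obs itself
root_lat = resolve_dim(root, "/obs", "/lat")       # explicit reference to the root
inherited_lat = resolve_dim(root, "/model", "lat") # falls back to the root
```

In this toy layout two arrays stored in `/obs` could depend on `lat` and on `/lat` respectively, and would then see lengths 10 and 20 for a dimension whose short name is the same.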

shoyer commented 2 years ago

One thing that came up in our discussion about this in the developer meeting today is that we could also pretty easily expose a "low level" API for IO using dictionaries of xarray.Variable objects. This intermediate representation could be useful for cleaning up data into a form suitable for conversion into Dataset objects.
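
A minimal sketch of what such a clean-up step might look like, assuming a flat dictionary of variables as the intermediate representation. `Var` is a hypothetical stand-in for xarray.Variable that carries only dims and shape, and `split_consistent` is an illustrative greedy strategy, not an existing xarray function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """Hypothetical stand-in for xarray.Variable (dims and shape only)."""
    dims: tuple
    shape: tuple


def split_consistent(variables):
    """Greedily partition variables so each dim has one size per group."""
    groups = []  # list of (sizes, members) pairs
    for name, var in variables.items():
        var_sizes = dict(zip(var.dims, var.shape))
        for sizes, members in groups:
            # Join the first group whose known sizes do not conflict.
            if all(sizes.get(d, s) == s for d, s in var_sizes.items()):
                sizes.update(var_sizes)
                members[name] = var
                break
        else:
            groups.append((dict(var_sizes), {name: var}))
    return [members for _, members in groups]


variables = {
    "vis_1km": Var(("y", "x"), (1000, 1000)),
    "ir_1km": Var(("y", "x"), (1000, 1000)),
    "vis_250m": Var(("y", "x"), (4000, 4000)),
    "time": Var(("t",), (5,)),
}
groups = split_consistent(variables)
```

Each resulting group satisfies the "one length per dimension name" rule, so it could in principle be turned into a Dataset. Which group an unconstrained variable such as `time` should land in is a design choice that this greedy sketch settles arbitrarily (first fit).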


alexamici commented 2 years ago

@TomNicholas I also have a few comments on the comparison:

Option (1) - Each group is a Dataset

Model maps more directly onto netCDF (though still not exactly, because netCDF has dimensions as separate objects)

This is only true for flat netCDF files. Once you introduce groups in a netCDF file AND accept the CF conventions, the DataGroup approach can map 100% of the files, while the DataTree approach fails on an (admittedly small) class of them.

Enforcing consistency between variables guarantees certain operations are always well-defined (in particular selection via an integer index like in .isel).

Guarantees that all valid operations on a Dataset are also valid operations on a single group of a DataTree - so API can be essentially identical to Dataset.

Both points are only true for the DataArrays in a single group. Once you broadcast any operation to subgroups, the two implementations would share the same limitations (dimensions in subgroups can be inconsistent in both cases).

In my opinion the advantage for the DataTree is minimal.

Metadata (i.e. .attrs) are arguably most useful when set at this level

The two approaches are identical in this respect: group attributes are mapped in the same way to DataTree and DataGroup.

I share your views on all other points.

kmuehlbauer commented 2 years ago

@alexamici

in netCDF following the CF conventions for groups, it is legal for an array to refer to a dimension or a coordinate in a different group and so arrays in the same group may have dimensions with the same name, but different size / coordinate values, (this was the original motivation to explore the DataGroup approach)

I'm having difficulty understanding your point above with respect to the scoping rules in the CF document. Shouldn't it be impossible to create two arrays (in the same group) having dimensions with exactly the same name from different groups? Looking at the example here https://github.com/alexamici/xarray-datagroup there are coordinates with name "/lat" vs "lat". Aren't those two different names? Maybe I'm missing something essential here.
