Open etienneschalk opened 1 month ago
Reworked issue description with examples from the current implementation
This feature request arises from the following use case: I rely on Dataset.to_dict
to convert Datasets to dicts before serializing them to JSON, and Dataset.from_dict
to then load JSON files back into Datasets.
flowchart
subgraph Serialization
Dataset_in[Dataset] --> dict_in[dict]
dict_in[dict] --> JSON_out[JSON]
end
flowchart
subgraph Deserialization
JSON_in[JSON] --> dict_out[dict]
dict_out[dict] --> Dataset_out[Dataset]
end
JSON can be useful for small datasets, containing configuration values for instance, that should be easy for a human to open and modify directly in a text editor, without any external library or script. xarray's Dataset.from_dict and Dataset.to_dict methods provide an out-of-the-box answer to the question "How do I persist Datasets to JSON and reload them?". Using xarray also avoids storing configuration as "raw JSON", which is often error-prone and lacks structure. xarray provides more structure than raw JSON while keeping the flexibility: there is no need to define multiple schemas, and the only rule to follow is that the JSON must be readable by xarray.
These capabilities already exist for DataArrays and Datasets, but not yet for DataTree. As a result, using xarray to read and write JSON is currently limited to flat structures.
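For reference, the existing Dataset round trip through JSON can be sketched as follows. The variable names and values are made up for illustration; only Dataset.to_dict, Dataset.from_dict and the standard json module are used.

```python
import json

import numpy as np
import xarray as xr

# A small, configuration-like Dataset (illustrative values only).
ds = xr.Dataset(
    data_vars={"threshold": (("station",), np.array([0.5, 1.5, 2.5]))},
    coords={"station": ["a", "b", "c"]},
    attrs={"description": "Small configuration-like dataset"},
)

# Dataset -> dict of native Python objects -> JSON text
text = json.dumps(ds.to_dict(), indent=2)

# JSON text -> dict -> Dataset
restored = xr.Dataset.from_dict(json.loads(text))
```

This works without extra effort here because the Dataset only holds floats and strings; values that JSON cannot represent natively need handling on the user's side.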
I would like DataTree.from_dict and DataTree.to_dict to behave similarly to their Dataset counterparts.
Currently the DataTree.from_dict
method (https://xarray-datatree.readthedocs.io/en/stable/generated/datatree.DataTree.from_dict.html) expects:
A mapping from path names to xarray.Dataset, xarray.DataArray, or DataTree objects.
It means a JSON cannot be reloaded back.
Currently the DataTree.to_dict method does not attempt to "serialize": the keys are paths and the values are instances of xarray-related data structures. I would expect the Datasets to be replaced by their dictified version, following the same philosophy as the existing methods for Dataset. The existing methods are very useful, e.g. for creating test DataTrees, but their behaviour can be extended.
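To make the difference between the two layouts concrete, here is a minimal pure-Python sketch that converts a flat, path-keyed mapping into a nested one linked by "children" keys, and back. The helper names paths_to_nested and nested_to_paths are hypothetical, and the payloads are plain dicts standing in for dictified Datasets; this is not the datatree implementation.

```python
from pathlib import PurePosixPath


def paths_to_nested(flat):
    """Turn {"/a/b": payload, ...} into nested nodes linked by "children"."""
    root = {**flat.get("/", {}), "name": None, "children": {}}
    # Normalise every key to an absolute path, then visit shallow paths first
    # so that a parent listed in `flat` is created before its children.
    items = [(PurePosixPath("/", p), v) for p, v in flat.items() if p != "/"]
    for path, payload in sorted(items, key=lambda item: len(item[0].parts)):
        *parents, leaf = path.parts[1:]
        node = root
        for part in parents:
            # Create an empty placeholder if an intermediate node is missing.
            node = node["children"].setdefault(part, {"name": part, "children": {}})
        node["children"][leaf] = {**payload, "name": leaf, "children": {}}
    return root


def nested_to_paths(node, path="/"):
    """Inverse operation: flatten nested nodes back into a path-keyed dict."""
    payload = {k: v for k, v in node.items() if k not in ("name", "children")}
    flat = {path: payload}
    for name, child in node["children"].items():
        flat.update(nested_to_paths(child, str(PurePosixPath(path) / name)))
    return flat


flat = {
    "/": {"attrs": {"description": "root"}},
    "/weather_data": {"attrs": {}},
    "/weather_data/temperature": {"attrs": {"units": "kelvin"}},
    "/satellite_image": {"attrs": {}},
}
nested = paths_to_nested(flat)
```

The nested form maps directly onto a single JSON document, whereas the flat form needs every value to be serializable on its own.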
The following code example uses dt.DataTree.from_dict, dt.DataTree.to_dict and the proposed dt.DataTree.to_dict_nested. The resulting nested dict is then converted to JSON, and reloaded back to a DataTree with the complementary method DataTree.from_dict_nested.
Build an example DataTree
DataTree('(root)', parent=None)
│ Dimensions: (time: 2)
│ Coordinates:
│ * time (time) datetime64[ns] 16B 2020-12-01 2020-12-02
│ Data variables:
│ *empty*
│ Attributes:
│ description: Root Hypothetical DataTree with heterogeneous data: weather...
├── DataTree('weather_data')
│ │ Dimensions: (station: 6, time: 2)
│ │ Coordinates:
│ │ * station (station) <U1 24B 'a' 'b' 'c' 'd' 'e' 'f'
│ │ Dimensions without coordinates: time
│ │ Data variables:
│ │ wind_speed (time, station) float64 96B 2.0 2.0 2.0 2.0 ... 2.0 2.0 2.0 2.0
│ │ pressure (time, station) float64 96B 3.0 3.0 3.0 3.0 ... 3.0 3.0 3.0 3.0
│ │ Attributes:
│ │ description: Weather data node, inheriting the 'time' dimension
│ └── DataTree('temperature')
│ Dimensions: (time: 2, station: 6)
│ Dimensions without coordinates: time, station
│ Data variables:
│ air_temperature (time, station) float64 96B 3.0 3.0 3.0 3.0 ... 3.0 3.0 3.0
│ dewpoint_temp (time, station) float64 96B 4.0 4.0 4.0 4.0 ... 4.0 4.0 4.0
│ Attributes:
│ description: Temperature, subnode of the weather data node, inheriting t...
└── DataTree('satellite_image')
Dimensions: (x: 3, y: 3, time: 2)
Coordinates:
* x (x) int64 24B 10 20 30
* y (y) int64 24B 90 80 70
Dimensions without coordinates: time
Data variables:
infrared (time, y, x) float64 144B 5.0 5.0 5.0 5.0 ... 5.0 5.0 5.0 5.0
true_color (time, y, x) float64 144B 6.0 6.0 6.0 6.0 ... 6.0 6.0 6.0 6.0
Convert to dict with the existing DataTree.to_dict
method:
xdt.to_dict()
Convert to dict with the proposed DataTree.to_dict_nested
method:
xdt.to_dict_nested()
print(json.dumps(xdt.to_dict_nested(), indent=4, default=str))
(minified version):
{"coords":{"time":{"dims":["time"],"attrs":{"units":"date","long_name":"Time of acquisition"},"data":["2020-12-01 00:00:00","2020-12-02 00:00:00"]}},"attrs":{"description":"Root Hypothetical DataTree with heterogeneous data: weather and satellite"},"dims":{"time":2},"data_vars":{},"name":"(root)","children":{"weather_data":{"coords":{"station":{"dims":["station"],"attrs":{"units":"dl","long_name":"Station of acquisition"},"data":["a","b","c","d","e","f"]}},"attrs":{"description":"Weather data node, inheriting the 'time' dimension"},"dims":{"station":6,"time":2},"data_vars":{"wind_speed":{"dims":["time","station"],"attrs":{"units":"meter/sec","long_name":"Wind speed"},"data":[[2,2,2,2,2,2],[2,2,2,2,2,2]]},"pressure":{"dims":["time","station"],"attrs":{"units":"hectopascals","long_name":"Time of acquisition"},"data":[[3,3,3,3,3,3],[3,3,3,3,3,3]]}},"name":"weather_data","children":{"temperature":{"coords":{},"attrs":{"description":"Temperature, subnode of the weather data node, inheriting the 'time' dimension from root and 'station' dimension from the Temperature group."},"dims":{"time":2,"station":6},"data_vars":{"air_temperature":{"dims":["time","station"],"attrs":{"units":"kelvin","long_name":"Air temperature"},"data":[[3,3,3,3,3,3],[3,3,3,3,3,3]]},"dewpoint_temp":{"dims":["time","station"],"attrs":{"units":"kelvin","long_name":"Dew point temperature"},"data":[[4,4,4,4,4,4],[4,4,4,4,4,4]]}},"name":"temperature","children":{}}}},"satellite_image":{"coords":{"x":{"dims":["x"],"attrs":{},"data":[10,20,30]},"y":{"dims":["y"],"attrs":{},"data":[90,80,70]}},"attrs":{},"dims":{"x":3,"y":3,"time":2},"data_vars":{"infrared":{"dims":["time","y","x"],"attrs":{},"data":[[[5,5,5],[5,5,5],[5,5,5]],[[5,5,5],[5,5,5],[5,5,5]]]},"true_color":{"dims":["time","y","x"],"attrs":{},"data":[[[6,6,6],[6,6,6],[6,6,6]],[[6,6,6],[6,6,6],[6,6,6]]]}},"name":"satellite_image","children":{}}}}
Load back to DataTree:
dt.DataTree.from_dict_nested(json.loads(json.dumps(xdt.to_dict_nested(), indent=4, default=str)))
DataTree('(root)', parent=None)
│ Dimensions: (time: 2)
│ Coordinates:
│ * time (time) <U19 152B '2020-12-01 00:00:00' '2020-12-02 00:00:00'
│ Data variables:
│ *empty*
│ Attributes:
│ description: Root Hypothetical DataTree with heterogeneous data: weather...
├── DataTree('weather_data')
│ │ Dimensions: (station: 6, time: 2)
│ │ Coordinates:
│ │ * station (station) <U1 24B 'a' 'b' 'c' 'd' 'e' 'f'
│ │ Dimensions without coordinates: time
│ │ Data variables:
│ │ wind_speed (time, station) float64 96B 2.0 2.0 2.0 2.0 ... 2.0 2.0 2.0 2.0
│ │ pressure (time, station) float64 96B 3.0 3.0 3.0 3.0 ... 3.0 3.0 3.0 3.0
│ │ Attributes:
│ │ description: Weather data node, inheriting the 'time' dimension
│ └── DataTree('temperature')
│ Dimensions: (time: 2, station: 6)
│ Dimensions without coordinates: time, station
│ Data variables:
│ air_temperature (time, station) float64 96B 3.0 3.0 3.0 3.0 ... 3.0 3.0 3.0
│ dewpoint_temp (time, station) float64 96B 4.0 4.0 4.0 4.0 ... 4.0 4.0 4.0
│ Attributes:
│ description: Temperature, subnode of the weather data node, inheriting t...
└── DataTree('satellite_image')
Dimensions: (x: 3, y: 3, time: 2)
Coordinates:
* x (x) int64 24B 10 20 30
* y (y) int64 24B 90 80 70
Dimensions without coordinates: time
Data variables:
infrared (time, y, x) float64 144B 5.0 5.0 5.0 5.0 ... 5.0 5.0 5.0 5.0
true_color (time, y, x) float64 144B 6.0 6.0 6.0 6.0 ... 6.0 6.0 6.0 6.0
Remark: the time coordinate is downgraded to str because datetime values are not JSON serializable. This feature request only covers the DataTree -> dict and dict -> DataTree conversions. Any serialization that JSON does not support by default is the user's responsibility to handle (as is currently the case with Dataset.to_dict and Dataset.from_dict).
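As an illustration of what "the user's responsibility" can look like, here is a small standard-library sketch of the datetime case. It assumes the dictified coordinate holds datetime.datetime objects, which is consistent with the strings produced by default=str in the JSON above; the time_coord dict is hand-written, not produced by xarray.

```python
import json
from datetime import datetime

# Assumed shape of one dictified coordinate holding datetime objects.
time_coord = {
    "dims": ["time"],
    "attrs": {"long_name": "Time of acquisition"},
    "data": [datetime(2020, 12, 1), datetime(2020, 12, 2)],
}

# default=str is what turns the datetimes into plain strings in the JSON text.
text = json.dumps(time_coord, default=str)
loaded = json.loads(text)

# User-side repair after loading: parse the strings back into datetimes.
loaded["data"] = [datetime.fromisoformat(s) for s in loaded["data"]]
```

Without the last step, the reloaded coordinate stays a string array, as in the `<U19` output shown above.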
However, the round-trip DataTree -> dict -> DataTree
is guaranteed:
dt.DataTree.from_dict_nested(xdt.to_dict_nested())
DataTree('(root)', parent=None)
│ Dimensions: (time: 2)
│ Coordinates:
│ * time (time) datetime64[ns] 16B 2020-12-01 2020-12-02
│ Data variables:
│ *empty*
│ Attributes:
│ description: Root Hypothetical DataTree with heterogeneous data: weather...
├── DataTree('weather_data')
│ │ Dimensions: (station: 6, time: 2)
│ │ Coordinates:
│ │ * station (station) <U1 24B 'a' 'b' 'c' 'd' 'e' 'f'
│ │ Dimensions without coordinates: time
│ │ Data variables:
│ │ wind_speed (time, station) float64 96B 2.0 2.0 2.0 2.0 ... 2.0 2.0 2.0 2.0
│ │ pressure (time, station) float64 96B 3.0 3.0 3.0 3.0 ... 3.0 3.0 3.0 3.0
│ │ Attributes:
│ │ description: Weather data node, inheriting the 'time' dimension
│ └── DataTree('temperature')
│ Dimensions: (time: 2, station: 6)
│ Dimensions without coordinates: time, station
│ Data variables:
│ air_temperature (time, station) float64 96B 3.0 3.0 3.0 3.0 ... 3.0 3.0 3.0
│ dewpoint_temp (time, station) float64 96B 4.0 4.0 4.0 4.0 ... 4.0 4.0 4.0
│ Attributes:
│ description: Temperature, subnode of the weather data node, inheriting t...
└── DataTree('satellite_image')
Dimensions: (x: 3, y: 3, time: 2)
Coordinates:
* x (x) int64 24B 10 20 30
* y (y) int64 24B 90 80 70
Dimensions without coordinates: time
Data variables:
infrared (time, y, x) float64 144B 5.0 5.0 5.0 5.0 ... 5.0 5.0 5.0 5.0
true_color (time, y, x) float64 144B 6.0 6.0 6.0 6.0 ... 6.0 6.0 6.0 6.0
Thanks for raising this @etienneschalk !
I agree that methods for going to/from JSON would be generally useful, and that the methods should be consistent across Dataset
/DataTree
, but the .to_dict
and .from_dict
methods on DataTree
are quite important and IMO shouldn't be changed. (They matter because many internal operations are currently implemented by first turning the DataTree
into a dict, manipulating it, then turning the altered dict back into a DataTree
.)
Instead I suggest we simply add new methods for your use case: .to_json_dict
(or perhaps some other name) and .from_json_dict
. The existing .to_dict
method on Dataset
should be aliased to point to the new method, with a deprecation warning raised.
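A minimal sketch of that aliasing pattern, using a hypothetical stand-in class rather than xarray's Dataset; to_json_dict is only the tentative name suggested here, not an existing method.

```python
import warnings


class Dataset:
    """Hypothetical stand-in, not xarray's Dataset."""

    def __init__(self, data_vars):
        self._data_vars = dict(data_vars)

    def to_json_dict(self):
        """Tentative new name; returns native Python containers only."""
        return {"data_vars": {k: list(v) for k, v in self._data_vars.items()}}

    def to_dict(self):
        """Old name kept as an alias that warns, then forwards."""
        warnings.warn(
            "to_dict is deprecated; use to_json_dict instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.to_json_dict()


ds = Dataset({"wind_speed": (2.0, 2.0)})
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy = ds.to_dict()
```

Existing callers keep working during the deprecation period, and the warning points them to the new name.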
Another option might be to add a json=True
kwarg to .to_dict
or similar to switch between the two behaviours.
Hello @TomNicholas ,
I kept the existing to_dict and from_dict methods of datatree and renamed them to to_paths_dict and from_paths_dict, as they are mappings of string paths to xarray's data structures; they can still be used in the internal code.
The existing Dataset.to_dict removes any trace of xarray's data structures and converts them to native Python data structures (dicts and lists), which are more easily serialized to JSON. My implementation relies heavily on reusing the Dataset.to_dict method itself, so the added logic is pretty light.
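To show how little logic is needed on top of Dataset.to_dict, here is a rough sketch under stated assumptions: Node is a hypothetical stand-in for a tree node (one Dataset plus named children), and to_nested / from_nested are illustrative helper names, not the methods of the actual implementation.

```python
from __future__ import annotations

from dataclasses import dataclass, field

import xarray as xr


@dataclass
class Node:
    """Hypothetical stand-in for a tree node: one Dataset plus named children."""

    dataset: xr.Dataset
    children: dict[str, Node] = field(default_factory=dict)


def to_nested(node: Node, name: str | None = None) -> dict:
    """Dictify the node's Dataset, then attach the dictified children."""
    out = node.dataset.to_dict()
    out["name"] = name
    out["children"] = {k: to_nested(child, k) for k, child in node.children.items()}
    return out


def from_nested(d: dict) -> Node:
    """Inverse of to_nested: rebuild each Dataset with Dataset.from_dict."""
    payload = {k: v for k, v in d.items() if k not in ("name", "children")}
    children = {k: from_nested(child) for k, child in d.get("children", {}).items()}
    return Node(xr.Dataset.from_dict(payload), children)


tree = Node(
    xr.Dataset(attrs={"description": "root"}),
    {
        "weather_data": Node(
            xr.Dataset({"wind_speed": (("time", "station"), [[2.0, 2.0], [2.0, 2.0]])})
        )
    },
)
restored = from_nested(to_nested(tree))
```

All per-node work is delegated to the Dataset methods; the tree-specific part is only the recursion over children.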
Rather than renaming the existing older Dataset.to_dict
method, would it be possible to make a change of API in datatree while it is still not yet fully integrated into xarray and changes like this are more acceptable?
Regarding a switch, the only issue I see with an argument like json=True
would be for typing: the to_dict
method would now return a union of two types, and this can be annoying for users (the burden of type narrowing is passed onto the user).
Regarding
Another option might be to add a json=True kwarg to .to_dict or similar to switch between the two behaviours.
I saw it is possible to define the return type of a function based on a boolean flag (https://github.com/python/mypy/issues/8634), so it might be possible to have both behaviours under the same function name, only changing the flag. The default would remain the existing behaviour of datatree's from_dict and to_dict since it is already in use. I can propose native as a flag name, as it really converts xarray data structures to native Python ones, which are more easily serialized to JSON (but it does not produce JSON directly).
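The flag-dependent return type can be sketched with typing.overload and Literal. Node and Tree below are hypothetical stand-ins, not datatree classes, and native is only the flag name proposed in this comment.

```python
from __future__ import annotations

from typing import Any, Literal, overload


class Node:
    """Hypothetical stand-in for an xarray object stored in the tree."""

    def __init__(self, values: list[float]) -> None:
        self.values = values

    def to_dict(self) -> dict[str, Any]:
        return {"data": list(self.values)}


class Tree:
    """Hypothetical stand-in for DataTree; `native` is the proposed flag name."""

    def __init__(self, nodes: dict[str, Node]) -> None:
        self._nodes = nodes

    @overload
    def to_dict(self, native: Literal[False] = ...) -> dict[str, Node]: ...
    @overload
    def to_dict(self, native: Literal[True]) -> dict[str, dict[str, Any]]: ...

    def to_dict(self, native: bool = False) -> dict[str, Any]:
        if native:
            # Native Python containers only, ready for json.dumps.
            return {path: node.to_dict() for path, node in self._nodes.items()}
        # Existing behaviour: paths mapped to the objects themselves.
        return dict(self._nodes)


tree = Tree({"/": Node([1.0, 2.0])})
as_objects = tree.to_dict()            # a type checker infers dict[str, Node]
as_native = tree.to_dict(native=True)  # a type checker infers dict[str, dict[str, Any]]
```

Callers passing a literal True or False get a precise type; a caller passing a plain bool variable would need a third overload returning the union.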
Edit: I'm fine with just having to_native_dict
and from_native_dict
.
Is your feature request related to a problem?
This feature request arises from a "real-life" use case: I rely on Dataset.from_dict and Dataset.to_dict to convert Datasets to a dict before serializing them to JSON, and then to load the JSON back into a Dataset with xarray.
JSON can be useful for small datasets, containing configuration with small values, that should be easy for a human to open and modify directly in a text editor, without using any library or script. Using xarray provides benefits, as it answers questions like "how do I represent an array with coordinates in JSON": there is no need to reinvent super-languages above JSON when the xarray serialization already does the job.
However, these capabilities do not exist (yet) for DataTree. It means that this "magic" method of using xarray as a way to dump to JSON is limited to flat structures.
Describe the solution you'd like
I would like DataTree.from_dict and DataTree.to_dict to behave similarly to their Dataset counterparts.
Currently the DataTree.from_dict method (https://xarray-datatree.readthedocs.io/en/stable/generated/datatree.DataTree.from_dict.html) expects "A mapping from path names to xarray.Dataset, xarray.DataArray, or DataTree objects." It means a JSON cannot be reloaded back.
Currently the DataTree.to_dict method does not attempt to "serialize": the keys are paths and the values are instances of Datasets. I would expect the Datasets to be replaced by their dictified version.
The solution I would like would look more like this:
Describe alternatives you've considered
Until now, I have been storing PurePosixPath-like variable names in Datasets. This helps organize the configuration data, but it loses the benefit of the scoped dimension names that DataTree provides.
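As a rough illustration of this workaround, the sketch below groups path-like variable names by their parent path using only the standard library. The variable names and values are invented, and a plain dict stands in for the Dataset's variables.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Flat mapping with path-like variable names (illustrative values only).
variables = {
    "weather/wind_speed": [2.0, 2.0],
    "weather/pressure": [3.0, 3.0],
    "satellite/infrared": [5.0, 5.0],
}

# Group variables by the parent part of their path-like name.
groups = defaultdict(dict)
for full_name, values in variables.items():
    path = PurePosixPath(full_name)
    groups[str(path.parent)][path.name] = values
```

The grouping only exists in the names: in a single flat Dataset all variables still share one namespace of dimension names, which is the limitation described above.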
Note: I did not want to add any custom parsing logic of my own, which would be non-standard and potentially brittle. The whole point of from_dict and to_dict, as I use them, is to be "universal one-liners": a guarantee that another xarray user can easily read the JSON I produced without writing new parsing logic of their own.
Example:
Root-level attrs are lost but can be added again.
Additional context
No response