pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

DataTree: Align `from_dict` and `to_dict` behaviours to their Dataset equivalents #9074

Open etienneschalk opened 1 month ago

etienneschalk commented 1 month ago

Is your feature request related to a problem?

This feature request arises from a "real-life" use case: I rely on Dataset.to_dict to convert Datasets to dicts before serializing them to JSON, and on Dataset.from_dict to load the JSON back into a Dataset with xarray.

JSON can be useful for small datasets containing configuration with small values, which should be easy for a human to open and modify directly in a text editor, without using any library or script. Using xarray provides benefits, as it answers questions like "how do I represent an array with coordinates in JSON": there is no need to reinvent super-languages on top of JSON when the xarray serialization already does the job.

However, these capabilities do not exist (yet) for DataTree. It means that this "magic" method of using xarray as a way to dump to JSON is limited to flat structures.

Describe the solution you'd like

I would like the DataTree.from_dict and DataTree.to_dict to have a similar behaviour as their Dataset counterparts.

Currently the DataTree.from_dict method (https://xarray-datatree.readthedocs.io/en/stable/generated/datatree.DataTree.from_dict.html) expects a mapping from path names to xarray.Dataset, xarray.DataArray, or DataTree objects. This means that a dict loaded from JSON cannot be turned back into a DataTree.

Currently the DataTree.to_dict method does not attempt to "serialize": the keys are paths and the values are Dataset instances. I would expect the Datasets to be replaced by their dictified versions.

In [49]: xdt.to_dict()
Out[49]: 
{'/': <xarray.Dataset> 0B
 Dimensions:  ()
 Data variables:
     *empty*
 Attributes:
     top_level_attr:  Ho,
 '/parent': <xarray.Dataset> 96B
 Dimensions:  (dim_one: 3, dim_two: 2)
 Coordinates:
   * dim_one  (dim_one) int64 24B 10 20 30
 Dimensions without coordinates: dim_two
 Data variables:
     child_1  (dim_one) int64 24B 1 2 3
     child_2  (dim_two, dim_one) int64 48B 5 6 9 7 8 0}

The solution I would like looks more like this:

In [55]: datatree_dict = {path: xds.to_dict() for path, xds in xdt.to_dict().items()}

In [56]: datatree_dict
Out[56]: 
{'/': {'coords': {},
  'attrs': {'top_level_attr': 'Ho'},
  'dims': {},
  'data_vars': {}},
 '/parent': {'coords': {'dim_one': {'dims': ('dim_one',),
    'attrs': {},
    'data': [10, 20, 30]}},
  'attrs': {},
  'dims': {'dim_one': 3, 'dim_two': 2},
  'data_vars': {'child_1': {'dims': ('dim_one',),
    'attrs': {},
    'data': [1, 2, 3]},
   'child_2': {'dims': ('dim_two', 'dim_one'),
    'attrs': {'units': 'm', 'long_name': 'Hey'},
    'data': [[5, 6, 9], [7, 8, 0]]}}}}

In [58]: print(json.dumps(datatree_dict, indent=4))
{
    "/": {
        "coords": {},
        "attrs": {
            "top_level_attr": "Ho"
        },
        "dims": {},
        "data_vars": {}
    },
    "/parent": {
        "coords": {
            "dim_one": {
                "dims": [
                    "dim_one"
                ],
                "attrs": {},
                "data": [
                    10,
                    20,
                    30
                ]
            }
        },
        "attrs": {},
        "dims": {
            "dim_one": 3,
            "dim_two": 2
        },
        "data_vars": {
            "child_1": {
                "dims": [
                    "dim_one"
                ],
                "attrs": {},
                "data": [
                    1,
                    2,
                    3
                ]
            },
            "child_2": {
                "dims": [
                    "dim_two",
                    "dim_one"
                ],
                "attrs": {
                    "units": "m",
                    "long_name": "Hey"
                },
                "data": [
                    [
                        5,
                        6,
                        9
                    ],
                    [
                        7,
                        8,
                        0
                    ]
                ]
            }
        }
    }
}
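
One detail visible in the two outputs above is that JSON has no tuple type, so the `dims` tuples come back as lists after a round trip. The following is a minimal stdlib-only sketch of that behaviour. It uses a hand-written dictionary that mimics part of the `/parent` entry above, involves no xarray call, and the helper name `json_round_trip` is an assumption made for illustration.

```python
import json

# Hand-written stand-in for one entry of the path-keyed dictionary above.
datatree_dict = {
    "/parent": {
        "coords": {
            "dim_one": {"dims": ("dim_one",), "attrs": {}, "data": [10, 20, 30]}
        },
        "attrs": {},
        "dims": {"dim_one": 3, "dim_two": 2},
        "data_vars": {
            "child_1": {"dims": ("dim_one",), "attrs": {}, "data": [1, 2, 3]},
        },
    }
}


def json_round_trip(obj):
    """Serialize to a JSON string and parse it back into Python objects."""
    return json.loads(json.dumps(obj))


reloaded = json_round_trip(datatree_dict)

# The values survive, but tuples come back as lists.
original_dims = datatree_dict["/parent"]["data_vars"]["child_1"]["dims"]
reloaded_dims = reloaded["/parent"]["data_vars"]["child_1"]["dims"]
print(type(original_dims).__name__, type(reloaded_dims).__name__)  # tuple list
```

The list form is what appears in the JSON output printed above.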

Describe alternatives you've considered

Until now, I have been storing PurePosixPath-like variable names in Datasets. This helps organize the configuration data; however, it loses the benefit of the scoped dimension names that DataTree provides.

Note: I did not want to add any custom parsing logic of my own, which would be non-standard and potentially fragile. The whole point of from_dict and to_dict, as I use them, is to be "universal one-liners": a guarantee that another xarray user can easily read the JSON I produced without writing new parsing logic themselves.

Example:


In [31]: xds = xr.Dataset({'parent/child_1': xr.DataArray([1,2,3], coords={"dim_one": [10,20,30]}), "parent/child_2": xr.DataArray([[5,6,9],[7,8,0]], di
    ...: ms=("dim_two", "dim_one"), attrs={"units": "m", "long_name": "Hey"})}, attrs={"top_level_attr": "Ho"})

In [32]: xds.to_dict()
Out[32]: 
{'coords': {'dim_one': {'dims': ('dim_one',),
   'attrs': {},
   'data': [10, 20, 30]}},
 'attrs': {'top_level_attr': 'Ho'},
 'dims': {'dim_one': 3, 'dim_two': 2},
 'data_vars': {'parent/child_1': {'dims': ('dim_one',),
   'attrs': {},
   'data': [1, 2, 3]},
  'parent/child_2': {'dims': ('dim_two', 'dim_one'),
   'attrs': {'units': 'm', 'long_name': 'Hey'},
   'data': [[5, 6, 9], [7, 8, 0]]}}}

In [33]: print(json.dumps(xds.to_dict(), indent=4))
{
    "coords": {
        "dim_one": {
            "dims": [
                "dim_one"
            ],
            "attrs": {},
            "data": [
                10,
                20,
                30
            ]
        }
    },
    "attrs": {
        "top_level_attr": "Ho"
    },
    "dims": {
        "dim_one": 3,
        "dim_two": 2
    },
    "data_vars": {
        "parent/child_1": {
            "dims": [
                "dim_one"
            ],
            "attrs": {},
            "data": [
                1,
                2,
                3
            ]
        },
        "parent/child_2": {
            "dims": [
                "dim_two",
                "dim_one"
            ],
            "attrs": {
                "units": "m",
                "long_name": "Hey"
            },
            "data": [
                [
                    5,
                    6,
                    9
                ],
                [
                    7,
                    8,
                    0
                ]
            ]
        }
    }
}

In [41]: reloaded = xr.Dataset.from_dict(json.loads(json.dumps(xds.to_dict(), indent=4)))

In [42]: reloaded
Out[42]: 
<xarray.Dataset> 96B
Dimensions:         (dim_one: 3, dim_two: 2)
Coordinates:
  * dim_one         (dim_one) int64 24B 10 20 30
Dimensions without coordinates: dim_two
Data variables:
    parent/child_1  (dim_one) int64 24B 1 2 3
    parent/child_2  (dim_two, dim_one) int64 48B 5 6 9 7 8 0
Attributes:
    top_level_attr:  Ho

In [43]: import xarray.core.datatree as dt

In [44]: xdt = dt.DataTree()

In [45]: for varname in reloaded: xdt[varname] = reloaded[varname]

In [46]: xdt
Out[46]: 
DataTree('None', parent=None)
└── DataTree('parent')
        Dimensions:  (dim_one: 3, dim_two: 2)
        Coordinates:
          * dim_one  (dim_one) int64 24B 10 20 30
        Dimensions without coordinates: dim_two
        Data variables:
            child_1  (dim_one) int64 24B 1 2 3
            child_2  (dim_two, dim_one) int64 48B 5 6 9 7 8 0

Root-level attrs are lost but can be added again.

In [47]: xdt.attrs.update(xds.attrs)

In [48]: xdt
Out[48]: 
DataTree('None', parent=None)
│   Dimensions:  ()
│   Data variables:
│       *empty*
│   Attributes:
│       top_level_attr:  Ho
└── DataTree('parent')
        Dimensions:  (dim_one: 3, dim_two: 2)
        Coordinates:
          * dim_one  (dim_one) int64 24B 10 20 30
        Dimensions without coordinates: dim_two
        Data variables:
            child_1  (dim_one) int64 24B 1 2 3
            child_2  (dim_two, dim_one) int64 48B 5 6 9 7 8 0
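
The regrouping performed by hand above (splitting `parent/child_1` into a group `parent` and a variable `child_1`) can also be sketched on plain dictionaries. This is only an illustration of the workaround, not part of xarray: the function name `group_by_parent` is hypothetical, and the input mimics the `data_vars` section of the `Dataset.to_dict()` output shown earlier.

```python
from pathlib import PurePosixPath


def group_by_parent(data_vars):
    """Group path-like variable names by their parent path."""
    groups = {}
    for full_name, variable in data_vars.items():
        # Anchor the name at the root so that "a/b" becomes "/a/b".
        path = PurePosixPath("/") / full_name
        groups.setdefault(str(path.parent), {})[path.name] = variable
    return groups


data_vars = {
    "parent/child_1": {"dims": ["dim_one"], "attrs": {}, "data": [1, 2, 3]},
    "parent/child_2": {
        "dims": ["dim_two", "dim_one"],
        "attrs": {"units": "m", "long_name": "Hey"},
        "data": [[5, 6, 9], [7, 8, 0]],
    },
}

grouped = group_by_parent(data_vars)
print(sorted(grouped))             # ['/parent']
print(sorted(grouped["/parent"]))  # ['child_1', 'child_2']
```

This kind of hand-written logic is exactly what the feature request aims to make unnecessary.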

Additional context

No response

etienneschalk commented 4 weeks ago

Reworked issue description with examples from the current implementation

Is your feature request related to a problem?

This feature request arises from the following use case: I rely on Dataset.to_dict to convert Datasets to dicts before serializing them to JSON, and on Dataset.from_dict to then load the JSON files back into Datasets.

flowchart 
subgraph Serialization
  Dataset_in[Dataset] --> dict_in[dict] 
  dict_in[dict] --> JSON_out[JSON]
  end
flowchart
subgraph Deserialization 
  JSON_in[JSON] --> dict_out[dict]  
  dict_out[dict]  --> Dataset_out[Dataset]
  end

JSON can be useful for small datasets, containing configuration values for instance, which should be easy for a human to open and modify directly in a text editor, without using any external library or script. xarray's Dataset.from_dict and Dataset.to_dict methods provide an out-of-the-box answer to the question "How do I persist Datasets to JSON and reload them?". Using xarray also avoids storing configuration as "raw JSON", which is often error-prone and lacks structure. xarray thus provides more structure than raw JSON while keeping the flexibility: there is no need to define multiple schemas, and the only rule to follow is that the JSON must be readable by xarray.

While these capabilities already exist for DataArrays and Datasets, they do not exist yet for DataTree. This means that, currently, using xarray to read and write JSON is limited to flat structures.

Describe the solution you'd like

I would like the DataTree.from_dict and DataTree.to_dict to have a similar behaviour as their Dataset counterparts.

Currently the DataTree.from_dict method (https://xarray-datatree.readthedocs.io/en/stable/generated/datatree.DataTree.from_dict.html) expects:

A mapping from path names to xarray.Dataset, xarray.DataArray, or DataTree objects.

It means a JSON cannot be reloaded back.

Currently the DataTree.to_dict method does not attempt to "serialize": the keys are paths and the values are instances of xarray data structures. I would expect the Datasets to be replaced by their dictified versions, following the same philosophy as the existing methods for Dataset. The existing DataTree methods are very useful, e.g. for creating test DataTrees, but their behaviour can be extended.

In the following code example:

Build an example DataTree

```python
import pandas as pd
import numpy as np
import xarray as xr
from xarray.core import datatree as dt

xdt = dt.DataTree.from_dict(
    name="(root)",
    d={
        "/": xr.Dataset(
            coords={
                "time": xr.DataArray(
                    data=pd.date_range(start="2020-12-01", end="2020-12-02", freq="D")[
                        :2
                    ],
                    dims="time",
                    attrs={
                        "units": "date",
                        "long_name": "Time of acquisition",
                    },
                )
            },
            attrs={
                "description": "Root Hypothetical DataTree with heterogeneous data: weather and satellite"
            },
        ),
        "/weather_data": xr.Dataset(
            coords={
                "station": xr.DataArray(
                    data=list("abcdef"),
                    dims="station",
                    attrs={
                        "units": "dl",
                        "long_name": "Station of acquisition",
                    },
                )
            },
            data_vars={
                "wind_speed": xr.DataArray(
                    np.ones((2, 6)) * 2,
                    dims=("time", "station"),
                    attrs={
                        "units": "meter/sec",
                        "long_name": "Wind speed",
                    },
                ),
                "pressure": xr.DataArray(
                    np.ones((2, 6)) * 3,
                    dims=("time", "station"),
                    attrs={
                        "units": "hectopascals",
                        "long_name": "Time of acquisition",
                    },
                ),
            },
            attrs={"description": "Weather data node, inheriting the 'time' dimension"},
        ),
        "/weather_data/temperature": xr.Dataset(
            data_vars={
                "air_temperature": xr.DataArray(
                    np.ones((2, 6)) * 3,
                    dims=("time", "station"),
                    attrs={
                        "units": "kelvin",
                        "long_name": "Air temperature",
                    },
                ),
                "dewpoint_temp": xr.DataArray(
                    np.ones((2, 6)) * 4,
                    dims=("time", "station"),
                    attrs={
                        "units": "kelvin",
                        "long_name": "Dew point temperature",
                    },
                ),
            },
            attrs={
                "description": (
                    "Temperature, subnode of the weather data node, "
                    "inheriting the 'time' dimension from root and 'station' "
                    "dimension from the Temperature group."
                )
            },
        ),
        "/satellite_image": xr.Dataset(
            coords={"x": [10, 20, 30], "y": [90, 80, 70]},
            data_vars={
                "infrared": xr.DataArray(
                    np.ones((2, 3, 3)) * 5, dims=("time", "y", "x")
                ),
                "true_color": xr.DataArray(
                    np.ones((2, 3, 3)) * 6, dims=("time", "y", "x")
                ),
            },
        ),
    },
)

print(xdt)
```
DataTree('(root)', parent=None)
│   Dimensions:  (time: 2)
│   Coordinates:
│     * time     (time) datetime64[ns] 16B 2020-12-01 2020-12-02
│   Data variables:
│       *empty*
│   Attributes:
│       description:  Root Hypothetical DataTree with heterogeneous data: weather...
├── DataTree('weather_data')
│   │   Dimensions:     (station: 6, time: 2)
│   │   Coordinates:
│   │     * station     (station) <U1 24B 'a' 'b' 'c' 'd' 'e' 'f'
│   │   Dimensions without coordinates: time
│   │   Data variables:
│   │       wind_speed  (time, station) float64 96B 2.0 2.0 2.0 2.0 ... 2.0 2.0 2.0 2.0
│   │       pressure    (time, station) float64 96B 3.0 3.0 3.0 3.0 ... 3.0 3.0 3.0 3.0
│   │   Attributes:
│   │       description:  Weather data node, inheriting the 'time' dimension
│   └── DataTree('temperature')
│           Dimensions:          (time: 2, station: 6)
│           Dimensions without coordinates: time, station
│           Data variables:
│               air_temperature  (time, station) float64 96B 3.0 3.0 3.0 3.0 ... 3.0 3.0 3.0
│               dewpoint_temp    (time, station) float64 96B 4.0 4.0 4.0 4.0 ... 4.0 4.0 4.0
│           Attributes:
│               description:  Temperature, subnode of the weather data node, inheriting t...
└── DataTree('satellite_image')
        Dimensions:     (x: 3, y: 3, time: 2)
        Coordinates:
          * x           (x) int64 24B 10 20 30
          * y           (y) int64 24B 90 80 70
        Dimensions without coordinates: time
        Data variables:
            infrared    (time, y, x) float64 144B 5.0 5.0 5.0 5.0 ... 5.0 5.0 5.0 5.0
            true_color  (time, y, x) float64 144B 6.0 6.0 6.0 6.0 ... 6.0 6.0 6.0 6.0

Convert to dict with the existing DataTree.to_dict method:

xdt.to_dict()
```python
{'/': <xarray.Dataset> Size: 16B
 Dimensions:  (time: 2)
 Coordinates:
   * time     (time) datetime64[ns] 16B 2020-12-01 2020-12-02
 Data variables:
     *empty*
 Attributes:
     description:  Root Hypothetical DataTree with heterogeneous data: weather...,
 '/weather_data': <xarray.Dataset> Size: 216B
 Dimensions:     (station: 6, time: 2)
 Coordinates:
   * station     (station) <U1 24B 'a' 'b' 'c' 'd' 'e' 'f'
 Dimensions without coordinates: time
 Data variables:
     wind_speed  (time, station) float64 96B 2.0 2.0 2.0 2.0 ... 2.0 2.0 2.0 2.0
     pressure    (time, station) float64 96B 3.0 3.0 3.0 3.0 ... 3.0 3.0 3.0 3.0
 Attributes:
     description:  Weather data node, inheriting the 'time' dimension,
 '/satellite_image': <xarray.Dataset> Size: 336B
 Dimensions:     (x: 3, y: 3, time: 2)
 Coordinates:
   * x           (x) int64 24B 10 20 30
   * y           (y) int64 24B 90 80 70
 Dimensions without coordinates: time
 Data variables:
     infrared    (time, y, x) float64 144B 5.0 5.0 5.0 5.0 ... 5.0 5.0 5.0 5.0
     true_color  (time, y, x) float64 144B 6.0 6.0 6.0 6.0 ... 6.0 6.0 6.0 6.0,
 '/weather_data/temperature': <xarray.Dataset> Size: 192B
 Dimensions:          (time: 2, station: 6)
 Dimensions without coordinates: time, station
 Data variables:
     air_temperature  (time, station) float64 96B 3.0 3.0 3.0 3.0 ... 3.0 3.0 3.0
     dewpoint_temp    (time, station) float64 96B 4.0 4.0 4.0 4.0 ... 4.0 4.0 4.0
 Attributes:
     description:  Temperature, subnode of the weather data node, inheriting t...}
```

Convert to dict with the proposed DataTree.to_dict_nested method:

xdt.to_dict_nested()
```python
{
    'coords': {
        'time': {
            'dims': ('time',),
            'attrs': {'units': 'date', 'long_name': 'Time of acquisition'},
            'data': [datetime.datetime(2020, 12, 1, 0, 0), datetime.datetime(2020, 12, 2, 0, 0)]
        }
    },
    'attrs': {'description': 'Root Hypothetical DataTree with heterogeneous data: weather and satellite'},
    'dims': {'time': 2},
    'data_vars': {},
    'name': '(root)',
    'children': {
        'weather_data': {
            'coords': {
                'station': {
                    'dims': ('station',),
                    'attrs': {'units': 'dl', 'long_name': 'Station of acquisition'},
                    'data': ['a', 'b', 'c', 'd', 'e', 'f']
                }
            },
            'attrs': {'description': "Weather data node, inheriting the 'time' dimension"},
            'dims': {'station': 6, 'time': 2},
            'data_vars': {
                'wind_speed': {
                    'dims': ('time', 'station'),
                    'attrs': {'units': 'meter/sec', 'long_name': 'Wind speed'},
                    'data': [[2.0, 2.0, 2.0, 2.0, 2.0, 2.0], [2.0, 2.0, 2.0, 2.0, 2.0, 2.0]]
                },
                'pressure': {
                    'dims': ('time', 'station'),
                    'attrs': {'units': 'hectopascals', 'long_name': 'Time of acquisition'},
                    'data': [[3.0, 3.0, 3.0, 3.0, 3.0, 3.0], [3.0, 3.0, 3.0, 3.0, 3.0, 3.0]]
                }
            },
            'name': 'weather_data',
            'children': {
                'temperature': {
                    'coords': {},
                    'attrs': {'description': "Temperature, subnode of the weather data node, inheriting the 'time' dimension from root and 'station' dimension from the Temperature group."},
                    'dims': {'time': 2, 'station': 6},
                    'data_vars': {
                        'air_temperature': {
                            'dims': ('time', 'station'),
                            'attrs': {'units': 'kelvin', 'long_name': 'Air temperature'},
                            'data': [[3.0, 3.0, 3.0, 3.0, 3.0, 3.0], [3.0, 3.0, 3.0, 3.0, 3.0, 3.0]]
                        },
                        'dewpoint_temp': {
                            'dims': ('time', 'station'),
                            'attrs': {'units': 'kelvin', 'long_name': 'Dew point temperature'},
                            'data': [[4.0, 4.0, 4.0, 4.0, 4.0, 4.0], [4.0, 4.0, 4.0, 4.0, 4.0, 4.0]]
                        }
                    },
                    'name': 'temperature',
                    'children': {}
                }
            }
        },
        'satellite_image': {
            'coords': {
                'x': {'dims': ('x',), 'attrs': {}, 'data': [10, 20, 30]},
                'y': {'dims': ('y',), 'attrs': {}, 'data': [90, 80, 70]}
            },
            'attrs': {},
            'dims': {'x': 3, 'y': 3, 'time': 2},
            'data_vars': {
                'infrared': {
                    'dims': ('time', 'y', 'x'),
                    'attrs': {},
                    'data': [[[5.0, 5.0, 5.0], [5.0, 5.0, 5.0], [5.0, 5.0, 5.0]],
                             [[5.0, 5.0, 5.0], [5.0, 5.0, 5.0], [5.0, 5.0, 5.0]]]
                },
                'true_color': {
                    'dims': ('time', 'y', 'x'),
                    'attrs': {},
                    'data': [[[6.0, 6.0, 6.0], [6.0, 6.0, 6.0], [6.0, 6.0, 6.0]],
                             [[6.0, 6.0, 6.0], [6.0, 6.0, 6.0], [6.0, 6.0, 6.0]]]
                }
            },
            'name': 'satellite_image',
            'children': {}
        }
    }
}
```
print(json.dumps(xdt.to_dict_nested(), indent=4, default=str))
```python
{
    "coords": {
        "time": {
            "dims": [
                "time"
            ],
            "attrs": {
                "units": "date",
                "long_name": "Time of acquisition"
            },
            "data": [
                "2020-12-01 00:00:00",
                "2020-12-02 00:00:00"
            ]
        }
    },
    "attrs": {
        "description": "Root Hypothetical DataTree with heterogeneous data: weather and satellite"
    },
    "dims": {
        "time": 2
    },
    "data_vars": {},
    "name": "(root)",
    "children": {
        "weather_data": {
            "coords": {
                "station": {
                    "dims": [
                        "station"
                    ],
                    "attrs": {
                        "units": "dl",
                        "long_name": "Station of acquisition"
                    },
                    "data": [
                        "a",
                        "b",
                        "c",
                        "d",
                        "e",
                        "f"
                    ]
                }
            },
            "attrs": {
                "description": "Weather data node, inheriting the 'time' dimension"
            },
            "dims": {
                "station": 6,
                "time": 2
            },
            "data_vars": {
                "wind_speed": {
                    "dims": [
                        "time",
                        "station"
                    ],
                    "attrs": {
                        "units": "meter/sec",
                        "long_name": "Wind speed"
                    },
                    "data": [
                        [
                            2.0,
                            2.0,
                            2.0,
                            2.0,
                            2.0,
                            2.0
                        ],
                        [
                            2.0,
                            2.0,
                            2.0,
                            2.0,
                            2.0,
                            2.0
                        ]
                    ]
                },
                "pressure": {
                    "dims": [
                        "time",
                        "station"
                    ],
                    "attrs": {
                        "units": "hectopascals",
                        "long_name": "Time of acquisition"
                    },
                    "data": [
                        [
                            3.0,
                            3.0,
                            3.0,
                            3.0,
                            3.0,
                            3.0
                        ],
                        [
                            3.0,
                            3.0,
                            3.0,
                            3.0,
                            3.0,
                            3.0
                        ]
                    ]
                }
            },
            "name": "weather_data",
            "children": {
                "temperature": {
                    "coords": {},
                    "attrs": {
                        "description": "Temperature, subnode of the weather data node, inheriting the 'time' dimension from root and 'station' dimension from the Temperature group."
                    },
                    "dims": {
                        "time": 2,
                        "station": 6
                    },
                    "data_vars": {
                        "air_temperature": {
                            "dims": [
                                "time",
                                "station"
                            ],
                            "attrs": {
                                "units": "kelvin",
                                "long_name": "Air temperature"
                            },
                            "data": [
                                [
                                    3.0,
                                    3.0,
                                    3.0,
                                    3.0,
                                    3.0,
                                    3.0
                                ],
                                [
                                    3.0,
                                    3.0,
                                    3.0,
                                    3.0,
                                    3.0,
                                    3.0
                                ]
                            ]
                        },
                        "dewpoint_temp": {
                            "dims": [
                                "time",
                                "station"
                            ],
                            "attrs": {
                                "units": "kelvin",
                                "long_name": "Dew point temperature"
                            },
                            "data": [
                                [
                                    4.0,
                                    4.0,
                                    4.0,
                                    4.0,
                                    4.0,
                                    4.0
                                ],
                                [
                                    4.0,
                                    4.0,
                                    4.0,
                                    4.0,
                                    4.0,
                                    4.0
                                ]
                            ]
                        }
                    },
                    "name": "temperature",
                    "children": {}
                }
            }
        },
        "satellite_image": {
            "coords": {
                "x": {
                    "dims": [
                        "x"
                    ],
                    "attrs": {},
                    "data": [
                        10,
                        20,
                        30
                    ]
                },
                "y": {
                    "dims": [
                        "y"
                    ],
                    "attrs": {},
                    "data": [
                        90,
                        80,
                        70
                    ]
                }
            },
            "attrs": {},
            "dims": {
                "x": 3,
                "y": 3,
                "time": 2
            },
            "data_vars": {
                "infrared": {
                    "dims": [
                        "time",
                        "y",
                        "x"
                    ],
                    "attrs": {},
                    "data": [
                        [
                            [
                                5.0,
                                5.0,
                                5.0
                            ],
                            [
                                5.0,
                                5.0,
                                5.0
                            ],
                            [
                                5.0,
                                5.0,
                                5.0
                            ]
                        ],
                        [
                            [
                                5.0,
                                5.0,
                                5.0
                            ],
                            [
                                5.0,
                                5.0,
                                5.0
                            ],
                            [
                                5.0,
                                5.0,
                                5.0
                            ]
                        ]
                    ]
                },
                "true_color": {
                    "dims": [
                        "time",
                        "y",
                        "x"
                    ],
                    "attrs": {},
                    "data": [
                        [
                            [
                                6.0,
                                6.0,
                                6.0
                            ],
                            [
                                6.0,
                                6.0,
                                6.0
                            ],
                            [
                                6.0,
                                6.0,
                                6.0
                            ]
                        ],
                        [
                            [
                                6.0,
                                6.0,
                                6.0
                            ],
                            [
                                6.0,
                                6.0,
                                6.0
                            ],
                            [
                                6.0,
                                6.0,
                                6.0
                            ]
                        ]
                    ]
                }
            },
            "name": "satellite_image",
            "children": {}
        }
    }
}
```

(minified version):

{"coords":{"time":{"dims":["time"],"attrs":{"units":"date","long_name":"Time of acquisition"},"data":["2020-12-01 00:00:00","2020-12-02 00:00:00"]}},"attrs":{"description":"Root Hypothetical DataTree with heterogeneous data: weather and satellite"},"dims":{"time":2},"data_vars":{},"name":"(root)","children":{"weather_data":{"coords":{"station":{"dims":["station"],"attrs":{"units":"dl","long_name":"Station of acquisition"},"data":["a","b","c","d","e","f"]}},"attrs":{"description":"Weather data node, inheriting the 'time' dimension"},"dims":{"station":6,"time":2},"data_vars":{"wind_speed":{"dims":["time","station"],"attrs":{"units":"meter/sec","long_name":"Wind speed"},"data":[[2,2,2,2,2,2],[2,2,2,2,2,2]]},"pressure":{"dims":["time","station"],"attrs":{"units":"hectopascals","long_name":"Time of acquisition"},"data":[[3,3,3,3,3,3],[3,3,3,3,3,3]]}},"name":"weather_data","children":{"temperature":{"coords":{},"attrs":{"description":"Temperature, subnode of the weather data node, inheriting the 'time' dimension from root and 'station' dimension from the Temperature group."},"dims":{"time":2,"station":6},"data_vars":{"air_temperature":{"dims":["time","station"],"attrs":{"units":"kelvin","long_name":"Air temperature"},"data":[[3,3,3,3,3,3],[3,3,3,3,3,3]]},"dewpoint_temp":{"dims":["time","station"],"attrs":{"units":"kelvin","long_name":"Dew point temperature"},"data":[[4,4,4,4,4,4],[4,4,4,4,4,4]]}},"name":"temperature","children":{}}}},"satellite_image":{"coords":{"x":{"dims":["x"],"attrs":{},"data":[10,20,30]},"y":{"dims":["y"],"attrs":{},"data":[90,80,70]}},"attrs":{},"dims":{"x":3,"y":3,"time":2},"data_vars":{"infrared":{"dims":["time","y","x"],"attrs":{},"data":[[[5,5,5],[5,5,5],[5,5,5]],[[5,5,5],[5,5,5],[5,5,5]]]},"true_color":{"dims":["time","y","x"],"attrs":{},"data":[[[6,6,6],[6,6,6],[6,6,6]],[[6,6,6],[6,6,6],[6,6,6]]]}},"name":"satellite_image","children":{}}}}
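
To make the relationship between the two layouts concrete, here is a rough sketch that turns a path-keyed mapping (as produced by the existing `to_dict` once each Dataset has been dictified) into the nested layout shown above, where each node carries `coords`, `attrs`, `dims`, `data_vars`, `name` and `children`. It works on plain dictionaries only. The function name `nest_paths` is hypothetical, this is not the implementation proposed in this issue, and it assumes every key is an absolute path starting with `/`.

```python
from pathlib import PurePosixPath


def empty_node(name):
    """Return a node with the same keys as the nested layout shown above."""
    return {
        "coords": {},
        "attrs": {},
        "dims": {},
        "data_vars": {},
        "name": name,
        "children": {},
    }


def nest_paths(flat, root_name=None):
    """Turn {'/a/b': node_dict, ...} into one nested dict with 'children'."""
    root = empty_node(root_name)
    for path, content in flat.items():
        node = root
        # parts[0] is the leading '/', the rest are the group names to descend into.
        for part in PurePosixPath(path).parts[1:]:
            node = node["children"].setdefault(part, empty_node(part))
        # Fill in coords/attrs/dims/data_vars; 'name' and 'children' are kept.
        node.update(content)
    return root


flat = {
    "/": {"coords": {}, "attrs": {"top_level_attr": "Ho"}, "dims": {}, "data_vars": {}},
    "/parent": {
        "coords": {"dim_one": {"dims": ["dim_one"], "attrs": {}, "data": [10, 20, 30]}},
        "attrs": {},
        "dims": {"dim_one": 3, "dim_two": 2},
        "data_vars": {"child_1": {"dims": ["dim_one"], "attrs": {}, "data": [1, 2, 3]}},
    },
}

nested = nest_paths(flat)
print(list(nested["children"]))  # ['parent']
```

Intermediate groups that have no entry of their own in the flat mapping are created as empty nodes, which is a design choice of this sketch rather than documented behaviour.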

Load back to DataTree:

dt.DataTree.from_dict_nested(json.loads(json.dumps(xdt.to_dict_nested(), indent=4, default=str)))
DataTree('(root)', parent=None)
│   Dimensions:  (time: 2)
│   Coordinates:
│     * time     (time) <U19 152B '2020-12-01 00:00:00' '2020-12-02 00:00:00'
│   Data variables:
│       *empty*
│   Attributes:
│       description:  Root Hypothetical DataTree with heterogeneous data: weather...
├── DataTree('weather_data')
│   │   Dimensions:     (station: 6, time: 2)
│   │   Coordinates:
│   │     * station     (station) <U1 24B 'a' 'b' 'c' 'd' 'e' 'f'
│   │   Dimensions without coordinates: time
│   │   Data variables:
│   │       wind_speed  (time, station) float64 96B 2.0 2.0 2.0 2.0 ... 2.0 2.0 2.0 2.0
│   │       pressure    (time, station) float64 96B 3.0 3.0 3.0 3.0 ... 3.0 3.0 3.0 3.0
│   │   Attributes:
│   │       description:  Weather data node, inheriting the 'time' dimension
│   └── DataTree('temperature')
│           Dimensions:          (time: 2, station: 6)
│           Dimensions without coordinates: time, station
│           Data variables:
│               air_temperature  (time, station) float64 96B 3.0 3.0 3.0 3.0 ... 3.0 3.0 3.0
│               dewpoint_temp    (time, station) float64 96B 4.0 4.0 4.0 4.0 ... 4.0 4.0 4.0
│           Attributes:
│               description:  Temperature, subnode of the weather data node, inheriting t...
└── DataTree('satellite_image')
        Dimensions:     (x: 3, y: 3, time: 2)
        Coordinates:
          * x           (x) int64 24B 10 20 30
          * y           (y) int64 24B 90 80 70
        Dimensions without coordinates: time
        Data variables:
            infrared    (time, y, x) float64 144B 5.0 5.0 5.0 5.0 ... 5.0 5.0 5.0 5.0
            true_color  (time, y, x) float64 144B 6.0 6.0 6.0 6.0 ... 6.0 6.0 6.0 6.0

Remark: the time coordinate is downgraded to str because datetime values are not JSON serializable (hence default=str in the json.dumps call). The scope of this feature request is limited to the DataTree -> dict and dict -> DataTree conversions. Handling any type that JSON does not support by default is the responsibility of the user (as is currently the case with Dataset.to_dict and Dataset.from_dict).
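
As an illustration of what "the responsibility of the user" could look like, the sketch below keeps datetime values recoverable across a JSON round trip using only the standard library. The `__datetime__` tag and the helper names `encode_extra` and `decode_extra` are conventions invented for this example. They are not part of xarray or of this proposal, and the dictionary is a hand-written stand-in for a single coordinate entry.

```python
import json
from datetime import datetime


def encode_extra(obj):
    """Fallback for json.dumps: tag datetimes so they can be recognized later."""
    if isinstance(obj, datetime):
        return {"__datetime__": obj.isoformat()}
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


def decode_extra(dct):
    """object_hook for json.loads: turn tagged dicts back into datetimes."""
    if set(dct) == {"__datetime__"}:
        return datetime.fromisoformat(dct["__datetime__"])
    return dct


coord = {
    "dims": ["time"],
    "attrs": {"units": "date", "long_name": "Time of acquisition"},
    "data": [datetime(2020, 12, 1), datetime(2020, 12, 2)],
}

text = json.dumps(coord, default=encode_extra)
restored = json.loads(text, object_hook=decode_extra)
print(restored["data"][0])  # 2020-12-01 00:00:00
```

Unlike `default=str`, this keeps enough information to rebuild the original values, at the cost of a JSON file that is slightly less pleasant to edit by hand.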

However, the round-trip DataTree -> dict -> DataTree is guaranteed:

dt.DataTree.from_dict_nested(xdt.to_dict_nested())
DataTree('(root)', parent=None)
│   Dimensions:  (time: 2)
│   Coordinates:
│     * time     (time) datetime64[ns] 16B 2020-12-01 2020-12-02
│   Data variables:
│       *empty*
│   Attributes:
│       description:  Root Hypothetical DataTree with heterogeneous data: weather...
├── DataTree('weather_data')
│   │   Dimensions:     (station: 6, time: 2)
│   │   Coordinates:
│   │     * station     (station) <U1 24B 'a' 'b' 'c' 'd' 'e' 'f'
│   │   Dimensions without coordinates: time
│   │   Data variables:
│   │       wind_speed  (time, station) float64 96B 2.0 2.0 2.0 2.0 ... 2.0 2.0 2.0 2.0
│   │       pressure    (time, station) float64 96B 3.0 3.0 3.0 3.0 ... 3.0 3.0 3.0 3.0
│   │   Attributes:
│   │       description:  Weather data node, inheriting the 'time' dimension
│   └── DataTree('temperature')
│           Dimensions:          (time: 2, station: 6)
│           Dimensions without coordinates: time, station
│           Data variables:
│               air_temperature  (time, station) float64 96B 3.0 3.0 3.0 3.0 ... 3.0 3.0 3.0
│               dewpoint_temp    (time, station) float64 96B 4.0 4.0 4.0 4.0 ... 4.0 4.0 4.0
│           Attributes:
│               description:  Temperature, subnode of the weather data node, inheriting t...
└── DataTree('satellite_image')
        Dimensions:     (x: 3, y: 3, time: 2)
        Coordinates:
          * x           (x) int64 24B 10 20 30
          * y           (y) int64 24B 90 80 70
        Dimensions without coordinates: time
        Data variables:
            infrared    (time, y, x) float64 144B 5.0 5.0 5.0 5.0 ... 5.0 5.0 5.0 5.0
            true_color  (time, y, x) float64 144B 6.0 6.0 6.0 6.0 ... 6.0 6.0 6.0 6.0

TomNicholas commented 3 weeks ago

Thanks for raising this @etienneschalk !

I generally agree that methods for going to/from JSON would be generally useful, and that the methods should be consistent across Dataset/DataTree, but the .to_dict and .from_dict methods on DataTree are quite important and IMO shouldn't be changed. (They matter because many internal operations are currently implemented by first turning the DataTree into a dict, manipulating it, then turning the altered dict back into a DataTree.)

Instead I suggest we simply add new methods for your use case: .to_json_dict (or perhaps some other name) and .from_json_dict. The existing .to_dict method on Dataset should be aliased to point to the new method, with a deprecation warning raised.

Another option might be to add a json=True kwarg to .to_dict or similar to switch between the two behaviours.
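
For readers unfamiliar with the aliasing pattern suggested above, here is a generic sketch of an old method name that warns and then delegates to a new one. The class `Node` is a hypothetical stand-in and not xarray code. Only the method names `to_dict` and `to_json_dict` come from the comment above, and their bodies are placeholders.

```python
import warnings


class Node:
    """Hypothetical container used only to illustrate the aliasing pattern."""

    def __init__(self, attrs):
        self.attrs = dict(attrs)

    def to_json_dict(self):
        """New name: return plain Python containers only."""
        return {"attrs": dict(self.attrs)}

    def to_dict(self):
        """Old name, kept as a thin alias that warns before delegating."""
        warnings.warn(
            "to_dict is deprecated; use to_json_dict instead",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not at this method
        )
        return self.to_json_dict()


# Record warnings so the example can show that one was emitted.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = Node({"units": "m"}).to_dict()

print(result)                                 # {'attrs': {'units': 'm'}}
print([w.category.__name__ for w in caught])  # ['DeprecationWarning']
```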

etienneschalk commented 3 weeks ago

Hello @TomNicholas ,

I kept the existing to_dict and from_dict methods of datatree and renamed them to to_paths_dict and from_paths_dict, as they are mappings of string paths to xarray's data structures; they can still be used in the internal code.

The existing Dataset.to_dict entirely removes any trace of xarray's data structures and converts them to native Python data structures (dicts and lists) that are more easily serializable to JSON. My implementation relies heavily on reusing the Dataset.to_dict method itself, so the added logic is pretty light.

Rather than renaming the existing older Dataset.to_dict method, would it be possible to make a change of API in datatree while it is still not yet fully integrated into xarray and changes like this are more acceptable?

Regarding a switch, the only issue I see with an argument like json=True would be for typing: the to_dict method would now return a union of two types, and this can be annoying for users (the burden of type narrowing is passed onto the user).

etienneschalk commented 2 weeks ago

Regarding

Another option might be to add a json=True kwarg to .to_dict or similar to switch between the two behaviours.

I saw it is possible to define the return type of a function based on a boolean flag (https://github.com/python/mypy/issues/8634), so it might be possible to have both behaviours with the same function name, only changing the flag. The default would remain the existing behaviour of datatree's from_dict and to_dict since it is already in use. I can propose native as the flag name, as it really converts xarray data structures to native Python ones, which are more easily serializable to JSON (but it does not produce JSON directly).
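
A minimal sketch of the flag-dependent return type mentioned above, using `typing.overload` with `Literal`. The class `Tree` and the shape of its return values are hypothetical and only illustrate the typing pattern; the flag name `native` is the one suggested in this comment.

```python
from typing import Any, Literal, overload


class Tree:
    """Hypothetical container used only to illustrate the typing pattern."""

    def __init__(self, nodes: dict[str, object]) -> None:
        self._nodes = nodes

    @overload
    def to_dict(self, native: Literal[False] = ...) -> dict[str, object]: ...
    @overload
    def to_dict(self, native: Literal[True]) -> dict[str, dict[str, Any]]: ...

    def to_dict(self, native: bool = False):
        if native:
            # Convert each node to plain Python containers (placeholder logic).
            return {path: {"repr": repr(node)} for path, node in self._nodes.items()}
        # Default: keep the node objects themselves, keyed by path.
        return dict(self._nodes)


tree = Tree({"/": 42})
print(tree.to_dict())             # {'/': 42}
print(tree.to_dict(native=True))  # {'/': {'repr': '42'}}
```

With these two overloads a type checker can pick the return type from a literal `True` or `False` argument. A call that passes a plain `bool` variable would likely need a third overload accepting `bool`.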

Edit: I'm fine with just having to_native_dict and from_native_dict.