Improve how we store metadata

sfinkens commented 9 months ago

Feature Request

Is your feature request related to a problem? Please describe.

We repeatedly face encoding/decoding problems in the CF writer, so I have been thinking a bit about how we store metadata and possible alternatives.

Pros

Currently we store metadata in dataset attributes. This is very convenient, because we can just attach any Python type (string, array, dict etc) to a dataset.

Cons

Over time attributes have become more complex: From a simple sensor name to orbital parameters, time parameters, calibration coefficients etc. Now we're experiencing some problems with that approach

There's no way to attach dimensions to attributes, for example time-dependent attributes . So these attributes are dropped when concatenating datasets along the time dimension.
There's no (xarray-native) way to attach units to attributes
Encoding/decoding is complicated, especially when we want to achieve a smooth roundtrip. For example: Encode dataset attribute to netCDF attribute, then decode back to original Python type. A significant part of the CF reader/writer is about attribute encoding/decoding.

Describe the solution you'd like

I think many attributes could be stored as Data Arrays or Datasets. Then we could leverage the full potential of the xarray data model combined with CF conventions. We would still be using datasets attributes, but mainly strings (such as units, calendar etc).

When we are moving towards data trees (https://github.com/pytroll/satpy/issues/2605), we could organize them in a "metadata" branch (either global or dataset-specific). For example:

DataTree('None', parent=None)
├── DataTree('ch1')
│   ├── DataTree('data')
│   │       Dimensions:  (time: 1, y: 2, x: 2)
│   │       Data variables:
│   │           ch1      (time, y, x) int64 1 2 3 4
│   └── DataTree('metadata')
│       └── DataTree('calibration_coefficients')
│               Dimensions:  (time: 1)
│               Data variables:
│                   gain     (time) float64 0.1
|                       Attributes:
|                           units:    W/m2
│                   offset   (time) int64 -51
|                       Attributes:
|                           units:    1
└── DataTree('metadata')
    ├── DataTree('orbital_parameters')
    │       Dimensions:                     (time: 1)
    │       Dimensions without coordinates: time
    │       Data variables:
    │           projection_longitude        (time) int64 0
    |               Attributes:
    |                   units:    degrees east
    │           satellite_actual_longitude  (time) float64 -0.01
    |               Attributes:
    |                   units:    degrees east
    └── DataTree('time_parameters')
            Dimensions:                 (time: 1)
            Dimensions without coordinates: time
            Data variables:
                nominal_start_time      (time) int64 0
                    Attributes:
                        units:    nanoseconds since 1970-01-01
                observation_start_time  (time) int64 1
                    Attributes:
                        units:    nanoseconds since 1970-01-01

Describe any changes to existing user workflow Users would have to adapt to the new data structure.

Additional context For units, we could use pint already.

djhoese commented 9 months ago

As an additional option for some of these fields, I wonder if we could store them as coordinate variables?

I'm not a huge fan of the "metadata" DataTree, but I also really don't have a good alternative except for maybe coordinate variables. Just so we have the whole picture, besides units can you think of other attributes that these metadata "things" need to keep track of?

In some sense this feels like a failing of the NetCDF model. Where we might want units or other information for attributes we can't do it. I feel like most NetCDF creators have leaned towards external documentation that describes the units of attributes. Not that I want to necessarily do that, but could that simplify this if we "forced" that all distances are in meters, all times are epoch nanoseconds, etc?

sfinkens commented 9 months ago

Indeed, coordinate variables would work, nice idea!

<xarray.Dataset>
Dimensions:               (time: 1, y: 2, x: 2, tle_dim: 2)
Coordinates:
  * time                  (time) int64 0
  * y                     (y) int64 0 1
  * x                     (x) int64 0 1
    projection_longitude  (time) int64 0
        Attributes:
            units:    degrees east
    nominal_start_time    (time) int64 0
        Attributes:
            units:    nanoseconds since 1970-01-01
    two_line_elements     (time, tle_dim) int64 1 2
Data variables:
    ch1                   (time, y, x) int64 1 2 3 4

Just so we have the whole picture, besides units can you think of other attributes that these metadata "things" need to keep track of?

Looking at the Metadata section in the documentation, there are also attributes for area definition and raw metadata. The area def can be stored as coordinate variable, too. And raw metadata is probably too special...

Not that I want to necessarily do that, but could that simplify this if we "forced" that all distances are in meters, all times are epoch nanoseconds, etc?

This would certainly help with units, but not with dimensions or encoding.

mraspaud commented 8 months ago

I think it's a good idea to improve the metadata storage indeed.

Coordinate variables are good for some things. For example, in rioxarray, projection information is store in a spatial_ref coordinate variable, which I think is interesting. As we have it now, its info is equivalent to what we have in our area attribute, but we could use the same idea for other info, eg we could have an orbital_ref for everything related to orbital information.

But then comes the question of what really belongs as a coordinate variable. I see coordinate variables as variables that are related in one way or another to dimension/coordinates, hence eg calibration coefficients would not really fit there.

In the first example you provide, the gain and offset are aligned to the time dimension, is this intentional, or should they be dimensionless?

sfinkens commented 8 months ago

calibration coefficients would not really fit there

Good point.

In the first example you provide, the gain and offset are aligned to the time dimension, is this intentional, or should they be dimensionless?

I wanted to indicate that coefficients can be time-dependent, but it isn't strictly needed. xr.concat(datasets, dim="time") also produces a timeseries for scalars.

pytroll / satpy