pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Adding CDL Parser/`open_cdl`? #6269

Open nbren12 opened 2 years ago

nbren12 commented 2 years ago

Is your feature request related to a problem?

No.

Describe the solution you'd like

It would be nice to load/generate xarray datasets from Common Data Language (CDL) descriptions. CDL is a DSL that defines a netCDF dataset, and it is quite nice for testing. We use it to build mock datasets, e.g. for integration testing of plotting routines and complex data analysis. CDL provides a concise format for storing the schema of this data. This schema can be used for validation or generation (using the ncgen command-line tool).

CDL is basically the format produced by xarray.Dataset.info. It looks like this:

  netcdf example {   // example of CDL notation
  dimensions:
      lon = 3 ;
      lat = 8 ;
  variables:
      float rh(lon, lat) ;
          rh:units = "percent" ;
          rh:long_name = "Relative humidity" ;
  // global attributes
      :title = "Simple example, lacks some conventions" ;
  data:
      // optional; ncgen will still build without it
      rh =
          2, 3, 5, 7, 11, 13, 17, 19,
          23, 29, 31, 37, 41, 43, 47, 53,
          59, 61, 67, 71, 73, 79, 83, 89 ;
  }

I wrote a small pure-Python parser for CDL last night and it seems to work! There are similar projects on GitHub. Sadly, these projects seem to be abandoned, so it would be nice to attach this to an effort like xarray.
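
For illustration only (this is not the parser mentioned above): a minimal sketch of how the header of a simple CDL file might be parsed with the Python standard library. The function name `parse_cdl_header` and the layout of the returned dict are hypothetical. The sketch assumes fixed-size dimensions, variables that have at least one dimension, and double-quoted string attributes that do not contain `//`; it ignores the data section entirely.

```python
import re

EXAMPLE = """
netcdf example {   // example of CDL notation
dimensions:
    lon = 3 ;
    lat = 8 ;
variables:
    float rh(lon, lat) ;
        rh:units = "percent" ;
        rh:long_name = "Relative humidity" ;
// global attributes
    :title = "Simple example, lacks some conventions" ;
data:
    rh =
        2, 3, 5, 7, 11, 13, 17, 19,
        23, 29, 31, 37, 41, 43, 47, 53,
        59, 61, 67, 71, 73, 79, 83, 89 ;
}
"""


def parse_cdl_header(text):
    """Parse dimensions, variables, and string attributes (hypothetical helper)."""
    # Strip // comments (assumes no attribute value contains "//").
    text = re.sub(r"//[^\n]*", "", text)
    # Keep only the part before the optional data: section.
    header = re.split(r"^\s*data:", text, maxsplit=1, flags=re.M)[0]

    name = re.search(r"netcdf\s+(\S+)\s*\{", header).group(1)

    # Dimension declarations look like "lon = 3 ;".
    dims_block = re.search(r"dimensions:(.*?)variables:", header, flags=re.S).group(1)
    dims = {d: int(n) for d, n in re.findall(r"(\w+)\s*=\s*(\d+)\s*;", dims_block)}

    # Variable declarations look like "float rh(lon, lat) ;".
    variables = {}
    decl = r"^\s*(\w+)\s+(\w+)\s*\(([^)]*)\)\s*;"
    for dtype, var, var_dims in re.findall(decl, header, flags=re.M):
        variables[var] = {
            "dtype": dtype,
            "dims": tuple(d.strip() for d in var_dims.split(",")),
            "attrs": {},
        }

    # Attributes look like 'rh:units = "percent" ;'; an empty prefix means global.
    global_attrs = {}
    attr = r'^\s*(\w*):(\w+)\s*=\s*"([^"]*)"\s*;'
    for var, key, value in re.findall(attr, header, flags=re.M):
        target = variables[var]["attrs"] if var else global_attrs
        target[key] = value

    return {"name": name, "dims": dims, "variables": variables, "attrs": global_attrs}


schema = parse_cdl_header(EXAMPLE)
print(schema["dims"], schema["variables"]["rh"]["dims"])
```

A real parser would also need to handle unlimited dimensions, numeric attributes, scalar variables, and the data section, which is why a proper grammar is likely a better fit than regular expressions.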

Describe alternatives you've considered

Some kind of schema object that can be used to validate or generate an xarray Dataset, but does not contain any data.
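
As a rough sketch of what such a data-free schema object could look like (all class and method names here are hypothetical, and nothing below is part of xarray), the following uses plain dataclasses to check variable shapes against declared dimensions and to render the schema as a CDL header. It assumes string-valued attributes only.

```python
from dataclasses import dataclass, field


@dataclass
class VariableSchema:
    dtype: str                  # CDL type name, e.g. "float"
    dims: tuple                 # dimension names, in order
    attrs: dict = field(default_factory=dict)


@dataclass
class DatasetSchema:
    dims: dict                  # dimension name -> size
    variables: dict             # variable name -> VariableSchema

    def validate(self, shapes):
        """Return a list of problems for a mapping of variable name -> shape."""
        problems = []
        for name, var in self.variables.items():
            expected = tuple(self.dims[d] for d in var.dims)
            if name not in shapes:
                problems.append(f"missing variable {name!r}")
            elif tuple(shapes[name]) != expected:
                problems.append(
                    f"{name!r}: expected shape {expected}, got {tuple(shapes[name])}"
                )
        return problems

    def to_cdl(self, name="example"):
        """Render the schema as a CDL header without a data section."""
        lines = [f"netcdf {name} {{", "dimensions:"]
        lines += [f"    {d} = {n} ;" for d, n in self.dims.items()]
        lines.append("variables:")
        for vname, var in self.variables.items():
            lines.append(f"    {var.dtype} {vname}({', '.join(var.dims)}) ;")
            lines += [f'        {vname}:{k} = "{v}" ;' for k, v in var.attrs.items()]
        lines.append("}")
        return "\n".join(lines)


rh_schema = DatasetSchema(
    dims={"lon": 3, "lat": 8},
    variables={"rh": VariableSchema("float", ("lon", "lat"), {"units": "percent"})},
)
print(rh_schema.to_cdl())
print(rh_schema.validate({"rh": (8, 3)}))
```

The shapes passed to `validate` could come from any array-like container; keeping the schema independent of the data library is a design choice of this sketch, not a requirement.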

Additional context

No response

kmuehlbauer commented 2 years ago

@nbren12 The other way round would be useful too. Aren't there xarray extension packages around where this would fit into?

nbren12 commented 2 years ago

Aren't there xarray extension packages around where this would fit into?

I'm not sure. Any suggestions? I'm just wondering whether xarray has left the door open to this kind of contribution, since it

  1. already supports other I/O backends
  2. creates CDL using ds.info().

dcherian commented 2 years ago

@nbren12 See https://docs.xarray.dev/en/stable/internals/how-to-add-new-backend.html for adding a new backend. That way we could have

xr.open_dataset("schema.cdl", engine="cdl")

kmuehlbauer commented 2 years ago

  1. creates CDL using ds.info().

Great, this somehow went past me.

nbren12 commented 2 years ago

To be fair, ds.info is not 100% CDL, but it's darn close.

jhamman commented 2 years ago

To be fair, ds.info is not 100% CDL, but it's darn close.

I think making ds.info CDL compliant would be a great feature addition.

Describe alternatives you've considered

Some kind of schema object that can be used to validate or generate an xarray Dataset, but does not contain any data.

You may be interested in xarray-schema then. We're actively working on and using this project, and would be more than happy to think about how a CDL-like schema fits in there.

nbren12 commented 2 years ago

@jhamman We have a similar schema package: https://github.com/ai2cm/fv3net/tree/master/external/synth. It's cool to see you confronting the same challenges and advertising your solutions more broadly. One problem we had is that our schema objects ended up being quite verbose: https://github.com/ai2cm/fv3net/blob/master/external/loaders/tests/test__batch/one_step_zarr_schema.json. CDL is a lot more concise.