pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Shaping the future of Backends #8548


headtr1ck commented 9 months ago

What is your issue?

Backends in xarray are used to read and write files (or objects in general) and transform them into useful xarray Datasets.

This issue will collect ideas on how to continuously improve them.

Current state

Along the reading and writing process there are many implicit and explicit configuration possibilities. There are many backend-specific options and many encoder- and decoder-specific options. Most of them are currently difficult or even impossible to discover.

There is the infamous open_dataset method which can do everything, but there are also some specialized methods like open_zarr or to_netcdf.

The only really formalized way to extend xarray's capabilities is via the BackendEntrypoint, which currently covers only reading files. This has proven to work, and things are going so well that people are discussing getting rid of the specialized reading methods (#7495). A major critique in that thread is, again, the discoverability of configuration options.

Problems

To name a few:

What has already improved

The future

After listing all the problems, let's see how we can improve the situation and make backends an all-round solution for reading and writing all kinds of files.

What happens behind the scenes

In general the reading and writing of Datasets in xarray is a three-step process.

                       [ done by backend.open_dataset]
Dataset < chunking   < decoding < opening_in_store < file
Dataset > validating > encoding > storing_in_store > file

You could arguably combine chunking and decoding, as well as validating and encoding, into a single logical step each in this pipeline. This view should help decide how to set up a future architecture for backends.

You can see that there is a common intermediate object in this process: an in-memory representation of the file on disk, sitting between en-/decoding and the abstract store. This is actually an xarray.Dataset and is internally called a "backend dataset".
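To make the reading half of this pipeline concrete, here is a rough sketch using today's public API (xr.decode_cf and the decode_cf/chunks arguments do exist; the file name and chunk sizes are just placeholders). In practice xr.open_dataset drives all of these steps for you:

import xarray as xr

# opening_in_store: with decode_cf=False you get (roughly) the raw "backend dataset"
backend_ds = xr.open_dataset("file.nc", decode_cf=False)

# decoding: apply the CF conventions to the backend dataset
decoded = xr.decode_cf(backend_ds)

# chunking: wrap the variables in dask arrays (equivalent to passing chunks=...)
ds = decoded.chunk({"time": 100})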

write_dataset method

A quite natural extension of backends would be to implement a write_dataset method (name pending). This would allow backends to cover the complete right-hand side of the pipeline.
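A purely hypothetical sketch of what such a hook could look like on a backend; neither the method name nor its signature exists today, only the BackendEntrypoint base class does:

from xarray.backends import BackendEntrypoint

class MyBackendEntrypoint(BackendEntrypoint):
    def open_dataset(self, filename_or_obj, *, drop_variables=None, **kwargs):
        # existing read path: file -> opening_in_store -> backend dataset
        ...

    def write_dataset(self, dataset, filename_or_obj, *, mode="w", **kwargs):
        # hypothetical write path: encoded backend dataset -> storing_in_store -> file
        ...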

Transformer class

For lack of a common word for a class that handles both "encoding" and "decoding", I will call such a class a transformer here.

The process of en- and decoding is currently hardcoded in the respective open_dataset and to_netcdf methods. One could imagine introducing a common class that handles both.

This class could handle the implemented CF or netCDF encoding conventions, but it would also allow users to define their own storage conventions (why not create a custom transformer that adds indexes based on variable attributes?). The possibilities are endless, and an interface that fulfills all the requirements still has to be found.
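One possible shape for such a transformer interface, purely illustrative: neither Transformer nor CFTransformer exists in xarray today, and this CFTransformer simply wraps the existing xr.decode_cf for the reading direction.

import xarray as xr

class Transformer:
    def decode(self, backend_ds: xr.Dataset) -> xr.Dataset:
        # backend dataset (as produced by the store) -> user-facing Dataset
        raise NotImplementedError

    def encode(self, ds: xr.Dataset) -> xr.Dataset:
        # user-facing Dataset -> backend dataset ready to be stored
        raise NotImplementedError

class CFTransformer(Transformer):
    def __init__(self, cftime: bool = False):
        self.cftime = cftime

    def decode(self, backend_ds: xr.Dataset) -> xr.Dataset:
        return xr.decode_cf(backend_ds, use_cftime=self.cftime)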

This would homogenize the reading and writing process to

Dataset <> Transformer <> Backend <> file

As a bonus this would increase the discoverability of the decoding options (which would then be transformer arguments).

The new interface then could be

backend = Netcdf4BackendEntrypoint(group="data")
decoder = CFTransformer(cftime=True)
ds = xr.open_dataset("file.nc", engine=backend, decoder=decoder)

while of course still allowing all options to be passed simply as kwargs (since this is still the easiest way of telling beginners how to open files)

The final improvement here would be to add additional entrypoints for these transformers ;)
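For illustration, registration could mirror how backends are registered today; the "xarray.backends" entry point group is real, while "xarray.transformers" is made up for this sketch:

from setuptools import setup

setup(
    name="my-xarray-plugin",
    entry_points={
        "xarray.backends": ["my_engine = my_package:MyBackendEntrypoint"],  # exists today
        "xarray.transformers": ["my_conventions = my_package:MyTransformer"],  # hypothetical group
    },
)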

Disclaimer

Now, this issue is just a bunch of random ideas that require quite some refinement, or they might even turn out to be nonsense. So let's have an exciting discussion about these things :) If you have something to add to the above points, I will include your ideas as well. This is meant as a collection of ideas on how to improve our backends :)

keewis commented 9 months ago

see also #5954 for a previous discussion of the write_dataset idea (the name I proposed there was xr.save_dataset to be symmetric with save_mfdataset)

TomNicholas commented 9 months ago

> For lack of a common word for a class that handles both "encoding" and "decoding", I will call such a class a transformer here.
>
> The process of en- and decoding is currently hardcoded in the respective open_dataset and to_netcdf methods. One could imagine introducing a common class that handles both.
>
> This class could handle the implemented CF or netCDF encoding conventions.

Doesn't this already exist as xarray.coding.VariableCoder? It has .encode and .decode methods. Are we basically just talking about making it public, allowing users to pass in custom subclasses of VariableCoder, and generalizing xarray.conventions to be configurable for non-CF cases?

> Why not create a custom transformer that adds indexes based on variable attributes?

On the other hand, this suggestion seems to be something that could not be immediately handled by the current VariableCoder design.
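For reference, the existing coder interface lives in a semi-private module and operates on individual variables. A custom subclass would look roughly like this; the scale-by-attribute logic is invented purely for illustration:

from xarray import Variable
from xarray.coding.variables import VariableCoder

class ScaleByAttrCoder(VariableCoder):
    """Illustrative coder that applies a scale factor stored in attrs."""

    def encode(self, variable: Variable, name=None) -> Variable:
        scale = variable.attrs.get("my_scale", 1)
        return Variable(variable.dims, variable.data / scale, variable.attrs, variable.encoding)

    def decode(self, variable: Variable, name=None) -> Variable:
        scale = variable.attrs.get("my_scale", 1)
        return Variable(variable.dims, variable.data * scale, variable.attrs, variable.encoding)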

dcherian commented 9 months ago

Agree that these "transformers" are called "coders" ATM, linking this quite old proposal! https://github.com/pydata/xarray/issues/155

TomNicholas commented 1 month ago

Can these transformers/coders just be new zarr codecs? Exposing xarray's decoding logic in a way that follows that interface would allow for zarr to become a "universal reader" - see https://github.com/zarr-developers/zarr-specs/issues/303.