spencerahill / aospy

Python package for automated analysis and management of gridded climate data

Add support for zarr #316

Open spencerahill opened 5 years ago

spencerahill commented 5 years ago

Zarr is becoming the format of choice for N-D data on the cloud, with heavy usage in e.g. pangeo. @spencerkclark also has found some compelling use cases for it over netCDF on the GFDL analysis cluster, i.e. not on the cloud. And xarray has a very clean interface for zarr IO: Dataset.to_zarr and xr.open_zarr.

As such, and based on offline conversations with @rabernat and @spencerkclark about using aospy within pangeo, I think it makes sense for aospy to provide zarr support. It remains to decide how best to do that.

The open_zarr function returns a Dataset object just as open_mfdataset and open_dataset do. So it's really purely a matter of I/O: once a Dataset is loaded from a zarr store, we can proceed with the rest of our pipeline just as if it had come from a netCDF file (hooray for clean interfaces!).

For input, then, insofar as users save their zarr stores with a .zarr extension, we can simply use the extension to choose the loading method, e.g. something like

import xarray as xr

# dispatch on the extension of the user-provided path
if filename.endswith('.zarr'):
    open_method = xr.open_zarr
elif filename.endswith('.nc'):
    open_method = xr.open_mfdataset

There are most likely some additional complications I haven't thought of yet, but this seems like a reasonable approach at this stage. I've thought less about output, but my impression is that a similar approach would do the trick. Alternatively, we could allow users to specify the filetype via a flag rather than relying on the extension, as in the sketch below.
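
For illustration, a flag-based version might look like this (the open_data helper and its filetype argument are hypothetical, not existing aospy API):

import xarray as xr

def open_data(path, filetype='netcdf'):
    # hypothetical helper: an explicit flag instead of extension sniffing
    if filetype == 'zarr':
        return xr.open_zarr(path)
    elif filetype == 'netcdf':
        return xr.open_mfdataset(path)
    raise ValueError('unknown filetype: {}'.format(filetype))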

@spencerkclark any initial thoughts? And CCing @rabernat for any thoughts in case I'm leading us astray here.

spencerahill commented 5 years ago

Go figure, this is a bit trickier than I initially expected. Most importantly, open_mfdataset and open_zarr have different call signatures, and not all of the options we use in open_mfdataset are provided for open_zarr:

  • concat_dim, data_vars, coords: all used in the call to auto_combine within open_mfdataset, but open_zarr doesn't use auto_combine

  • preprocess: looks pretty easy to port into open_zarr from open_mfdataset

@rabernat, do you know why open_zarr doesn't use auto_combine whereas open_mfdataset does?

spencerkclark commented 5 years ago

Thanks for pinging this again, @spencerahill; sorry for neglecting to respond earlier. You bring up some good points.

For input, then, insofar as users save their zarr files with the .zarr extension, we can simply use that to choose which method we use to load.

My initial thought was actually to create a separate DataLoader class strictly for zarr stores, rather than try to detect which opening method to use based on the file name. This would allow us to cleanly separate the logic for loading data from zarr versus from netCDF (and potentially work around the differences between the capabilities of open_zarr and open_mfdataset without disrupting our existing code too much).
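
Roughly something like the following sketch (the ZarrDataLoader name and its _load_data hook are placeholders, not existing aospy API):

import xarray as xr
from aospy.data_loader import DataLoader

class ZarrDataLoader(DataLoader):
    """Hypothetical loader for a single zarr store."""
    def __init__(self, store_path):
        self.store_path = store_path

    def _load_data(self):
        # open_zarr returns a Dataset, so everything downstream is unchanged
        return xr.open_zarr(self.store_path)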

Go figure, this is a bit trickier than I initially expected. Most importantly, open_mfdataset and open_zarr have different call signatures, and not all of the options we use in open_mfdataset are provided for open_zarr:

  • concat_dim, data_vars, coords: all used in the call to auto_combine within open_mfdataset, but open_zarr doesn't use auto_combine

I think if we focus on single-zarr-store datasets for now (just as a proof of concept) it would be OK that open_zarr does not support automatically concatenating stores upon loading.

  • preprocess: looks pretty easy to port into open_zarr from open_mfdataset

The preprocess argument seems to have been added in xarray with the motivation that one might need to apply some logic before concatenating datasets (so I guess it is not surprising it did not make it into open_zarr); we use it slightly differently in aospy (as a way for users to correct issues with CF-compliance before things are decoded, for instance, since they can't touch the data once it enters the pipeline). It might involve a little more work, but I think we could work around this too.

I'm not sure if there is a strong case for adding a preprocess argument to open_zarr in its current state (i.e. without concatenation), because in a typical script you can always open a store with decode_cf=False, correct any issues, and decode things later.
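
That is, something like this sketch, where store_path and fix_cf_issues stand in for the user's store and whatever corrections they need:

import xarray as xr

# open without CF decoding, apply fix-ups, then decode manually
ds = xr.open_zarr(store_path, decode_cf=False)
ds = fix_cf_issues(ds)
ds = xr.decode_cf(ds)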

spencerahill commented 5 years ago

My initial thought was actually to create a separate DataLoader class strictly for zarr stores, rather than try to detect which opening method to use based on the file name. This would allow us to cleanly separate the logic for loading data from zarr versus from netCDF (and potentially work around the differences between the capabilities of open_zarr and open_mfdataset without disrupting our existing code too much).

Yes, that would be a much easier first step. I guess my long-term vision was for users to not even have to worry about whether their data is zarr or netCDF, but I suppose that's getting too far ahead of things.

So for this current, proof-of-concept stage, I think @spencerkclark you're right that something like a simple ZarrDataLoader is the way to proceed.

(Actually, that leads to a new idea: could we (eventually) separate the logic of what type the data store is (zarr vs. netCDF) from the description of how the files are organized? Then we could use composition to specify any combination, e.g. a NestedDictDataLoader backed by zarr stores vs. the same loader backed by netCDF files; see the sketch below.)
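
A rough sketch of what I mean, where the open_method parameter is hypothetical and not part of the current NestedDictDataLoader:

import xarray as xr

class NestedDictDataLoader:
    """Sketch: file-organization logic parameterized by storage backend."""
    def __init__(self, file_map, open_method=xr.open_mfdataset):
        self.file_map = file_map
        # e.g. pass open_method=xr.open_zarr for zarr-backed data
        self.open_method = open_method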

I'm not sure if there is a strong case for adding a preprocess argument to open_zarr in its current state (i.e. without concatenation), because in a typical script you can always open a store with decode_cf=False, correct any issues, and decode things later.

OK, that's fine by me. We should be able to replicate this logic ourselves within aospy for zarr data, because I do think we need it.

rabernat commented 5 years ago

Sorry for the slow reply here.

@rabernat, do you know why open_zarr doesn't use auto_combine whereas open_mfdataset does?

There are some points related to this topic on the pangeo website. The reason that open_zarr doesn't have these options is that open_zarr is analogous to open_dataset. There is no open_mfzarr function yet (although that would be a great xarray PR!).
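
For reference, a manual stand-in could just compose open_zarr with auto_combine, e.g. (with made-up store paths):

import xarray as xr

# open each store individually, then combine along a shared dimension
stores = ['part1.zarr', 'part2.zarr']
ds = xr.auto_combine([xr.open_zarr(s) for s in stores], concat_dim='time')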

The way we are using zarr, however, generally makes that sort of function obsolete. To produce zarr datasets, we commonly do something like

import xarray as xr

# consolidate many netCDF files into a single zarr store
ds = xr.open_mfdataset('*.nc')
ds.to_zarr('big_dataset.zarr')

In other words, datasets that were originally stored in hundreds or thousands of netCDF files are now stored in a single zarr store (which may contain many files internally, but zarr handles that part).

Actually that leads to a new idea: could we (eventually) separate the logic of what the type of data store is (zarr vs. netcdf) from the description of how the files are organized?

This sounds a lot like what intake does. You might get more mileage out of first refactoring around intake. Then you would be able to outsource all of the file loading stuff. The pangeo intake catalog for example contains both multi-netcdf file datasets and zarr datasets. The user doesn't ever have to care what the underlying driver is.
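
For example, the access pattern is the same regardless of the backend (the catalog path and entry name here are made up, and this assumes the intake-xarray driver):

import intake

cat = intake.open_catalog('catalog.yaml')
ds = cat.some_dataset.to_dask()  # an xarray Dataset, whether netCDF or zarr underneath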

Speaking of intake, have you seen this? https://github.com/NCAR/intake-cmip

spencerahill commented 5 years ago

Thanks much @rabernat.

There are some points related to this topic on the pangeo website. The reason that open_zarr doesn't have these options is that open_zarr is analogous to open_dataset. There is no open_mfzarr function yet (although that would be a great xarray PR!).

Ah, duh. If the use case for an open_mfzarr arises for us, then I'd definitely be keen to contribute. But, as you say, that's not typically how folks have been using zarr.

This sounds a lot like what intake does. You might get more mileage out of first refactoring around intake. Then you would be able to outsource all of the file loading stuff. The pangeo intake catalog for example contains both multi-netcdf file datasets and zarr datasets. The user doesn't ever have to care what the underlying driver is.

Good point; I'll open a separate issue for us to discuss this. It's been on our radar for a while, but so far we haven't had a compelling reason to switch. Now that intake is getting more and more adoption, including through pangeo (and intake-cmip...very cool!), perhaps that's no longer the case.


So, all that said, I think @spencerkclark's idea of starting with a simple ZarrDataLoader as a proof of concept is the best way to proceed.