Open spencerahill opened 5 years ago
Go figure, this is a bit trickier than I initially expected. Most importantly, `open_mfdataset` and `open_zarr` have different call signatures, and not all of the options we use in `open_mfdataset` are provided for `open_zarr`:

- `preprocess`: looks pretty easy to port into `open_zarr` from `open_mfdataset`
- `concat_dim`, `data_vars`, `coords`: all used in the call to `auto_combine` within `open_mfdataset`, but `open_zarr` doesn't use `auto_combine`

@rabernat, do you know why `open_zarr` doesn't use `auto_combine` whereas `open_mfdataset` does?
Thanks for pinging this again @spencerahill; sorry for neglecting to respond earlier. You bring up some good points.

> For input, then, insofar as users save their zarr files with the `.zarr` extension, we can simply use that to choose which method we use to load.

My initial thought was actually to create a separate `DataLoader` class strictly for zarr stores, rather than try to detect which opening method to use based on the file name. This would allow us to cleanly separate the logic for loading data from zarr versus from netCDF (and potentially work around the differences between the capabilities of `open_zarr` and `open_mfdataset` without disrupting our existing code too much).
> Go figure, this is a bit trickier than I initially expected. Most importantly, `open_mfdataset` and `open_zarr` have different call signatures, and not all of the options we use in `open_mfdataset` are provided for `open_zarr`:
>
> `concat_dim`, `data_vars`, `coords`: all used in the call to `auto_combine` within `open_mfdataset`, but `open_zarr` doesn't use `auto_combine`

I think if we focus on single-zarr-store datasets for now (just as a proof of concept) it would be OK that `open_zarr` does not support automatically concatenating stores upon loading.

> `preprocess`: looks pretty easy to port into `open_zarr` from `open_mfdataset`

The `preprocess` argument seems to have been added in xarray with the motivation that one might need to apply some logic before concatenating datasets (so I guess it is not surprising it did not make it into `open_zarr`); we use it slightly differently in aospy (as a way for users to correct issues with CF-compliance before things are decoded, for instance, since they can't touch the data once it enters the pipeline). It might involve a little more work, but I think we could work around this too.

I'm not sure if there is a strong case for adding a `preprocess` argument to `open_zarr` in its current state (i.e. without concatenation), because in a typical script you can always open a store with `decode_cf=False`, correct any issues, and decode things later.
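The open-then-decode workflow could be sketched roughly as follows. `fix_time_units` is a hypothetical repair function, and the malformed units string is made up for illustration; the xarray calls in the comments assume the usual `xr.open_zarr`/`xr.decode_cf` API.

```python
def fix_time_units(attrs):
    """Hypothetical pre-decoding repair: normalize a malformed CF
    time-units string so xarray can decode the time coordinate.

    The malformed value below is invented purely for illustration.
    """
    if attrs.get('units') == 'days since 0001-01-01-00000':
        attrs = {**attrs, 'units': 'days since 0001-01-01'}
    return attrs

# With xarray, the workflow would look like (sketch, not run here):
#   ds = xr.open_zarr('run.zarr', decode_cf=False)       # raw, undecoded
#   ds['time'].attrs = fix_time_units(ds['time'].attrs)  # repair metadata
#   ds = xr.decode_cf(ds)                                # decode after the fix
```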
> My initial thought was actually to create a separate `DataLoader` class strictly for zarr stores, rather than try to detect which opening method to use based on the file name. This would allow us to cleanly separate the logic for loading data from zarr versus from netCDF (and potentially work around the differences between the capabilities of `open_zarr` and `open_mfdataset` without disrupting our existing code too much).
Yes, that would be a much easier first step. I guess my long term vision was for users not even having to worry about whether their data was zarr or netCDF, but I suppose that's too far ahead of things.
So for this current, proof-of-concept stage, I think @spencerkclark you're right that something like a simple `ZarrDataLoader` is the way to proceed.

(Actually that leads to a new idea: could we (eventually) separate the logic of what the type of data store is (zarr vs. netCDF) from the description of how the files are organized? Then we could use composition to specify any combination, e.g. a `NestedDictDataLoader` that uses zarr files vs. the same but that uses netCDF.)
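One way to picture that composition, as a sketch with hypothetical names (this is not aospy's actual API): the file organization lives in the loader class, while the opener function (e.g. `xr.open_zarr` vs. `xr.open_mfdataset`) is injected as a plain callable.

```python
class NestedDictDataLoader:
    """Sketch of the composition idea (names hypothetical).

    Knows how files are organized -- here, a dict mapping variable
    names to paths -- and delegates the actual opening to an injected
    callable such as xr.open_zarr or xr.open_mfdataset.
    """

    def __init__(self, path_dict, opener):
        self.path_dict = path_dict
        self.opener = opener  # any callable taking a path, returning a Dataset

    def load(self, var_name):
        return self.opener(self.path_dict[var_name])
```

Any combination of file organization and store type then comes from pairing a loader class with an opener, rather than writing one class per combination.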
> I'm not sure if there is a strong case for adding a `preprocess` argument to `open_zarr` in its current state (i.e. without concatenation), because in a typical script you can always open a store with `decode_cf=False`, correct any issues, and decode things later.
OK, that's fine by me. We should be able to replicate this logic ourselves within aospy for zarr data, because I do think we need it.
Sorry for the slow reply here.
> @rabernat, do you know why `open_zarr` doesn't use `auto_combine` whereas `open_mfdataset` does?
There are some points related to this topic on the pangeo website. The reason that `open_zarr` doesn't have these options is that `open_zarr` is analogous to `open_dataset`. There is no `open_mfzarr` function yet (although that would be a great xarray PR!).
The way we are using zarr, however, generally makes that sort of function obsolete. To produce zarr datasets, we commonly do something like

```python
import xarray as xr

# Open many netCDF files as a single dataset, then write one zarr store
ds = xr.open_mfdataset('*.nc')
ds.to_zarr('big_dataset.zarr')
```
In other words, datasets that were originally stored in hundreds or thousands of netcdf files are now stored in a single zarr store (which may contain many files, but zarr handles that part).
> Actually that leads to a new idea: could we (eventually) separate the logic of what the type of data store is (zarr vs. netcdf) from the description of how the files are organized?
This sounds a lot like what intake does. You might get more mileage out of first refactoring around intake. Then you would be able to outsource all of the file loading stuff. The pangeo intake catalog for example contains both multi-netcdf file datasets and zarr datasets. The user doesn't ever have to care what the underlying driver is.
Speaking of intake, have you seen this? https://github.com/NCAR/intake-cmip
Thanks much @rabernat.
> There are some points related to this topic on the pangeo website. The reason that `open_zarr` doesn't have these options is that `open_zarr` is analogous to `open_dataset`. There is no `open_mfzarr` function yet (although that would be a great xarray PR!).
Ah, duh. If the use case arises for us for an `open_mfzarr`, then I'd definitely be keen to contribute. But, as you say, that's not typically how folks have been using zarr.
> This sounds a lot like what intake does. You might get more mileage out of first refactoring around intake. Then you would be able to outsource all of the file loading stuff. The pangeo intake catalog for example contains both multi-netcdf file datasets and zarr datasets. The user doesn't ever have to care what the underlying driver is.
Good point; I'll open a separate issue for us to discuss this. It's been on our radar for a while, but we didn't have a compelling reason to switch so far. But now that intake is getting more and more adoption, including through pangeo (intake-cmip5...very cool!), perhaps that's no longer the case.

So all that said, I think @spencerkclark's idea of starting with a simple `ZarrDataLoader` as a proof-of-concept is the best way to proceed.
Zarr is becoming the format of choice for N-D data on the cloud, with heavy usage in e.g. pangeo. @spencerkclark also has found some compelling use cases for it over netCDF on the GFDL analysis cluster, i.e. not on the cloud. And xarray has a very clean interface for zarr IO: `Dataset.to_zarr` and `xr.open_zarr`.

As such, and based on offline conversations with @rabernat and @spencerkclark regarding using aospy within pangeo, I think it makes sense for aospy to provide zarr support. So it remains to decide how to do that.
The `open_zarr` method returns a Dataset object just as `open_mfdataset` and `open_dataset` do. So it's really purely a matter of I/O: once a Dataset is loaded from a zarr file, we can proceed with the rest of our pipeline just as if it were a netCDF file (hooray for clean interfaces!)

For input, then, insofar as users save their zarr files with the `.zarr` extension, we can simply use that to choose which method we use to load.

There are most likely some additional complications I haven't thought of yet, but this seems like a reasonable approach at this stage. And/or we could allow the users to specify the filetype via a flag rather than relying on the extension. I've thought less about output, but my impression is a similar approach would do the trick.
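The extension-based dispatch described above could look something like the following sketch. `open_any` and `is_zarr_store` are hypothetical helpers, not part of aospy or xarray.

```python
def is_zarr_store(path):
    """Guess from the extension whether a path points at a zarr store."""
    return str(path).endswith('.zarr')


def open_any(path, **kwargs):
    """Hypothetical dispatcher: route .zarr paths to open_zarr and
    everything else to open_mfdataset."""
    import xarray as xr  # imported lazily so the extension check is standalone
    if is_zarr_store(path):
        return xr.open_zarr(path, **kwargs)
    return xr.open_mfdataset(path, **kwargs)
```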
@spencerkclark any initial thoughts? And CCing @rabernat for any thoughts in case I'm leading us astray here.