intake / intake-xarray

Intake plugin for xarray
https://intake-xarray.readthedocs.io/
BSD 2-Clause "Simplified" License
74 stars 36 forks source link

xarray.open_zarr to be deprecated #70

Closed Mikejmnez closed 1 year ago

Mikejmnez commented 4 years ago

In the future, zarr files/stores will be opened with xarray.open_dataset as follows

ds = xarray.open_dataset(store, engine="zarr", ...)

Thus, (eventually) there needs to be a change on how intake-xarray will open zarr stores. Naively, this can be done by specifying open_dataset rather than open_zarr. This is, something like:

store = 'directoryA/subdirectory/zarr_store
self._mapper = fsspec.get_mapper(store, ...) 
_open = xr.open_dataset
self._ds = open(self._mapper, engine="zarr", ...)

However, xarray.open_dataset does not recognize output from fsspec.get_mapper. It works if store (as defined above) is passed.

On xarray.open_zarr, _mapper gets transformed into a ZarrStore and later decoded. This is, given the _mapper, the following will open the zarr store:

from xarray.backends.zarr import ZarrStore
from xarray import conventions
zarr_store = ZarrStore.open_group(_mapper, ...)
ds = conventions.decode_cf(zarr_store,...)

This brings two options IMHO:

  1. Drop using fsspec.get_mapper (not likely) and just pass the url/path as argument to xarray.open_dataset. (very unlikely), or
  2. Follow along the lines of the pseudo code above, and rather than import/use xarray.open_dataset directly, import ZarrStore.open_group and convenctions.decode_cf to open zarr stores.

It is my understanding, that zarr will potentially depend more on fsspec as it gets more developed, and thus No. 2 seems more likely.

Or, is there another secret option number 3 I fail to see?

https://github.com/pydata/xarray/pull/4003 https://github.com/intake/filesystem_spec/issues/286 https://github.com/zarr-developers/zarr-python/pull/546

martindurant commented 4 years ago

https://github.com/zarr-developers/zarr-python/pull/546 will add the get_mapper route into zarr itself, so that you can pass URL and storage_options. This may take a while to merge and release, though.

However, I am surprised: I would think that with engine=zarr, xarray would pass through the mapper object to zarr, which knows of course what to do with it.

Is the CF decode the only step that xarray is adding?

Mikejmnez commented 4 years ago

Good to know about the zarr development route. I can't wait to see the new behavior, although I understand the timeline...

Here is a better description of the problem with fs.get_mapper on intake-xarray

store = 'directoryA/subdirB/zarr_store'
_mapper = fsspec.get_mapper(store)
ds = xarray.open_dataset(_mapper, chunks="auto",...)

raises the error:

ValueError: can only read bytes of file-like objects with engine="scipy" or "h5netcdf"

Now, engine="zarr" is in the list of engines, the problem again is with _mapper within xarray.open_dataset. In xarray, the instance file_or_obj=_mapper needs to be recognized as either str, Path or AbstractDatastore in order for the open_dataset arguments to be passed to ZarrStore.open_group. This is where it fails. That is, all of

In []: isinstance(_mapper, str)
out []: False
In []: isinstance(_mapper, Path)
Out []: False
In []: isinstance(_mapper, AbstractDatastore)
Out []: False

At least one of the above needs to be True, or the ValueError above get triggered.

martindurant commented 4 years ago

Right - I would think that when engine="zarr", there should be an additional case isinstance(obj, MutableMapping) after the datastore check (which is also, presumably, dict-like).

Mikejmnez commented 4 years ago

Yes, that does it. That is

In []: from collections.abc import MutableMapping
In []: isinstance(mapper, MutableMapping)
Out []: True

Thanks @martindurant !

Mikejmnez commented 4 years ago

With this sorted (and working), should I submit a PR to set Intake-xarray to use open_dataset when opening zarr stores (i have a working fork)?

martindurant commented 4 years ago

Presumably yes, after xarray is released

weiji14 commented 3 years ago

FYI, https://github.com/pydata/xarray/pull/4187 was merged (a continuation of https://github.com/pydata/xarray/pull/4003) and should be available for the next release of xarray (0.16.2?). Note that xr.open_zarr function was not deprecated, but is kept around as a short alias for xr.open_dataset(store, engine="zarr").

Still, it might be preferable to switch intake-xarray to the new syntax, as multiple Zarr files can now be opened with xarray using xr.open_mfdataset(stores, engine="zarr"). I see that https://github.com/zarr-developers/zarr-python/pull/546 was merged recently too, so someone with time might want to try and see how these different pieces can be fit together to improve intake-xarray :wink:.

martindurant commented 3 years ago

Right, so I believe xr.open_dataset(URL, engine="zarr") should now work, and we should make xr.open_mfdataset(URLs, engine="zarr") work (where URLs can be a list, or a glob-string). Then theres no reason to be creating mappers in intake-xarray any more; as you say, it just needs a little effort to plumb through things like storage_options.

Mikejmnez commented 3 years ago

Awesome, I can take a look in the next couple of days and see what I can do. It shouldn't be much trouble, I hope. Thanks a lot @weiji14!