intake / intake-xarray

Intake plugin for xarray
https://intake-xarray.readthedocs.io/
BSD 2-Clause "Simplified" License

Support for JSON metadata workflow #124

Closed b-pos465 closed 2 years ago

b-pos465 commented 2 years ago

Use Case

I am trying to access NetCDF4 data via JSON metadata with intake-xarray. This approach is based on this blog post by lsterzinger. I am trying to make the data access as convenient as possible. The ideal solution for me with the existing API would look like this:

# using open_netcdf
json_source = intake.open_netcdf('/home/jovyan/work/output/s3/combine.json', xarray_kwargs={'engine': 'zarr', 'consolidated': False})

# using open_zarr
json_source = intake.open_zarr('/home/jovyan/work/output/s3/combine.json', consolidated=False)

When testing this approach I get the following error:

...

File /opt/conda/lib/python3.9/site-packages/zarr/hierarchy.py:1057, in _normalize_store_arg(store, storage_options, mode)
   1055 if store is None:
   1056     return MemoryStore()
-> 1057 return normalize_store_arg(store,
   1058                            storage_options=storage_options, mode=mode)

File /opt/conda/lib/python3.9/site-packages/zarr/storage.py:123, in normalize_store_arg(store, storage_options, mode)
    121         return N5Store(store)
    122     else:
--> 123         return DirectoryStore(store)
    124 else:
    125     if not isinstance(store, BaseStore) and isinstance(store, MutableMapping):

File /opt/conda/lib/python3.9/site-packages/zarr/storage.py:844, in DirectoryStore.__init__(self, path, normalize_keys, dimension_separator)
    842 path = os.path.abspath(path)
    843 if os.path.exists(path) and not os.path.isdir(path):
--> 844     raise FSPathExistNotDir(path)
    846 self.path = path
    847 self.normalize_keys = normalize_keys

FSPathExistNotDir: path exists but is not a directory: %r

The approach from the blog post uses an FSMap. So I tried the following:

fs = fsspec.filesystem(
    "reference", 
    fo="/home/jovyan/work/output/s3/combine.json", 
    remote_protocol="file",
    skip_instance_cache=True
)
m = fs.get_mapper("")

json_source = intake.open_zarr(m, engine='zarr')
json_source.discover()

This one works, but it somewhat misses the point of Intake, since the user has to know the fsspec API to create a working FSMap.

Suggestion

Version 1

I would like to implement an extra case for the open_zarr method to support the JSON workflow introduced in the blog post mentioned above.

Version 2

I could also imagine an extra method for the JSON workflow, something like intake.open_zarr_metadata('combine.json').

Questions

  1. Which approach would you prefer?

  2. While looking through existing issues I found #70. If I understand correctly, you removed the fsspec mapper in 2020 because it was no longer needed. Is there another solution for bringing the JSON workflow to intake-xarray that I overlooked?

  3. Unfortunately, my Python knowledge is limited, so I have no idea how to test a modified version of intake-xarray. I found https://intake-xarray.readthedocs.io/en/latest/contributing.html#id9 for running tests. But how can I test a modified version of intake-xarray with Intake locally? It would be great to have this in the docs!
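Regarding question 3, a common workflow (a sketch, not project-specific instructions: it assumes a local clone and that `pytest` is the test runner, as the contributing guide indicates) is an "editable" install, so that your checked-out copy is the one Python imports:

```shell
# Clone the repository and install it in editable mode; edits to the
# checkout are then picked up immediately by `import intake_xarray`
# (and therefore by Intake's plugin discovery).
git clone https://github.com/intake/intake-xarray.git
cd intake-xarray
pip install -e .
pip install pytest   # test runner, if not already installed

# Run the test suite against the modified copy.
pytest -v
```

After this, any `intake.open_zarr(...)` call in a local session exercises the modified plugin code.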

martindurant commented 2 years ago

This does already work, but the invocation via intake-xarray (or xarray open_dataset directly) is complex. Actually, intake-xarray is great exactly because it hides this complexity from the user once you've figured it out. Your call should look something like

source = intake.open_zarr(
    "reference://",
    storage_options={
        "fo": '/home/jovyan/work/output/s3/combine.json',
        "remote_protocol": "...",  # e.g., "s3", "http", ...
        "remote_options": {...}  # anything needed to configure that remote filesystem
    },
    consolidated=False
)

And yes, open_netcdf essentially does the same thing, except that you specify the engine, and all those arguments get nested inside a "backend_kwargs".
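To make that nesting concrete, here is a sketch of what the `open_netcdf` equivalent of the call above might look like (the path is taken from the thread; the `"s3"` protocol is an illustrative placeholder, and the call itself is left commented since it needs the actual reference file):

```python
# Sketch: the reference-filesystem options from the open_zarr example,
# nested under backend_kwargs as open_netcdf expects them.
xarray_kwargs = {
    "engine": "zarr",
    "backend_kwargs": {
        "consolidated": False,
        "storage_options": {
            "fo": "/home/jovyan/work/output/s3/combine.json",  # kerchunk JSON
            "remote_protocol": "s3",  # placeholder; use your actual protocol
        },
    },
}

# source = intake.open_netcdf("reference://", xarray_kwargs=xarray_kwargs)
# ds = source.to_dask()
```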

martindurant commented 2 years ago

If you succeed in generating an interesting dataset and would like to share in public, the kerchunk project would like to know about it!

b-pos465 commented 2 years ago

Thank you for your help! Your approach works perfectly fine.

I was able to generate a YAML file from the source above and load it back in.
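For anyone following along, a catalog entry for such a source looks roughly like this (a hand-written sketch based on the call above; the source name `combine` and the `s3` protocol are illustrative):

```yaml
sources:
  combine:
    driver: zarr
    args:
      urlpath: "reference://"
      storage_options:
        fo: "/home/jovyan/work/output/s3/combine.json"
        remote_protocol: s3
      consolidated: false
```

Loading it back with `intake.open_catalog("catalog.yml").combine` then recreates the source.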

Actually, I am not working on a dataset but on a web-based tool for migrating NetCDF4 data to Zarr. It supports both an actual conversion and the JSON metadata workflow mentioned above. Right now I am working on the Intake integration for the JSON metadata. Here is a link to the repository: https://github.com/climate-v/nc2zarr-webapp

martindurant commented 2 years ago

Are you aware of https://pangeo-forge.org/ ?