Open okz opened 1 year ago
I wonder, does it work to phrase the URL as:
[f"reference://::{u}" for u in urls]
?
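For what it's worth, the comprehension takes no comma before the `for`; a minimal sketch with hypothetical file names:

```python
# Build the list of chained reference:// URLs (file names are hypothetical).
urls = ["ref1.json", "ref2.json"]
paths = [f"reference://::{u}" for u in urls]  # note: no comma before `for`
print(paths)  # ['reference://::ref1.json', 'reference://::ref2.json']
```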
By the way, xarray typically still does have to do a certain amount of work in such a case, so you might want to use kerchunk.combine.MultiZarrToZarr to create a single reference set across all the inputs, so that you don't need open_mfdataset at all.
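For the regular-chunking case, the combining step might look roughly like this (a sketch only: the file names, concat_dims, and S3 options are assumptions, not from this thread):

```python
def combine_references(ref_paths):
    """Fold several kerchunk reference JSONs into one reference set (sketch)."""
    # Deferred import so this sketch can be defined without kerchunk installed.
    from kerchunk.combine import MultiZarrToZarr

    mzz = MultiZarrToZarr(
        ref_paths,                      # e.g. ["ref1.json", "ref2.json"]
        remote_protocol="s3",           # where the referenced chunk data lives
        remote_options={"anon": True},
        concat_dims=["time"],           # hypothetical dimension to concatenate on
    )
    return mzz.translate()              # a single reference dict, ready for json.dump
```

As discussed below, this only helps when chunking is regular across the input files.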
Even providing reference

[f"reference://::{u}" for u in urls]

and fo hardcoded as lists, the zarr intake plugin fails; I don't think xzarr.ZarrSource makes an attempt to accept multiple JSONs in fo, which has a use case:

AttributeError: 'list' object has no attribute 'get'
kerchunk.combine.MultiZarrToZarr
That was the initial goal, but right now MultiZarrToZarr only supports regular chunking between files. The data has many dimensions and most are not chunked regularly. There isn't a way around it, as far as I know?
PS: I almost gave up on using kerchunk; the open_mfdataset approach, although not the best, was a lifesaver. Maybe it should be documented somewhere? It's still several times faster than opening the netCDF files directly.
It ought not to be too complex to fold this into intake-xarray. We do try to stay close to what xarray itself offers, so one could argue that if open_mfdataset accepts a list of URLs or paths, it should allow a list of storage_options per path too; then everyone gets this kind of workflow, not just intake users.
The data has many dimensions and most are not chunked regularly. There isn't a way around it as far as I know?
We require ZEP003 in zarr. Please ping the discussion and this draft implementation: https://github.com/zarr-developers/zarr-python/pull/1483
I just came across this issue as I was searching for an option to merge two datasets originating from two kerchunk reference datasets with different chunk sizes.
I tested the workflow with xr.open_mfdataset and can confirm that the URL chaining with several fo works!
import xarray as xr

xr.open_mfdataset(
    ['reference://::ref1.json', 'reference://::ref2.json'],
    engine='zarr',
    storage_options={'remote_protocol': 's3', 'remote_options': {'anon': True}},
)
and it also works with intake:
sources:
  some_dataset:
    driver: zarr
    args:
      urlpath:
        - reference://::ref1.json
        - reference://::ref2.json
      storage_options:
        remote_protocol: s3
        remote_options:
          anon: true
Here is a working example:
import intake
cat = intake.open_catalog("https://github.com/ISSI-CONSTRAIN/isccp/raw/main/catalog.yaml")
cat['ISCCP_BASIC_HGH'].to_dask()
Standard intake plugins seem to support glob (*) or list urlpaths to consume multiple files with open_mfdataset. This approach isn't suitable for the intake_xarray.xzarr.ZarrSource plugin, since it expects urlpath: "reference://" and uses storage_options::fo to load the sideload file. Ideally, the catalog fo should be able to accept glob paths?

More details:
Having many netcdf files with variable dimensions, we hit the "irregular chunk size between files issue" trying to use kerchunk.
So instead of combining the netcdf files into a single sideload JSON, we created a sideload .json for each netcdf file and let xarray take care of the merge. For our datasets this was good enough, and it made working with several months of remote data possible.
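The one-reference-per-file step described above might be sketched with kerchunk's SingleHdf5ToZarr (a sketch under assumptions: the file URLs and anonymous S3 access are hypothetical):

```python
import json

def write_reference(netcdf_url, out_json):
    """Write one kerchunk reference JSON for a single netCDF/HDF5 file (sketch)."""
    # Deferred imports so the sketch can be defined without kerchunk/fsspec installed.
    import fsspec
    from kerchunk.hdf import SingleHdf5ToZarr

    with fsspec.open(netcdf_url, mode="rb", anon=True) as f:
        refs = SingleHdf5ToZarr(f, netcdf_url).translate()
    with open(out_json, "w") as out:
        json.dump(refs, out)
```

xarray then merges the resulting per-file datasets when they are opened together.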
Using xarray's open_mfdataset directly, it was possible to use multiple JSONs, e.g.:

It would have been nice to get rid of this code and use an intake catalog instead.