Open okz opened 1 year ago
I wonder, does it work to phrase the URL as:
[f"reference://::{u}" for u in urls]
?
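For what it's worth, the comprehension takes no comma before the `for`; a minimal sketch with hypothetical file names:

```python
# Build the list of chained reference:// URLs (file names are hypothetical).
urls = ["ref1.json", "ref2.json"]
paths = [f"reference://::{u}" for u in urls]  # note: no comma before `for`
print(paths)  # ['reference://::ref1.json', 'reference://::ref2.json']
```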
By the way, xarray typically still does have to do a certain amount of work in such a case, so you might want to use kerchunk.combine.MultiZarrToZarr to create a single reference set across all the inputs, so that you don't need open_mfdataset at all.
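For the regular-chunking case, the combining step might look roughly like this (a sketch only: the file names, concat_dims, and S3 options are assumptions, not from this thread):

```python
def combine_references(ref_paths):
    """Fold several kerchunk reference JSONs into one reference set (sketch)."""
    # Deferred import so this sketch can be defined without kerchunk installed.
    from kerchunk.combine import MultiZarrToZarr

    mzz = MultiZarrToZarr(
        ref_paths,                      # e.g. ["ref1.json", "ref2.json"]
        remote_protocol="s3",           # where the referenced chunk data lives
        remote_options={"anon": True},
        concat_dims=["time"],           # hypothetical dimension to concatenate on
    )
    return mzz.translate()              # a single reference dict, ready for json.dump
```

As discussed below, this only helps when chunking is regular across the input files.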
Even providing reference

[f"reference://::{u}" for u in urls]

and fo hardcoded as lists, the zarr intake plugin fails; I don't think xzarr.ZarrSource makes an attempt to accept multiple JSONs in fo, which has a use case:

AttributeError: 'list' object has no attribute 'get'
kerchunk.combine.MultiZarrToZarr
That was the initial goal, but right now MultiZarrToZarr only supports regular chunking between files. The data has many dimensions and most are not chunked regularly. There isn't a way around it, as far as I know?
PS: I almost gave up on using kerchunk; the open_mfdataset approach, although not the best, was a lifesaver. Maybe it should be documented somewhere? It's still several times faster than opening the netCDF files directly.
It ought not to be too complex to fold this into intake-xarray. We do try to stay close to what xarray itself offers, so one could argue that if open_mfdataset accepts a list of URLs or paths, it should allow a list of storage_options per path too; then everyone gets this kind of workflow, not just intake users.
The data has many dimensions and most are not chunked regularly. There isn't a way around it as far as I know?
We require ZEP003 in zarr. Please ping the discussion and this draft implementation: https://github.com/zarr-developers/zarr-python/pull/1483
I just came across this issue as I was searching for an option to merge two datasets originating from two kerchunk reference datasets with different chunk sizes.
I tested the workflow with xr.open_mfdataset and can confirm that the URL chaining with several fo works!
import xarray as xr

xr.open_mfdataset(
    ['reference://::ref1.json', 'reference://::ref2.json'],
    engine='zarr',
    storage_options={'remote_protocol': 's3', 'remote_options': {'anon': True}},
)
and it also works with intake:
sources:
  some_dataset:
    driver: zarr
    args:
      urlpath:
        - reference://::ref1.json
        - reference://::ref2.json
      storage_options:
        remote_protocol: s3
        remote_options:
          anon: true
Here is a working example:
import intake
cat = intake.open_catalog("https://github.com/ISSI-CONSTRAIN/isccp/raw/main/catalog.yaml")
cat['ISCCP_BASIC_HGH'].to_dask()
Standard intake plugins seem to support glob (*) or list urlpaths to consume multiple files with open_mfdataset. This approach isn't suitable for the intake_xarray.xzarr.ZarrSource plugin, since it expects urlpath: "reference://" and uses storage_options::fo to load the sideload file. Ideally, the catalog fo should be able to accept glob paths?

More details:
Having many netcdf files with variable dimensions, we hit the "irregular chunk size between files issue" trying to use kerchunk.
So instead of combining the netcdf files into a single sideload JSON, we created a sideload .json for each netcdf file and let xarray take care of the merge. For our datasets this was good enough, and it made working with several months of remote data possible.
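The one-reference-per-file step described above might be sketched with kerchunk's SingleHdf5ToZarr (a sketch under assumptions: the file URLs and anonymous S3 access are hypothetical):

```python
import json

def write_reference(netcdf_url, out_json):
    """Write one kerchunk reference JSON for a single netCDF/HDF5 file (sketch)."""
    # Deferred imports so the sketch can be defined without kerchunk/fsspec installed.
    import fsspec
    from kerchunk.hdf import SingleHdf5ToZarr

    with fsspec.open(netcdf_url, mode="rb", anon=True) as f:
        refs = SingleHdf5ToZarr(f, netcdf_url).translate()
    with open(out_json, "w") as out:
        json.dump(refs, out)
```

xarray then merges the resulting per-file datasets when they are opened together.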
Using xarray's open_mfdataset directly, it was possible to use multiple JSONs, e.g.:

It would have been nice to get rid of this code and use an intake catalog instead.