TomAugspurger opened this issue 2 years ago
Thanks for posting Tom. This is indeed an important and very common use case we need to support.
https://github.com/fsspec/kerchunk/pull/122 allows merging variables in kerchunk at the same time as concatenating dimensions, but I'm not certain you need it here.
[Martin] https://github.com/fsspec/kerchunk/pull/122 allows merging variables in kerchunk at the same time as concatenating dimensions
Wasn't sure if this was a request to try it out or not, but an initial test using that branch failed with MultiZarrToZarr not taking xarray_open_kwargs anymore. I didn't investigate any further.
[Tom] Possibly related, but the offsets discovered by kerchunk don't seem to be correct. I think they're loading the whole file, rather than a single chunk, so ds.clt[0, 0].compute() is slow. That could be user error though.
This is looking more like user error / misunderstanding. It turns out that this dataset isn't chunked:
>>> import fsspec
>>> import h5py
>>> f = h5py.File(fsspec.open(list(pattern.items())[0][1], **remote_options).open())
>>> clt = f["clt"]
>>> clt.chunks  # None
And yet somehow some combination of xarray / fsspec / h5netcdf / h5py are smart enough to not request all the data when you do a slice:
%time clt[0, 0]
CPU times: user 16.4 ms, sys: 5.05 ms, total: 21.5 ms
Wall time: 64.6 ms
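The reason a slice can be cheap here: for a contiguous (unchunked) C-ordered dataset, h5py can translate a leading-axis slice into a single byte range, so fsspec only has to request that range instead of the whole file. A minimal sketch of the offset arithmetic (the helper function and the example shape are hypothetical, not h5py API):

```python
import numpy as np

def contiguous_byte_range(shape, dtype, index, base_offset=0):
    """Hypothetical helper: byte range covering data[index] for a
    C-contiguous array stored at base_offset, where `index` fixes the
    leading axes and the trailing axes are read whole."""
    itemsize = np.dtype(dtype).itemsize
    # number of elements in the trailing block selected by `index`
    tail = int(np.prod(shape[len(index):], dtype=np.int64))
    # flat element offset of the first selected element
    full_index = index + (0,) * (len(shape) - len(index))
    flat = int(np.ravel_multi_index(full_index, shape))
    start = base_offset + flat * itemsize
    return start, start + tail * itemsize

# e.g. a float32 (time, lat, lon) variable: slicing [0, 0] touches only
# one lon-row's worth of bytes, not the whole multi-GiB file
contiguous_byte_range((1980, 192, 288), "float32", (0, 0))
```

The same arithmetic only works because the buffer is contiguous; once the dataset is chunked (or compressed), each chunk has its own offset in the file and must be looked up individually.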
The full dataset is ~2GiB, which takes closer to 24s to read. We might want a separate feature for splitting chunks (should be doable, if we know the original buffer is contiguous?) but that's unrelated to this issue.
Wasn't sure if this was a request to try it out or not, but an initial test using that branch failed with MultiZarrToZarr not taking xarray_open_kwargs anymore. I didn't investigate any further.
Yes, the signature changed a lot (sorry) and the code no longer uses xarray at all, in favour of direct JSON mangling. The docstring is up to date, though, and I think I know the specific things that don't work, listed in the description of the PR.
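For reference, a hedged sketch of what a call against the new signature looks like; the file names, protocol, and dimension names below are placeholders, not taken from this issue:

```python
# Sketch of the post-#122 kerchunk API (no xarray involved); paths and
# dim names here are illustrative placeholders.
from kerchunk.combine import MultiZarrToZarr

mzz = MultiZarrToZarr(
    ["refs_0.json", "refs_1.json"],   # per-file reference sets
    remote_protocol="az",
    remote_options=remote_options,
    concat_dims=["time"],             # dimension(s) to concatenate along
    identical_dims=["lat", "lon"],    # coords assumed identical across files
)
combined = mzz.translate()            # plain dict of merged references
```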
We might want a separate feature for splitting chunks (should be doable, if we know the original buffer is contiguous?) but that's unrelated to this issue.
Certainly on the cards, and this is actively discussed for FITS, where uncompressed single chunks are common, but I didn't think that would be the case for HDF5.
👍, opened https://github.com/fsspec/kerchunk/issues/124 for that.
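Since an unchunked HDF5 dataset is one contiguous buffer, splitting it into smaller kerchunk chunks is just byte arithmetic, at least when the split runs along the first axis so every piece stays contiguous. A hypothetical sketch (the function name and reference layout are illustrative, not kerchunk API):

```python
from math import prod

def split_first_axis(url, base_offset, shape, chunk0, itemsize, varname):
    """Hypothetical sketch: split one contiguous C-ordered buffer into
    kerchunk-style references, chunking along the first axis only --
    the one case where each sub-chunk is still a contiguous byte range."""
    row = prod(shape[1:]) * itemsize            # bytes per index along axis 0
    refs = {}
    for i, start in enumerate(range(0, shape[0], chunk0)):
        n = min(chunk0, shape[0] - start)       # last chunk may be short
        key = f"{varname}/{i}" + ".0" * (len(shape) - 1)
        refs[key] = [url, base_offset + start * row, n * row]
    return refs

# split a hypothetical (1980, 192, 288) float32 variable into 100-step chunks
split_first_axis("az://container/file.nc", 4096, (1980, 192, 288), 100, 4, "clt")
```

Splitting along inner axes would instead produce many scattered byte ranges per chunk, which the reference format can't express as a single entry.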
Currently, HDFReferenceRecipe doesn't work properly with a concat dim while also merging multiple variables. I suspect this is blocked by the upstream issue in kerchunk: https://github.com/fsspec/kerchunk/issues/106#issuecomment-987962026. Filing this to make sure we check back here when unblocked.
Here's the code for the recipe (runs in ~8s in the West Europe Azure region)
And here's the code to load it, alongside the expected output from xarray.open_mfdataset.
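For comparison, the usual pattern for opening a combined reference set looks roughly like this; the filenames, protocol, and credentials below are placeholders, not the exact recipe output from this issue:

```python
# Placeholder filenames/protocol; this mirrors the common kerchunk
# loading pattern rather than this issue's exact recipe output.
import fsspec
import xarray as xr

fs = fsspec.filesystem(
    "reference",
    fo="combined.json",               # output of MultiZarrToZarr.translate()
    remote_protocol="az",
    remote_options=remote_options,
)
ds = xr.open_dataset(fs.get_mapper(""), engine="zarr", consolidated=False)

# expected output, read the direct (slower) way
expected = xr.open_mfdataset(urls, engine="h5netcdf")
```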