pangeo-forge / pangeo-forge-recipes

Python library for building Pangeo Forge recipes.
https://pangeo-forge.readthedocs.io/
Apache License 2.0

Support concat and merge dimensions with `HDFReferenceRecipe` #277

Open TomAugspurger opened 2 years ago

TomAugspurger commented 2 years ago

Currently, `HDFReferenceRecipe` doesn't work properly when a concat dimension is combined with a merge dimension (i.e. concatenating along time while also merging multiple variables).

I suspect that this is blocked by the upstream issue in Kerchunk: https://github.com/fsspec/kerchunk/issues/106#issuecomment-987962026. Filing this to make sure we check back here when unblocked.

Here's the code for the recipe (it runs in ~8s in the West Europe Azure region):

```python
import datetime
import tempfile

import adlfs
import planetary_computer
import pangeo_forge_recipes.patterns
import pangeo_forge_recipes.recipes
from fsspec.implementations.local import LocalFileSystem
from pangeo_forge_recipes.storage import FSSpecTarget, MetadataTarget

fs = adlfs.AzureBlobFileSystem("ukmeteuwest")
credential = planetary_computer.sas.get_token("ukmeteuwest", "ukcp18").token
fs = adlfs.AzureBlobFileSystem("ukmeteuwest", credential=credential)

files = [x for x in fs.find("ukcp18/") if x.endswith(".nc")]
daily_files = [x for x in files if "/day/" in x]

# limit to 3 for testing
time_ranges = [f.split("_")[-1].split(".")[0] for f in daily_files][:3]
time_ranges = {i: r for i, r in enumerate(dict.fromkeys(time_ranges))}


def make_path(variable, time):
    return f"az://ukcp18/badc/ukcp18/data/land-gcm/global/60km/rcp26/01/{variable}/day/v20200302/{variable}_rcp26_land-gcm_global_60km_01_day_{time_ranges[time]}.nc"


# variables = ['clt', 'hurs', 'huss', 'pr', 'psl', 'rls', 'rss', 'tas', 'tasmax', 'tasmin']
# limit to 2 for testing
variables = ["clt", "hurs"]

variable_merge_dim = pangeo_forge_recipes.patterns.MergeDim("variable", variables)
time_concat_dim = pangeo_forge_recipes.patterns.ConcatDim("time", list(time_ranges))
pattern = pangeo_forge_recipes.patterns.FilePattern(make_path, variable_merge_dim, time_concat_dim)

recipe = pangeo_forge_recipes.recipes.reference_hdf_zarr.HDFReferenceRecipe(
    pattern,
    netcdf_storage_options=dict(account_name="ukmeteuwest", credential=credential),
    template_count=1,
)

fs_local = LocalFileSystem()
cache_dir = tempfile.TemporaryDirectory()
metadata_cache = MetadataTarget(fs_local, cache_dir.name)
target_dir = tempfile.TemporaryDirectory()
target = FSSpecTarget(fs_local, target_dir.name)

recipe.target = target
recipe.metadata_cache = metadata_cache

recipe.to_dask().compute()
```
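(As a side note, it can help to confirm what the pattern expands to before running the recipe. A minimal sketch using the standard `FilePattern.items()` iteration; the slice to two entries is just for brevity:)

```python
# Print a couple of (index, url) pairs generated by the FilePattern,
# to sanity-check how the merge and concat dimensions expand.
for index, url in list(pattern.items())[:2]:
    print(index, url)
```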

And here's the code to load it and compare against the expected output from `xarray.open_mfdataset`:

```python
import copy
import itertools
import json
import pathlib

import fsspec
import planetary_computer
import xarray as xr

refs = json.loads(pathlib.Path(target._full_path("reference.json")).read_text())
# Workaround a separate issue
refs["refs"] = {k: v for k, v in refs["refs"].items() if "yyymmdd" not in k}

remote_options = dict(account_name="ukmeteuwest", credential=credential)
store = fsspec.filesystem("reference", fo=refs, remote_options=remote_options).get_mapper("")
ds = xr.open_dataset(store, engine="zarr", chunks={}, consolidated=False, decode_times=False)

# compare against the expected
clt_files = [x for x in daily_files if "/clt/" in x][:3]
ds2 = xr.open_mfdataset(
    [fs.open(f) for f in clt_files], concat_dim="time", combine="nested", join="exact"
)
ds2.clt
```

Checking the output:

```python
>>> assert ds.clt.shape == ds2.clt.shape, (ds.clt.shape, "!=", ds2.clt.shape)
AssertionError                            Traceback (most recent call last)
Input In [28], in
----> 1 assert ds.clt.shape == ds2.clt.shape, (ds.clt.shape, "!=", ds2.clt.shape)

AssertionError: ((1, 21600, 324, 432), '!=', (1, 10800, 324, 432))
```

It seems that only the first variable, `clt`, is included; the second, `hurs`, is not present in the store / Dataset. Additionally, the one variable that is present is twice as long in the time dimension as it should be.

Possibly related, but the offsets discovered by kerchunk don't seem to be correct. I think they're loading the whole file, rather than a single chunk, so `ds.clt[0, 0].compute()` is slow. That could be user error though.
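(One way to check whether the recorded byte ranges look like per-chunk offsets or whole-file reads is to inspect the reference entries directly. A rough sketch, assuming the standard kerchunk reference format where non-inline values are `[url, offset, length]` triples:)

```python
# Look at the reference entries for the clt variable: inline values are stored
# as strings, while binary chunks are [url, offset, length] triples.
for key, value in refs["refs"].items():
    if key.startswith("clt/") and isinstance(value, list) and len(value) == 3:
        url, offset, length = value
        print(key, offset, length)
```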
rabernat commented 2 years ago

Thanks for posting Tom. This is indeed an important and very common use case we need to support.

martindurant commented 2 years ago

https://github.com/fsspec/kerchunk/pull/122 allows merging variables in kerchunk at the same time as concatenating dimensions, but I'm not certain you need it here.
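(For reference, the combine step would look roughly like the following. This is only a sketch against the post-refactor `MultiZarrToZarr` API; the keyword names `concat_dims`/`identical_dims`, the `single_refs` input list, and the coordinate names are assumptions and may not match the PR branch exactly.)

```python
from kerchunk.combine import MultiZarrToZarr

# single_refs: a list of per-file reference dicts, e.g. from kerchunk.hdf.SingleHdf5ToZarr,
# covering both variables and all time ranges.
mzz = MultiZarrToZarr(
    single_refs,
    remote_protocol="az",
    remote_options=dict(account_name="ukmeteuwest", credential=credential),
    concat_dims=["time"],                        # concatenate along time
    identical_dims=["latitude", "longitude"],    # assumed coordinate names
)
combined = mzz.translate()  # combined reference dict for fsspec's "reference" filesystem
```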

TomAugspurger commented 2 years ago

[Martin] https://github.com/fsspec/kerchunk/pull/122 allows merging variables in kerchunk at the same time as concatenating dimensions

Wasn't sure if this was a request to try it out or not, but an initial test using that branch failed because `MultiZarrToZarr` no longer accepts `xarray_open_kwargs`. I didn't investigate any further.


[Tom] Possibly related, but the offsets discovered by kerchunk don't seem to be correct. I think they're loading the whole file, rather than a single chunk, so ds.clt[0, 0].compute() is slow. That could be user error though.

This is looking more like user error / misunderstanding. It turns out that this dataset isn't chunked:

```python
>>> import h5py
>>> f = h5py.File(fsspec.open(list(pattern.items())[0][1], **remote_options).open())
>>> clt = f["clt"]
>>> clt.chunks  # None
```

And yet somehow some combination of xarray / fsspec / h5netcdf / h5py is smart enough to not request all the data when you do a slice:

```python
>>> %time clt[0, 0]
CPU times: user 16.4 ms, sys: 5.05 ms, total: 21.5 ms
Wall time: 64.6 ms
```

The full dataset is ~2 GiB, which takes closer to 24s to read. We might want a separate feature for splitting chunks (that should be doable if we know the original buffer is contiguous), but that's unrelated to this issue.
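(As a rough illustration of why splitting looks feasible for contiguous layouts: h5py can report the byte offset and storage size of an unchunked dataset's buffer, from which per-slice byte ranges along the first axis follow directly. A sketch only, assuming an uncompressed, contiguous, row-major dataset; not how kerchunk currently works:)

```python
import numpy as np

# For a contiguous (unchunked, uncompressed) HDF5 dataset, h5py exposes where
# its data buffer lives in the file and how many bytes it occupies.
offset = clt.id.get_offset()        # byte offset of the buffer within the file
size = clt.id.get_storage_size()    # total bytes occupied by the buffer

# Bytes per slice along the first (time) axis, assuming C (row-major) order.
bytes_per_slice = int(np.prod(clt.shape[1:])) * clt.dtype.itemsize

# Byte range of time step i, which could become one "virtual" chunk reference.
i = 0
start = offset + i * bytes_per_slice
length = bytes_per_slice
assert start + length <= offset + size
print(start, length)
```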

martindurant commented 2 years ago

[Tom] Wasn't sure if this was a request to try it out or not, but an initial test using that branch failed because `MultiZarrToZarr` no longer accepts `xarray_open_kwargs`. I didn't investigate any further.

Yes, the signature changed a lot (sorry) and the code no longer uses xarray at all, in favour of direct JSON mangling. The docstring is up to date, though, and I think I know the specific things that don't work; they're listed in the description of the PR.

[Tom] We might want a separate feature for splitting chunks (that should be doable if we know the original buffer is contiguous), but that's unrelated to this issue.

Certainly on the cards, and this is actively being discussed for FITS, where uncompressed single chunks are common, but I didn't think that would be the case for HDF5.

TomAugspurger commented 2 years ago

👍, opened https://github.com/fsspec/kerchunk/issues/124 for that.