pangeo-data / rechunker

Disk-to-disk chunk transformation for chunked arrays.
https://rechunker.readthedocs.io/
MIT License

Rechunking does not produce .zmetadata #114

Closed: trondactea closed this issue 2 years ago

trondactea commented 2 years ago

First, thanks for this great toolbox!

I need to rechunk an existing global Zarr dataset (GLORYS ocean model) whose current chunks are (1, 50, 2041, 4320) for (time, depth, lat, lon). From this global dataset I frequently extract regional domains that are typically 10x10 degrees in latitude-longitude, so I thought read access would be faster if I rechunked to (324, 50, 100, 100).

The conversion with rechunker went well, but reading the result with xarray.open_zarr fails because .zmetadata is missing. The original Zarr dataset has consolidated metadata available.

Is there an option to create the consolidated metadata, or is my approach wrong? My code for converting the existing Zarr dataset is below. I appreciate any help!

Thanks, Trond

import xarray as xr
import datetime as dt
import gcsfs
import dask
import os
import shutil
from google.cloud import storage
from dask.distributed import Client
from rechunker import rechunk
import dask.array as dsa

# Authenticate to GCS with the default Google credentials
fs = gcsfs.GCSFileSystem(token='google_default')

for var_name in ["thetao"]:
    zarr_url = f"gs://shared/zarr/copernicus/{var_name}"

    mapper = fs.get_mapper(zarr_url)

    # Open the source dataset lazily from its consolidated Zarr store
    source_array = xr.open_zarr(mapper, consolidated=True)
    print(source_array.chunks)

    # Target chunk sizes per dimension and the memory limit per task
    max_mem = '1GB'
    target_chunks = {'time': 324, 'latitude': 100, 'longitude': 100}

    # You must have write access to these locations
    store_tmp = fs.get_mapper('gs://shared/zarr/temp.zarr')
    store_target = fs.get_mapper('gs://shared/zarr/target.zarr')

    # Build the rechunking plan and execute it
    r = rechunk(source_array, target_chunks, max_mem, store_target, temp_store=store_tmp)
    result = r.execute()

    # Open the rechunked output to check the result
    dsa.from_zarr(result)

jbusecke commented 2 years ago

I think you need to actually consolidate the metadata in a separate step. See here
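
That separate step is just a call to zarr's consolidation helper; a minimal sketch, assuming the store_target mapper from the snippet above:

import zarr

# Write a .zmetadata key that consolidates the .zgroup/.zarray/.zattrs entries
zarr.consolidate_metadata(store_target)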

rabernat commented 2 years ago

The conversion went well with rechunker but when trying to read the dataset using xarray.open_zarr it fails due to missing .zmetadata

Can you share the full error traceback you obtained?

trondactea commented 2 years ago

Here is the full traceback when I try to run:

for var_name in ["thetao"]:
    zarr_url = f"gs://shared/zarr/target.zarr/{var_name}"
    mapper = fs.get_mapper(zarr_url)
    ds = xr.open_zarr(mapper, consolidated=True)

Traceback:

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jovyan/datasets/run_zarr_tests.py", line 32, in <module>
    ds = xr.open_zarr(mapper, consolidated=True)
  File "/opt/conda/lib/python3.9/site-packages/xarray/backends/zarr.py", line 768, in open_zarr
    ds = open_dataset(
  File "/opt/conda/lib/python3.9/site-packages/xarray/backends/api.py", line 495, in open_dataset
    backend_ds = backend.open_dataset(
  File "/opt/conda/lib/python3.9/site-packages/xarray/backends/zarr.py", line 824, in open_dataset
    store = ZarrStore.open_group(
  File "/opt/conda/lib/python3.9/site-packages/xarray/backends/zarr.py", line 384, in open_group
    zarr_group = zarr.open_consolidated(store, **open_kwargs)
  File "/opt/conda/lib/python3.9/site-packages/zarr/convenience.py", line 1183, in open_consolidated
    meta_store = ConsolidatedMetadataStore(store, metadata_key=metadata_key)
  File "/opt/conda/lib/python3.9/site-packages/zarr/storage.py", line 2590, in __init__
    meta = json_loads(store[metadata_key])
  File "/opt/conda/lib/python3.9/site-packages/fsspec/mapping.py", line 139, in __getitem__
    raise KeyError(key)
KeyError: '.zmetadata'

rabernat commented 2 years ago

Ah ok, so your options are:

ds = xr.open_zarr(mapper, consolidated=False)

or

from zarr.convenience import consolidate_metadata
consolidate_metadata(mapper)
ds = xr.open_zarr(mapper, consolidated=True)

Are you suggesting that we should automatically consolidate the target within rechunker?
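
For illustration only (this is not part of rechunker's API): a hypothetical user-side wrapper that does the consolidation automatically could look roughly like this, reusing the rechunk call from the snippet above.

import zarr
from rechunker import rechunk

def rechunk_and_consolidate(source, target_chunks, max_mem, target_store, temp_store=None):
    # Hypothetical helper: build and run the rechunking plan, then consolidate
    # the target's metadata so consolidated=True opens work afterwards.
    plan = rechunk(source, target_chunks, max_mem, target_store, temp_store=temp_store)
    plan.execute()
    zarr.consolidate_metadata(target_store)
    return target_store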

trondactea commented 2 years ago

I thought that having .zmetadata available for large datasets improves performance. If I can create the metadata with the zarr function, that of course works well. Automatic creation of .zmetadata would be very useful for me, but I don't have deep experience with zarr. Thanks for your help.

rabernat commented 2 years ago

I thought that the availability of .zmetadata for large datasets speeds up performance.

It can speed up initializing the dataset itself (xr.open_zarr) if the underlying store (GCS in this case) is slow to list. There is no performance impact after that.
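
Concretely, only the opening step differs; a small sketch, using the mapper from the snippet above:

ds_fast = xr.open_zarr(mapper, consolidated=True)    # one read of the .zmetadata key
ds_slow = xr.open_zarr(mapper, consolidated=False)   # store listing plus per-variable metadata reads
ds_fast.isel(time=0).load()                          # actual chunk reads are identical either way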

trondactea commented 2 years ago

That makes sense. Thanks, @rabernat and @jbusecke.