pangeo-data / zarr-proxy

A proxy for Zarr stores that allows for chunking overrides.
Apache License 2.0

Zarr Proxy

✨ This code is highly experimental! Let the buyer beware ⚠️ ;) ✨


A proxy for Zarr stores that allows for chunking overrides. This is useful when a client wants to request data in one chunking scheme but the data is stored in another (e.g. a dataset stored in chunks optimized for fast reading, requested by a client that needs chunks optimized for fast rendering). One advantage of using a proxy is that the data does not need to be stored persistently in multiple chunking schemes. Instead, the client requests the data in the desired chunking scheme on the fly.

Usage

The proxy is a simple FastAPI application. It can be run locally using the following command:

uvicorn zarr_proxy.main:app --reload

Once the proxy is running, you can use it to access a Zarr store by using the following URL pattern: http://{PROXY_ADDRESS}/{ZARR_STORE_ADDRESS}. For example, if the proxy is running on localhost:8000 and you want to access the Zarr store at https://my.zarr.store, you would use the following URL: http://localhost:8000/my.zarr.store.
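As a small sketch of the URL pattern above, the helper below joins a proxy address and a store URL. The name proxy_url is hypothetical and not part of zarr-proxy. It assumes, following the example above, that the store's scheme is dropped and only its host and path are appended.

```python
from urllib.parse import urlsplit


def proxy_url(proxy_address: str, store_url: str) -> str:
    """Build http://{PROXY_ADDRESS}/{ZARR_STORE_ADDRESS} (hypothetical helper)."""
    parts = urlsplit(store_url)
    # Drop the scheme (e.g. "https://") and keep host + path, as in the example above.
    store_address = f"{parts.netloc}{parts.path}".strip("/")
    return f"http://{proxy_address}/{store_address}"


print(proxy_url("localhost:8000", "https://my.zarr.store"))
```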

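The proxy supports the following HTTP headers:

chunks: chunk shape overrides for one or more variables, written as {variable}={comma-separated chunk sizes} (for example, temperature=256,256,30).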

Python client

Before constructing the chunks header, a Python client might inspect the dataset's .zmetadata to determine the existing chunking of each variable. This can be done using the requests library:

import requests

proxy_zarr_store = 'http://localhost:8000/my.zarr.store'
# get zmetadata
zmetadata = requests.get(f'{proxy_zarr_store}/.zmetadata').json()
print(zmetadata)
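To read the existing chunking out of that response, a client could walk the "{variable}/.zarray" entries under the "metadata" key. The sketch below assumes the consolidated-metadata layout used elsewhere in this README. The function name existing_chunks and the sample dictionary are illustrative only and do not come from a real store.

```python
def existing_chunks(zmetadata: dict) -> dict:
    """Map each array name to its stored chunk shape (hypothetical helper)."""
    suffix = "/.zarray"
    return {
        key[: -len(suffix)]: tuple(meta["chunks"])
        for key, meta in zmetadata["metadata"].items()
        if key.endswith(suffix)
    }


# Illustrative stand-in for the JSON returned by the request above.
sample_zmetadata = {
    "zarr_consolidated_format": 1,
    "metadata": {
        ".zgroup": {"zarr_format": 2},
        "temperature/.zarray": {"chunks": [512, 512, 10], "shape": [1024, 1024, 100]},
        "temperature/.zattrs": {"_ARRAY_DIMENSIONS": ["x", "y", "time"]},
        "pressure/.zarray": {"chunks": [128, 128, 60], "shape": [1024, 1024, 100]},
    },
}

print(existing_chunks(sample_zmetadata))
```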

Once the .zmetadata has been retrieved, the client can construct the chunks header. For example, the following code constructs a chunks header that overrides the chunking of the temperature and pressure variables (arrays) to 256x256x30:

chunks='temperature=256,256,30,pressure=256,256,30'
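Rather than writing the header string by hand, a client could build it from a mapping of variable names to chunk shapes. This is a minimal sketch. The helper name build_chunks_header is hypothetical, and it simply reproduces the string format shown above.

```python
def build_chunks_header(overrides: dict) -> str:
    """Join {variable: chunk shape} pairs into a chunks header value."""
    return ",".join(
        f"{name}={','.join(str(size) for size in shape)}"
        for name, shape in overrides.items()
    )


chunks = build_chunks_header(
    {"temperature": (256, 256, 30), "pressure": (256, 256, 30)}
)
print(chunks)
```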

We can then construct a Zarr store by passing the chunks header in the client_kwargs argument of the zarr.storage.FSStore constructor:

import zarr
store = zarr.storage.FSStore(proxy_zarr_store, client_kwargs={'headers': {"chunks": chunks}})

This store can then be used with the Xarray library:

import xarray as xr
ds = xr.open_dataset(store, engine='zarr', chunks={})

JavaScript client

A web-based client might prefetch and inspect the dataset's .zmetadata before constructing a Headers object with the desired chunks header(s) to pass on to a Zarr client.

In this example, the getHeaders() function adds a chunks header for each variable whose existing chunking exceeds the use-case-specific chunk "cap" requirements:

const getHeaders = (variables, zmetadata, axes) => {
  const headers = [];

  variables.forEach((variable) => {
    const existingChunks = zmetadata.metadata[`${variable}/.zarray`].chunks;
    const dims = zmetadata.metadata[`${variable}/.zattrs`]["_ARRAY_DIMENSIONS"];
    const { X, Y } = axes[variable];

    // cap spatial dimensions at length 256, cap non-spatial dimensions at length 30
    const limits = dims.map((d) => ([X, Y].includes(d) ? 256 : 30));
    const override = getChunkShapeOverride(existingChunks, limits);

    if (override) {
      headers.push(["chunks", `${variable}=${override.join(",")}`]);
    }
  });

  return new Headers(headers);
};
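The getChunkShapeOverride helper is not shown above. As a sketch of one behavior it might have, written in Python to match the earlier client example, the function below returns nothing when every existing chunk length is already within its limit and otherwise caps each dimension at its limit. Both the name and the behavior are assumptions, not the project's implementation.

```python
from typing import Optional, Sequence


def get_chunk_shape_override(
    existing_chunks: Sequence[int], limits: Sequence[int]
) -> Optional[list]:
    """Return a capped chunk shape, or None if no override is needed (assumed behavior)."""
    # No override when every dimension already respects its cap.
    if all(chunk <= limit for chunk, limit in zip(existing_chunks, limits)):
        return None
    # Otherwise cap each dimension at its limit, leaving smaller dimensions unchanged.
    return [min(chunk, limit) for chunk, limit in zip(existing_chunks, limits)]


print(get_chunk_shape_override([512, 512, 10], [256, 256, 30]))
print(get_chunk_shape_override([128, 128, 10], [256, 256, 30]))
```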