jbusecke / esgf-virtual-zarr-data-access

ESGF working group to enable data access via virtual zarrs.
Apache License 2.0

esgf-virtual-zarr-data-access

Motivation

We aim to establish streaming access via zarr as an officially supported access pattern alongside local downloading. https://github.com/ESGF/esgf-roadmap/issues/5 provides further justification.

This effort draws heavily on the experience of the Pangeo / ESGF Cloud Data Working Group. We aim to do this:

Guide

  1. Create a mamba environment and install the required dependencies via pip:

    mamba create -n esgf-virtual-zarr-data-access python=3.11
    mamba activate esgf-virtual-zarr-data-access
    pip install -r requirements.txt
  2. Modify the URLs and the output JSON filename in virtual-zarr-script.py, then run the script:

    python virtual-zarr-script.py
  3. Check that the generated JSON file is readable with xarray and that the mean over the full dataset can be computed (the script also does this):

import xarray as xr
ds = xr.open_dataset(
    '<your_filename>.json', 
    engine='kerchunk',
    chunks={},
)
ds.mean().load() # test that all chunks can be accessed.
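
Before opening the file with xarray, it can help to look at the reference file itself. The following is a minimal sketch, assuming the script writes a kerchunk-style JSON reference file in which each entry is either an inline string or a `[url, offset, length]` pointer into a source file. The helper names `summarize_refs` and `summarize_file` and the miniature `example` document are hypothetical and not part of this repository.

```python
import json
from collections import Counter


def summarize_refs(refs_doc):
    """Count inline entries and remote chunk pointers in a parsed reference document."""
    # Use the nested "refs" mapping when present; otherwise treat the
    # document itself as the key -> value mapping.
    refs = refs_doc.get("refs", refs_doc)
    inline = 0
    targets = Counter()
    for value in refs.values():
        if isinstance(value, list):
            # A list entry points into another file; its first element is the URL.
            targets[value[0]] += 1
        else:
            # Anything else (typically a string) is stored inline in the JSON.
            inline += 1
    return {"inline": inline, "remote": sum(targets.values()), "targets": dict(targets)}


def summarize_file(path):
    """Load a reference JSON file from disk and summarize it."""
    with open(path) as f:
        return summarize_refs(json.load(f))


# Hypothetical miniature reference document, for illustration only.
example = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "pr/.zarray": '{"shape": [2, 4]}',
        "pr/0.0": ["https://example.org/a.nc", 1000, 200],
        "pr/1.0": ["https://example.org/b.nc", 1000, 200],
    },
}
summary = summarize_refs(example)
print(summary)
```

Calling `summarize_file('<your_filename>.json')` on the generated file should list the URLs you configured in the script under `targets`. An unexpected URL or a remote count of zero suggests the script needs another look before running the xarray check.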

Goals

At the Tenth Earth System Grid Federation (ESGF) Hybrid Conference we discussed the option of serving virtualized zarr files (kerchunk reference files, for demonstration's sake). We saw an excellent demo by @rhysrevans3, who showed how to serve both the virtual zarr and the individual NetCDF files as a STAC catalog.

Milestones

Examples:

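import xarray as xr
from dask.diagnostics import ProgressBar

# CMIP6 dataset ID of the dataset to open
DSID = "CMIP6.CMIP.NCAR.CESM2.historical.r1i1p1f1.Amon.pr.gn.v20190401"
# Location of the kerchunk reference file for that dataset
esgf_url = f"http://esgf-data4.llnl.gov/thredds/fileServer/user_pub_work/vzarr/{DSID}.json"
ds = xr.open_dataset(
    esgf_url,
    engine='kerchunk',
    chunks={},  # open lazily as dask arrays
)
# Computing the mean reads every chunk; show progress while it runs.
with ProgressBar():
    a = ds.mean().load()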
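
The dataset ID in the example is a dot-separated list of CMIP6 facets, and the URL is built by appending it to a fixed prefix. The sketch below makes that explicit. The helper names `parse_dsid` and `reference_url` are hypothetical, and the prefix is copied from the example above. Whether reference files for other datasets are published under the same prefix is an assumption.

```python
# CMIP6 dataset IDs list these facets in this order, separated by dots.
FACETS = (
    "mip_era",
    "activity_id",
    "institution_id",
    "source_id",
    "experiment_id",
    "member_id",
    "table_id",
    "variable_id",
    "grid_label",
    "version",
)

# Prefix copied from the example above (assumed to hold for other datasets too).
BASE_URL = "http://esgf-data4.llnl.gov/thredds/fileServer/user_pub_work/vzarr"


def parse_dsid(dsid):
    """Split a CMIP6 dataset ID into a facet dictionary."""
    parts = dsid.split(".")
    if len(parts) != len(FACETS):
        raise ValueError(f"expected {len(FACETS)} facets, got {len(parts)}: {dsid!r}")
    return dict(zip(FACETS, parts))


def reference_url(dsid, base_url=BASE_URL):
    """Build the reference-file URL for a dataset ID, validating the ID first."""
    parse_dsid(dsid)  # raises ValueError if the ID is malformed
    return f"{base_url}/{dsid}.json"


DSID = "CMIP6.CMIP.NCAR.CESM2.historical.r1i1p1f1.Amon.pr.gn.v20190401"
facets = parse_dsid(DSID)
url = reference_url(DSID)
print(facets["source_id"], facets["variable_id"], url)
```

Validating the ID before building the URL catches a mistyped facet early, rather than as a failed HTTP request inside `xr.open_dataset`.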

Why not Kerchunk?

Open Questions

Upstream Requirements