google-research / arco-era5

Recipes for reproducing Analysis-Ready & Cloud Optimized (ARCO) ERA5 datasets.
https://cloud.google.com/storage/docs/public-datasets/era5
Apache License 2.0

Rechunk Pressure level data in lat lon dataset #69

Open loliverhennigh opened 6 months ago

loliverhennigh commented 6 months ago

I have been using the lat-lon dataset at gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3, and it has been extremely helpful for setting up different projects. Would it be possible to rechunk the pressure-level data? Currently all pressure levels are stored in a single chunk, so subsampling even one level means downloading the entire chunk, which wastes bandwidth and significantly slows reads. Since this is in object storage, we could ideally use much smaller chunks, with each chunk holding a single lat-lon grid. What do you think?
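To put a rough number on the overhead described above, here is a back-of-the-envelope sketch. It assumes a 721 x 1440 grid (0.25 degree), 37 levels, one time step per chunk, and float32 values. These are assumptions read off the dataset name, not confirmed from the store's metadata. Sizes are uncompressed, so the bytes actually transferred will be smaller, but the ratio is what matters.

```python
from math import prod

# Assumed layout (not read from the store): 37 pressure levels on a
# 0.25-degree global grid, one time step per chunk, float32 values.
LEVELS = 37
LAT, LON = 721, 1440
ITEMSIZE = 4  # bytes per float32 value


def chunk_bytes(shape, itemsize=ITEMSIZE):
    """Uncompressed size in bytes of one chunk with the given shape."""
    return prod(shape) * itemsize


# Current layout: every level of a time step sits in one chunk.
current = chunk_bytes((1, LEVELS, LAT, LON))
# Requested layout: one lat-lon grid (a single level) per chunk.
per_level = chunk_bytes((1, 1, LAT, LON))

# How many times more data is fetched than needed to read one level.
amplification = current // per_level

print(f"current chunk:   {current / 1e6:.1f} MB")
print(f"per-level chunk: {per_level / 1e6:.1f} MB")
print(f"read amplification for one level: {amplification}x")
```

Under these assumptions, reading a single level of one variable at one time step pulls in every level, so the amplification factor equals the number of levels.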

shoyer commented 5 months ago

This is a lot of data, so I don't think we're going to store another duplicate version of this dataset. But there are a number of tools for rechunking the data yourself; see, e.g., rechunker or xarray-beam.