google-research / arco-era5

Recipes for reproducing Analysis-Ready & Cloud Optimized (ARCO) ERA5 datasets.
https://cloud.google.com/storage/docs/public-datasets/era5
Apache License 2.0

How to increase the speed of saving ERA5 chunks data? #65

Open yangxianke opened 9 months ago

yangxianke commented 9 months ago

Hi, everyone. It is really convenient to access ERA5 data from cloud storage. However, it's very slow to save the processed data in NetCDF format. It has been running for 40 minutes so far and the file still has not been written. How can I solve this problem and increase the speed of saving chunked ERA5 data? This is my code.

import xarray as xr
reanalysis = xr.open_zarr(
    'gs://gcp-public-data-arco-era5/ar/1959-2022-full_37-1h-0p25deg-chunk-1.zarr-v2', 
    chunks={'time': 48},
    consolidated=True)

## data_CN Size: 119G
data_CN = reanalysis["2m_temperature"].loc["1961-01-01":'2020-12-31',60:0,70:130]  

## data_CN_daily Size: 159M
data_CN_daily = data_CN.resample(time="M").mean()   ## note: "M" gives monthly means; use "D" for daily means

data_CN_daily = data_CN_daily.compute()    ## this step takes a long time (50+ minutes)

data_CN_daily.to_netcdf("data_cn.nc")

thorinf-orca commented 7 months ago

import os
import os.path as osp

# ds, var, region, out_dir and the slice objects come from surrounding code.
subset = ds[var].sel(latitude=latitude_slice, longitude=longitude_slices)
path = osp.join(out_dir, f"{var}_{region}.zarr")
mode = 'a' if osp.exists(path) and os.listdir(path) else 'w'
subset.to_zarr(path, mode=mode, compute=True)

I'm trying something similar, but noticing a very slow increase in my output zarr footprint, 0.5MB/s at best.

shoyer commented 3 months ago

The problem is that the data is stored in a way that only makes it efficient to access all locations at once. If you slice out a small area and load all times, you are effectively loading data for the entire globe.
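
A quick way to see this is to inspect the chunk shape recorded in the store's metadata; a minimal sketch, assuming the same store path as in the original post:

import xarray as xr

ds = xr.open_zarr(
    'gs://gcp-public-data-arco-era5/ar/1959-2022-full_37-1h-0p25deg-chunk-1.zarr-v2',
    consolidated=True)

# The on-disk chunk shape xarray records when opening a zarr store. If each
# chunk covers the full latitude/longitude grid, slicing out a small region
# still downloads whole-globe chunks for every requested time step.
print(ds['2m_temperature'].encoding.get('chunks'))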

To fix this, you could use a tool like "rechunker" to convert these arrays into a format that allows for efficient queries across time: https://medium.com/pangeo/rechunker-the-missing-link-for-chunked-array-analytics-5b2359e9dc11
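
Roughly, a rechunker invocation for a single surface variable could look like the sketch below. The target chunk sizes, output paths, and memory limit are illustrative placeholders, the exact target_chunks layout for an xarray source may differ between rechunker versions, and rechunking the full hourly record is itself a very large data transfer.

import xarray as xr
from rechunker import rechunk

source = xr.open_zarr(
    'gs://gcp-public-data-arco-era5/ar/1959-2022-full_37-1h-0p25deg-chunk-1.zarr-v2',
    consolidated=True)[['2m_temperature']]

# Target layout: long runs of time per chunk and small spatial tiles, so that
# a point or regional time series only touches a handful of chunks.
plan = rechunk(
    source,
    target_chunks={
        '2m_temperature': {'time': 8760, 'latitude': 21, 'longitude': 21},
        'time': None, 'latitude': None, 'longitude': None,
    },
    max_mem='4GB',
    target_store='t2m_time_chunked.zarr',   # hypothetical output path
    temp_store='rechunker_tmp.zarr',        # scratch store rechunker writes to
)
plan.execute()   # runs as a Dask computation; expect this to take a long time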

chudlerk commented 2 months ago

> The problem is that the data is stored in a way that only makes it efficient to access all locations at once. If you slice out a small area and load all times, you are effectively loading data for the entire globe.
>
> To fix this, you could use a tool like "rechunker" to convert these arrays into a format that allows for efficient queries across time: https://medium.com/pangeo/rechunker-the-missing-link-for-chunked-array-analytics-5b2359e9dc11

Could you provide an example of the optimal way to do this? Let's say I just need data at one latitude/longitude/level, but for the entire record.
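
For concreteness, such a point query would look roughly like the sketch below (variable name, coordinates, and level are placeholders); with the current all-locations-per-chunk layout it still ends up streaming whole-globe chunks:

import xarray as xr

ds = xr.open_zarr(
    'gs://gcp-public-data-arco-era5/ar/1959-2022-full_37-1h-0p25deg-chunk-1.zarr-v2',
    chunks={'time': 48},
    consolidated=True)

# One grid point at one pressure level, over the full 1959-2022 record.
point = ds['temperature'].sel(
    latitude=40.0, longitude=255.0, level=500, method='nearest')

point = point.compute()   # slow with the native all-locations-per-chunk layout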