google-research / arco-era5

Recipes for reproducing Analysis-Ready & Cloud Optimized (ARCO) ERA5 datasets.
https://cloud.google.com/storage/docs/public-datasets/era5
Apache License 2.0

Race condition when resizing existing datasets #81

Closed shoyer closed 2 weeks ago

shoyer commented 1 month ago

Our utility for resizing an existing Zarr dataset has a race condition.

If a user attempts to access the Zarr store while it is being resized, different variables will have inconsistent sizes. To some extent, this is mitigated by the use of consolidated metadata: if entirely new chunks are written, a user opening the Zarr store will not see the resized arrays until the metadata is re-consolidated. However, we store time variables in a single chunk in order to make downloading the data efficient, which means the time variable will not be readable (and the dataset will not be openable) while the resizing is in progress. Instead, the user will see a transitory error from Zarr along the lines of ValueError: destination buffer too small; expected at least A, got B.
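For concreteness, here is a minimal sketch of the two-step resize sequence that opens this window, written against the plain zarr-python API rather than our actual resizing utility (the store path and variable names are illustrative):

import zarr

store = 'gs://some-bucket/example.zarr'  # illustrative path, not the real store

# Open the existing store for in-place modification.
group = zarr.open_group(store, mode='r+')

# Step 1: grow each array along the time dimension. The "time" array lives in
# a single chunk, so it is rewritten wholesale here, and reads of it while the
# write is in flight can fail with the transitory ValueError described above.
new_size = group['time'].shape[0] + 24
group['time'].resize((new_size,))
for name in ['temperature', 'geopotential']:  # illustrative variable names
    array = group[name]
    array.resize((new_size,) + array.shape[1:])

# Step 2: rewrite the consolidated metadata. Until this finishes, readers that
# rely on consolidated metadata still see the old array shapes.
zarr.consolidate_metadata(store)

Everything between the first resize and the final consolidate_metadata call is the window during which readers can fail.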

I'm not sure how best to handle this, given that GCS does not support atomic updates across multiple files at the same time. Fully solving the issue may require upstream fixes in Zarr. In practice, we could probably significantly reduce the likelihood of problems by ensuring that the update to the time variable's data and the write of the updated consolidated metadata happen at almost exactly the same time.

shoyer commented 1 month ago

To solve this issue in ARCO-ERA5, I suggest the following steps:

  1. Measure how long datasets cannot be read with xarray.open_zarr(), i.e., the time between resizing the time variable and finishing re-consolidation. If this is longer than a few seconds, we have a problem. Based on my experience with consolidate_metadata, I suspect our larger datasets (with 100+ variables) are currently unavailable for a few minutes, because consolidate_metadata opens each array in a separate request. But we should measure this to be sure.
  2. Reduce the time during which datasets cannot be read with open_zarr() to only a few seconds. I believe this could be done by re-implementing consolidation of metadata for existing Zarr stores -- only a few entries in the consolidated metadata need to be changed for the new array sizes (see the sketch after this list).
  3. Consider improving the error message when datasets cannot be opened. Currently users get a very vague ValueError. If instead we temporarily moved the existing Zarr store to a new path, updated it there, and then moved it back, they would get a message about the Zarr dataset not being found. If we enable hierarchical namespaces for our GCS buckets, these rename operations at the folder level could be performed atomically. This might even be fast enough that a user would only notice a slight delay opening the data, thanks to automatic retries.
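For step 2, a rough sketch of patching the consolidated metadata in place, assuming the Zarr v2 layout with a single .zmetadata object at the store root (the path, sizes, and the leading-time-dimension heuristic are all illustrative):

import json
import fsspec

store_path = 'gs://some-bucket/example.zarr'  # illustrative path
new_time_size = 744_000                       # illustrative new time-axis length

# Read the existing consolidated metadata (one JSON object in Zarr v2).
with fsspec.open(f'{store_path}/.zmetadata', 'r') as f:
    consolidated = json.load(f)

# Update only the entries whose shapes changed, instead of re-opening every
# array the way zarr.consolidate_metadata() does.
old_time_size = consolidated['metadata']['time/.zarray']['shape'][0]
for key, entry in consolidated['metadata'].items():
    # Assumes "time" is the leading dimension of every array that was resized.
    if key.endswith('/.zarray') and entry['shape'][0] == old_time_size:
        entry['shape'][0] = new_time_size

# Write the patched metadata back as a single object update.
with fsspec.open(f'{store_path}/.zmetadata', 'w') as f:
    json.dump(consolidated, f)

The per-array .zarray objects are already updated by the resize itself, so only the consolidated copy needs to catch up, and here it does so with one read and one write of a single object instead of one request per array.
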
shoyer commented 1 month ago

A simpler option than optimizing the resizing script is to resize all datasets once, e.g., to cover all dates through 2050.

We could then add attributes to the root group of each dataset indicating when it was last updated and which times contain valid data, e.g.,

valid_time_start = '1940-01-01T00'
valid_time_stop = '2024-04-30T23'
last_updated = '2024-08-01T00:00:00'

Users should then be able to select only the valid dates (those with non-missing data) with something like the following:

import xarray

# Open the full, pre-allocated store; chunks=None avoids creating dask arrays,
# and token='anon' gives anonymous access to the public bucket.
ds = xarray.open_zarr(
    'gs://gcp-public-data-arco-era5/ar/model-level-1h-0p25deg.zarr-v1',
    chunks=None,
    storage_options=dict(token='anon'),
)
# Keep only the time range that actually contains data.
ds = ds.sel(time=slice(ds.attrs['valid_time_start'], ds.attrs['valid_time_stop']))

This alternative would be entirely free of race conditions, because chunks of data (other than for the "time" variable) can be written independently, and the presence of new data is indicated by metadata attributes on the root group, which can be updated atomically.
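As a self-contained illustration of that workflow, here is a sketch of the writer side using a local toy store and a made-up variable name rather than the real ERA5 pipeline; the chunk sizes, dates, and attribute values are assumptions:

import numpy as np
import pandas as pd
import xarray
import zarr

store = 'example_era5_like.zarr'  # local stand-in for the GCS store

# Pre-allocate a store whose time axis already extends past the available
# data, recording the currently valid range in the root group's attributes.
times = pd.date_range('2024-01-01', periods=48, freq='1h')
template = xarray.Dataset(
    {'t2m': ('time', np.full(times.size, np.nan))},
    coords={'time': times},
    attrs={'valid_time_start': '2024-01-01T00', 'valid_time_stop': '2024-01-01T23'},
)
template.to_zarr(store, mode='w', encoding={'t2m': {'chunks': (24,)}})

# Step 1: write a new day of data into the pre-allocated region. The time
# coordinate was already written above, so only data variables are touched,
# and chunks for different variables can be written independently.
new = xarray.Dataset({'t2m': ('time', np.random.rand(24))})
new.to_zarr(store, region={'time': slice(24, 48)})

# Step 2: flip the validity attributes in a single metadata update, so readers
# see either the old valid range or the new one, never a partial state.
group = zarr.open_group(store, mode='r+')
group.attrs.update(valid_time_stop='2024-01-02T23',
                   last_updated='2024-09-01T00:00:00')
zarr.consolidate_metadata(store)

Because the time coordinate is written once when the store is pre-allocated, routine updates never rewrite its single chunk, which is exactly what removes the race described above.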

dabhicusp commented 2 weeks ago

Closing this issue, as it is fixed in #84.