google-research / arco-era5

Recipes for reproducing Analysis-Ready & Cloud Optimized (ARCO) ERA5 datasets.
https://cloud.google.com/storage/docs/public-datasets/era5
Apache License 2.0

Update README dataset description #74

Closed shoyer closed 2 months ago

shoyer commented 2 months ago

I've reorganized the README to introduce the "Analysis Ready" and "Cloud Optimized" datasets separately, with the expectation that users will be most interested in the former.

I've also updated all dataset entries with size and chunking information, generated with the following snippet:

import xarray_beam
import math

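def get_size(x):
  """Convert a byte count to a (value, units) pair using decimal (SI) units."""
  for threshold, units in [
      (1e6, 'MB'),
      (1e9, 'GB'),
      (1e12, 'TB'),
      (1e15, 'PB'),
  ]:
    # Pick the first unit for which the scaled value stays below 1000.
    if x < threshold * 1000:
      return x / threshold, units
  raise RuntimeError('unhandled size')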

for path in [
    'gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3',
    'gs://gcp-public-data-arco-era5/ar/model-level-1h-0p25deg.zarr-v1',
    'gs://gcp-public-data-arco-era5/co/model-level-wind.zarr-v2',
    'gs://gcp-public-data-arco-era5/co/model-level-moisture.zarr-v2',
    'gs://gcp-public-data-arco-era5/co/single-level-surface.zarr-v2',
    'gs://gcp-public-data-arco-era5/co/single-level-reanalysis.zarr-v2',
    'gs://gcp-public-data-arco-era5/co/single-level-forecast.zarr-v2',
]:
  ds, chunks = xarray_beam.open_zarr(
      path, storage_options=dict(token='anon')
  )
  print()
  print(path)
  # Uncompressed size of all variables from 1940 onwards.
  size, units = get_size(ds.sel(time=slice("1940", None)).nbytes)
  print(f'Total size (1940-present): {size:.3g} {units}')
  print('Chunks:', chunks)
  # Uncompressed size of one chunk, assuming 4 bytes per value (e.g. float32).
  size, units = get_size(4 * math.prod(chunks.values()))
  print(f'Chunk size: {size:.3g} {units}')
  print(f'Last time: {ds.indexes["time"][-1]}')

This currently outputs:

gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3
Total size (1940-present): 2.05 PB
Chunks: {'time': 1, 'latitude': 721, 'longitude': 1440, 'level': 37}
Chunk size: 154 MB
Last time: 2024-03-31 23:00:00

gs://gcp-public-data-arco-era5/ar/model-level-1h-0p25deg.zarr-v1
Total size (1940-present): 5.88 PB
Chunks: {'time': 1, 'hybrid': 18, 'latitude': 721, 'longitude': 1440}
Chunk size: 74.8 MB
Last time: 2024-03-31 23:00:00

gs://gcp-public-data-arco-era5/co/model-level-wind.zarr-v2
Total size (1940-present): 664 TB
Chunks: {'time': 1, 'hybrid': 1, 'values': 410240}
Chunk size: 1.64 MB
Last time: 2024-03-31 23:00:00

gs://gcp-public-data-arco-era5/co/model-level-moisture.zarr-v2
Total size (1940-present): 1.54 PB
Chunks: {'time': 1, 'hybrid': 1, 'values': 542080}
Chunk size: 2.17 MB
Last time: 2024-03-31 23:00:00

gs://gcp-public-data-arco-era5/co/single-level-surface.zarr-v2
Total size (1940-present): 2.42 TB
Chunks: {'time': 1, 'values': 410240}
Chunk size: 1.64 MB
Last time: 2024-03-31 23:00:00

gs://gcp-public-data-arco-era5/co/single-level-reanalysis.zarr-v2
Total size (1940-present): 60.9 TB
Chunks: {'time': 1, 'values': 542080}
Chunk size: 2.17 MB
Last time: 2024-03-31 23:00:00

gs://gcp-public-data-arco-era5/co/single-level-forecast.zarr-v2
Total size (1940-present): 53.2 TB
Chunks: {'time': 1, 'step': 1, 'values': 542080}
Chunk size: 2.17 MB
Last time: 2024-03-31 18:00:00
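
For anyone who wants to sanity-check the chunk sizes without opening the datasets, here is a minimal offline sketch that recomputes them from the chunk dictionaries printed above. It assumes 4 bytes per value, matching the `4*` factor in the snippet (presumably float32). `format_size` and `chunk_size` are hypothetical helper names used only for this illustration, not part of the repository.

```python
import math

# Assumption: every value occupies 4 bytes (e.g. float32), as implied by the
# `4 * math.prod(...)` expression in the snippet above.
BYTES_PER_VALUE = 4


def format_size(num_bytes):
  """Format a byte count with decimal (SI) units and 3 significant digits."""
  for threshold, units in [(1e6, 'MB'), (1e9, 'GB'), (1e12, 'TB'), (1e15, 'PB')]:
    if num_bytes < threshold * 1000:
      return f'{num_bytes / threshold:.3g} {units}'
  raise ValueError('unhandled size')


def chunk_size(chunks):
  """Uncompressed size of a single chunk, given its per-dimension lengths."""
  return format_size(BYTES_PER_VALUE * math.prod(chunks.values()))


# Chunk dictionaries copied from the output above.
examples = {
    'ar/full_37-1h-0p25deg-chunk-1.zarr-v3':
        {'time': 1, 'latitude': 721, 'longitude': 1440, 'level': 37},
    'ar/model-level-1h-0p25deg.zarr-v1':
        {'time': 1, 'hybrid': 18, 'latitude': 721, 'longitude': 1440},
    'co/model-level-wind.zarr-v2':
        {'time': 1, 'hybrid': 1, 'values': 410240},
    'co/model-level-moisture.zarr-v2':
        {'time': 1, 'hybrid': 1, 'values': 542080},
}

for name, chunks in examples.items():
  print(f'{name}: {chunk_size(chunks)}')
```

These are uncompressed sizes per variable chunk; the bytes actually stored in the bucket depend on the compression applied to each array.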