google-research / arco-era5

Recipes for reproducing Analysis-Ready & Cloud Optimized (ARCO) ERA5 datasets.
https://cloud.google.com/storage/docs/public-datasets/era5
Apache License 2.0

NaNs on last days of months #78

Closed nshankar closed 2 months ago

nshankar commented 2 months ago

Here is a quick way to replicate:

import xarray as xr

ds = xr.open_zarr(
    "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3",
    chunks=None,
    storage_options=dict(token="anon"),
)
times = [
    "2023-08-31T00:00:00.000000000",
    "2023-09-30T00:00:00.000000000",
    "2023-10-31T00:00:00.000000000",
    "2023-11-30T00:00:00.000000000",
    "2023-12-31T00:00:00.000000000",
    "2024-01-31T00:00:00.000000000",
]
var = "u_component_of_wind"

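for time in times:
    # Select one timestamp of the chosen variable. The name `da` is used
    # instead of `slice` so the Python builtin is not shadowed.
    da = ds.sel(time=time)[var]
    print(
        f"{da.isnull().sum().compute().values} NaNs out of {da.size} at time {time}"
    )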

This script only checks u_component_of_wind at midnight, but the problem appears to affect many variables and multiple time slices on these days. Other variables I've noticed this for are:

10m_u_component_of_wind
10m_v_component_of_wind
2m_dewpoint_temperature
2m_temperature
geopotential
surface_pressure
temperature
total_cloud_cover
total_column_water
total_column_water_vapour
u_component_of_wind
v_component_of_wind
vertical_velocity
volumetric_soil_water_layer_1
volumetric_soil_water_layer_2
volumetric_soil_water_layer_3
volumetric_soil_water_layer_4
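
To check several variables and every time step on a given day, rather than only midnight, something like the following sketch could be used. The helper name `nan_fraction_by_variable` is hypothetical and not part of the arco-era5 repository; it assumes the dataset has a datetime `time` coordinate, as the one opened above does.

```python
import xarray as xr


def nan_fraction_by_variable(
    ds: xr.Dataset, variables: list[str], day: str
) -> dict[str, float]:
    """Return, per variable, the fraction of NaN values over all time steps on `day`.

    Hypothetical helper for illustration; `day` is a date string such as "2023-08-31".
    """
    # Slicing a datetime coordinate with a date string on both ends
    # selects every timestamp that falls within that calendar day.
    day_ds = ds[variables].sel(time=slice(day, day))
    # isnull() marks NaNs; the mean of the boolean mask is the NaN fraction.
    return {name: float(day_ds[name].isnull().mean()) for name in variables}
```

For example, `nan_fraction_by_variable(ds, ["2m_temperature", "surface_pressure"], "2023-08-31")` would report the share of missing values for two surface variables. Pressure-level variables hold 37 levels per time step, so a full day of one of them is several gigabytes; selecting a single level first keeps the check cheap.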

shoyer commented 2 months ago

Thanks for the report! We are looking into this. I suspect it may be an issue with our incremental update scripts.

dabhicusp commented 2 months ago

Thank you @nshankar for identifying this issue. I have figured out why it occurred and am working on a solution.

dabhicusp commented 2 months ago

The issue affects the last day of each month since the automatic monthly data-appending script started running (that is, from August 2023).

Dates for which actual data was not added: [ (2023, 8, 31), (2023, 9, 30), (2023, 10, 31), (2023, 11, 30), (2023, 12, 31), (2024, 1, 31), (2024, 2, 29), (2024, 3, 31), (2024, 4, 30) ]
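
The dates listed above are the last calendar day of each month from August 2023 through April 2024. A minimal sketch, using only the Python standard library, that regenerates such a list (the helper name `last_days_of_months` is hypothetical and is not taken from the arco-era5 update scripts):

```python
import calendar
from datetime import date


def last_days_of_months(start: tuple[int, int], end: tuple[int, int]) -> list[date]:
    """Return the last calendar day of every month from `start` to `end`, inclusive.

    `start` and `end` are (year, month) tuples. Hypothetical helper for illustration.
    """
    year, month = start
    days = []
    while (year, month) <= end:
        # monthrange returns (weekday of the first day, number of days in the month),
        # so leap years are handled automatically.
        _, n_days = calendar.monthrange(year, month)
        days.append(date(year, month, n_days))
        # Advance to the next month, rolling over the year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return days


affected = last_days_of_months((2023, 8), (2024, 4))
print([(d.year, d.month, d.day) for d in affected])
```

Each resulting date can be formatted with `d.isoformat()` and passed to `ds.sel(time=...)` to re-check the dataset after the fix.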

I've updated the Zarr store with the missing data and modified the script to fix this.

I'm marking this issue as resolved, but if you find any more problems, let me know.

BTW thanks once again @nshankar for raising the issue.