google-research / arco-era5

Recipes for reproducing Analysis-Ready & Cloud Optimized (ARCO) ERA5 datasets.
https://cloud.google.com/storage/docs/public-datasets/era5
Apache License 2.0

NaNs in 6-hourly analysis-ready dataset for 2m temperature #62

Closed tom-andersson closed 2 weeks ago

tom-andersson commented 10 months ago

Hi there! I've come across NaNs in the 2m_temperature variable in the 6-hourly analysis-ready dataset -- MWE below -- does this reproduce for you?

Three strange observations:

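    import xarray as xr

    source = "gs://gcp-public-data-arco-era5/ar/1959-2022-full_37-6h-0p25deg-chunk-1.zarr-v2"
    # Open the store lazily; nothing is downloaded until .load() is called.
    era5_zarr = xr.open_zarr(source, consolidated=True, chunks={"time": 48})
    # Load the four 6-hourly time steps for a single day.
    era5_zarr["2m_temperature"].sel(time="2015-06-28").load()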

Returns

    <xarray.DataArray '2m_temperature' (time: 4, latitude: 721, longitude: 1440)>
    array([[[nan, nan, nan, ..., nan, nan, nan],
            [nan, nan, nan, ..., nan, nan, nan],
            [nan, nan, nan, ..., nan, nan, nan],
            ...,
            [nan, nan, nan, ..., nan, nan, nan],
            [nan, nan, nan, ..., nan, nan, nan],
            [nan, nan, nan, ..., nan, nan, nan]],

           [[nan, nan, nan, ..., nan, nan, nan],
            [nan, nan, nan, ..., nan, nan, nan],
            [nan, nan, nan, ..., nan, nan, nan],
            ...,
            [nan, nan, nan, ..., nan, nan, nan],
            [nan, nan, nan, ..., nan, nan, nan],
            [nan, nan, nan, ..., nan, nan, nan]],

           [[nan, nan, nan, ..., nan, nan, nan],
            [nan, nan, nan, ..., nan, nan, nan],
            [nan, nan, nan, ..., nan, nan, nan],
            ...,
            [nan, nan, nan, ..., nan, nan, nan],
            [nan, nan, nan, ..., nan, nan, nan],
            [nan, nan, nan, ..., nan, nan, nan]],

           [[nan, nan, nan, ..., nan, nan, nan],
            [nan, nan, nan, ..., nan, nan, nan],
            [nan, nan, nan, ..., nan, nan, nan],
            ...,
            [nan, nan, nan, ..., nan, nan, nan],
            [nan, nan, nan, ..., nan, nan, nan],
            [nan, nan, nan, ..., nan, nan, nan]]], dtype=float32)
    Coordinates:
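For anyone wanting to find which time steps are affected without reading printed arrays, counting NaNs per time step is one option. The snippet below is only a minimal sketch: `nan_fraction_by_time` is a hypothetical helper (not part of arco-era5 or xarray), and it runs on a small synthetic array so it works without access to the bucket. The dimension names are assumed to match the output above.

```python
import numpy as np
import pandas as pd
import xarray as xr


def nan_fraction_by_time(da: xr.DataArray) -> xr.DataArray:
    """Return the fraction of NaN grid points for each time step."""
    # isnull() yields booleans; averaging over the spatial dimensions
    # turns them into a per-time fraction between 0 and 1.
    return da.isnull().mean(dim=["latitude", "longitude"])


# Synthetic stand-in for the real variable: 4 time steps on a 3 x 4 grid.
times = pd.date_range("2015-06-28", periods=4, freq="6h")
data = np.ones((4, 3, 4), dtype="float32")
data[1] = np.nan        # one fully missing time step
data[2, 0, 0] = np.nan  # one partially missing time step

da = xr.DataArray(
    data,
    dims=("time", "latitude", "longitude"),
    coords={"time": times},
    name="2m_temperature",
)

frac = nan_fraction_by_time(da)
# Time stamps where every grid point is NaN.
bad_times = frac["time"].values[frac.values == 1.0]
print(frac.values)
print(bad_times)
```

On the real store, the same helper could be applied to `era5_zarr["2m_temperature"]` over a limited time range, since scanning the whole record would download a lot of data.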

dabhicusp commented 10 months ago

Yes @tom-andersson it's reproducible for us too.

Just FYI: we are only maintaining gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3/, and we will be deprecating all other files in the future.

tom-andersson commented 10 months ago

Thanks for confirming @dabhicusp - I'll switch to the 1-hourly dataset in my application.

The reason I was using the 6-hourly dataset was partly out of laziness to reduce download size/duration when I only want daily averages for testing environmental ML, see https://github.com/google-research/arco-era5/issues/61.
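For the daily-average use case, the reduction can be done from sub-daily data with xarray's `resample`. This is a rough sketch on synthetic hourly data rather than the real store; `daily_mean` is a hypothetical helper name, and the `time` dimension name is assumed to match the dataset above.

```python
import numpy as np
import pandas as pd
import xarray as xr


def daily_mean(da: xr.DataArray) -> xr.DataArray:
    """Average a sub-daily DataArray into one value per calendar day."""
    # resample groups the time axis into 1-day bins; mean() skips NaNs
    # by default for float data, so a bin is NaN only if all inputs are NaN.
    return da.resample(time="1D").mean()


# Two days of synthetic hourly values: 0, 1, ..., 47 at a single grid point.
times = pd.date_range("2015-06-28", periods=48, freq="h")
hourly = xr.DataArray(
    np.arange(48, dtype="float32").reshape(48, 1, 1),
    dims=("time", "latitude", "longitude"),
    coords={"time": times},
    name="2m_temperature",
)

daily = daily_mean(hourly)
print(daily.values.ravel())
```

With a lazily opened zarr store, selecting the variable and time range before calling the helper should keep the download limited to the chunks that are actually needed.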

dabhicusp commented 9 months ago

Hello @tom-andersson, if your tasks have been completed successfully, could we go ahead and close this issue?

tom-andersson commented 9 months ago

Hi @dabhicusp, yes, since the dataset with NaNs isn't being maintained, feel free to close this. It could still be useful to make this clearer in the docs or to remove the dataset from the cloud bucket (if you haven't already).

dabhicusp commented 2 weeks ago

I'm closing this issue because we're only keeping the files that are mentioned in the readme.md file. Any other files that aren't listed there will be deprecated in the future.