I didn't know about Zarr.js - that's super interesting, thanks for the info! The storage format is Zarr under the hood, with JSON for REST API purposes - I was thinking most users would use it with Pandas or R scripts, although I'm not too sure what users actually end up using.
Yes, I think it would work well with JavaScript visualization tools. I've made this sample app (https://climate-explorer.oikolab.com/) with Plotly Dash (in Python) to show how it loads time-series data - it's not quite as responsive as a typical JavaScript visualization, but you can get an idea of the data loading speed.
I am finding two issues with the ERA5 data. (1) I believe there may be corruption in the Pangeo zarr ERA5 datastore from 4/21 - 4/30 1983. I suggest we set those days to NaN for now. I know the issue is in the variable t2m but I have not looked at other variables. (2) The ERA5 data over Antarctica looks intermittently wrong. ERA5 known issues are documented here: https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation#ERA5:datadocumentation-Knownissues but I do not see any mention of what I'm seeing over Antarctica, so I will work on this more and report it to them.
I have made a Jupyter notebook here that shows the issues: https://nbviewer.jupyter.org/github/cgentemann/cloud_science/blob/master/Cloud_testing/era5_pangeo_1983.ipynb
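In the meantime, a minimal sketch of the NaN masking I have in mind, assuming an xarray view of the zarr store (the store path below is a placeholder):

```python
import numpy as np
import xarray as xr

# Placeholder path: point this at the actual Pangeo ERA5 zarr store.
ds = xr.open_zarr("gs://<pangeo-era5-store>", consolidated=True)

# Boolean mask over the suspect window, 4/21 - 4/30 1983.
bad = (ds["time"] >= np.datetime64("1983-04-21")) & (ds["time"] <= np.datetime64("1983-04-30T23:00"))

# Set t2m to NaN inside the window instead of dropping it, so the hourly
# time axis stays contiguous for downstream use.
ds["t2m"] = ds["t2m"].where(~bad)
```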
Hi @cgentemann! I recently ran into these issues as well, so just to add my two cents:
I've found a workaround to this problem that I cannot explain:
When reading the original ERA5 NetCDF files, before converting to zarr, I preprocess each variable with the following line, even though the dataarrays are already interpreted as float32 when opened with xarray:
ds[var] = ds[var].astype('float32')
Afterwards, when converting and then reading from the resulting zarr dataset, there are no more bumps in the data. With this "workaround" I have successfully created two worldwide ERA5 zarr datasets, one chunked in space (continuous in time) and the other chunked in time (continuous in space), each with two variables (t2m and tp), and it's the only approach I've found so far that removes those bumps. Another interesting point is that adding .astype('float32') makes the final zarr chunks slightly larger than without it, all else being equal.
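For reference, a minimal sketch of how I apply that cast during conversion (the file pattern and chunk sizes below are placeholders, not the exact ones I used):

```python
import xarray as xr

def force_float32(ds):
    # Cast every data variable explicitly, even if xarray already reports float32.
    for var in ds.data_vars:
        ds[var] = ds[var].astype("float32")
    return ds

# Placeholder file pattern; adapt to your layout.
ds = xr.open_mfdataset("era5_t2m_tp_*.nc", combine="by_coords", preprocess=force_float32)

# Example: chunked in time, continuous in space (placeholder chunk sizes).
ds = ds.chunk({"time": 744, "latitude": -1, "longitude": -1})
ds.to_zarr("era5_time_chunked.zarr", mode="w", consolidated=True)
```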
Maybe we should ping AWS about whether they had issues? The AWS ERA5 data doesn't have the same problems: https://github.com/cgentemann/cloud_science/blob/master/Cloud_testing/era5_pangeo_ice_issue_AWS.ipynb @zflamig. It would be good to try to figure out what happened... pinging @rabernat as this is concerning for pangeo-forge recipes. I guess this emphasizes the need for not just recipes, but robust testing to ensure that data-in == data-out.
@cgentemann I'm responsible for the AWS version of ERA5, so I'm already pinged. As you correctly noted, there does not seem to be a similar problem in the AWS version, and I do not remember seeing anything like this. Our data pipeline is different, however: we do not use the ECMWF NetCDF files, but GRIB, which we convert to NetCDF4 with a custom program and then convert to Zarr. IMO NetCDF support has never been a priority for ECMWF, and it is better to use the original GRIB than the awkward 64-bit offset files they provide.
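For anyone who wants to reproduce a GRIB-first route with open-source tools (our converter is a custom program, so this is only a rough equivalent, with placeholder file names):

```python
import xarray as xr

# GRIB -> NetCDF4 -> Zarr, roughly; requires cfgrib/eccodes for the GRIB engine.
ds = xr.open_dataset("era5_t2m.grib", engine="cfgrib")
ds.to_netcdf("era5_t2m_nc4.nc", format="NETCDF4")   # proper NetCDF4, not 64-bit offset
xr.open_dataset("era5_t2m_nc4.nc").to_zarr("era5_t2m.zarr", mode="w")
```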
@sebastienlanglois I think re-downloading the file from CDS does not necessarily create a new file; you may still be getting the same file from the cache, so it is still possible that the file is corrupted in some way. At least it has happened to me that when downloading monthly data from CDS, I get a file that is several days old and does not even correspond to the newest available data. So I suggest converting the NetCDF to a proper NetCDF4 file first with either nccopy or the CDM Java library, and, after confirming the data is OK, feeding it to the zarr converter.
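Roughly like this, for example (file names are placeholders, and the CDM Java route would be analogous):

```python
import subprocess
import xarray as xr

# Re-encode the CDS download as a proper NetCDF4 file with nccopy (from the netCDF tools).
subprocess.run(["nccopy", "-k", "nc4", "era5_cds_download.nc", "era5_clean.nc"], check=True)

# Confirm the re-encoded data matches the original before feeding it to the zarr converter.
orig = xr.open_dataset("era5_cds_download.nc")
clean = xr.open_dataset("era5_clean.nc")
xr.testing.assert_allclose(orig, clean)

clean.to_zarr("era5_clean.zarr", mode="w")
```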
Thanks @aluhamaa for this insight. It's interesting (and scary) that the NetCDF files from CDS are read correctly with xarray and don't show any bumps, but that those appear (randomly?) only after converting to zarr. I will look into reproducing your pipeline (GRIB > NetCDF4 > Zarr) and report back to confirm whether the NetCDFs from CDS are indeed the issue here.
To @cgentemann's point, I also agree that we should do robust testing on the data. While this issue seems pretty unique so far, it might not be the only dataset affected down the road. That being said, unless a large cluster with tons of memory is available, I would favor asserting equality, as an option, at the chunk level during conversion (while the original files used to create that chunk are still in memory) rather than after the entire zarr dataset has been converted. From my experience, it's pretty difficult to assert equality between an entire fully-rechunked, TB-sized zarr dataset and the original files, though that will depend mostly on the extent to which the dataset was rechunked.
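As an illustration, the kind of per-chunk check I have in mind during conversion, asserting against the source file while it is still in memory (paths and the one-file-per-append layout are placeholders):

```python
import glob
import xarray as xr

store = "era5_checked.zarr"
files = sorted(glob.glob("era5_*.nc"))  # e.g. one source file per month

for i, path in enumerate(files):
    src = xr.open_dataset(path).load()  # the original data for this slab, still in memory
    if i == 0:
        src.to_zarr(store, mode="w", consolidated=True)
    else:
        src.to_zarr(store, append_dim="time", consolidated=True)
    # Read back only the slab that was just written and compare it with the source now,
    # instead of validating the full TB-scale store at the end.
    written = xr.open_zarr(store, consolidated=True).sel(time=src.time).load()
    xr.testing.assert_allclose(src, written)
```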
I'm not sure if this is related, but when converting from NetCDF to zarr I saw a similar issue (in the Antarctic region) where some of the data didn't make sense. I was reading multiple NetCDF files and combining them to create a single Zarr store, but when combining the NetCDF files, the first NetCDF seemed to set a bound (e.g. -50C and +50C), and any time this value was exceeded in a subsequent NetCDF file I would see this error. So the places where this showed up were usually really cold, like Antarctica, or really hot, like Phoenix.
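I suspect, though I haven't confirmed it, that this comes from the packed-integer encoding (int16 with scale_factor/add_offset) of the first file being reused for the whole combined dataset, so values outside the first file's range can't be represented. A sketch of one way to rule that out, by dropping the packing and writing plain float32 (file pattern is a placeholder):

```python
import xarray as xr

ds = xr.open_mfdataset("era5_*.nc", combine="by_coords")

# If the first file's packing (scale_factor/add_offset over its own value range) is carried
# over to the combined dataset, out-of-range values in later files get mangled on write.
# Dropping the packing and writing plain float32 sidesteps that.
for var in ds.data_vars:
    ds[var] = ds[var].astype("float32")
    for key in ("scale_factor", "add_offset", "dtype"):
        ds[var].encoding.pop(key, None)

ds.to_zarr("era5_combined.zarr", mode="w", consolidated=True)
```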
@sebastienlanglois Can you share the NetCDF file that causes the problem? I agree it seems interesting enough to deserve more attention, and I'd like to try to reproduce it. Testing is even more difficult than it sounds. Just to give another example of what can happen: a few years ago the library I was using for reading GRIB files started returning noise for some fields, and it was not discovered by our QA process because the data corruption was already present in the very first reading step. So a 1:1 comparison does not necessarily lead to the discovery of the problem.
My guess is that the difference in data issues comes from when the data was obtained (see here for a list of resolved ERA-5 data issues).
We are using hourly t2m and precip ERA-5 data that we downloaded as NetCDF files on the regular Gaussian F320 grid, which we then aggregate to daily minimum and maximum temperature and daily precipitation and convert to zarr stores for use as a reference dataset for global bias correction of CMIP6. We had originally obtained ERA-5 data in late 2019, but later re-downloaded the hourly ERA-5 t2m and pr files after reading about those data issues, and after realizing we wanted hourly data on the F320 grid rather than data interpolated by Copernicus.
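For reference, the aggregation step is roughly the following (the precipitation variable name tp and the file pattern are placeholders for our actual inputs):

```python
import xarray as xr

hourly = xr.open_mfdataset("era5_hourly_*.nc", combine="by_coords")

# Aggregate hourly fields to the daily reference variables used for bias correction.
daily = xr.Dataset(
    {
        "tasmax": hourly["t2m"].resample(time="1D").max(),
        "tasmin": hourly["t2m"].resample(time="1D").min(),
        "pr": hourly["tp"].resample(time="1D").sum(),
    }
)
daily.to_zarr("era5_daily_reference.zarr", mode="w", consolidated=True)
```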
After reading @cgentemann's post above last month, I did some similar verification of our zarr stores to make sure that we don't have that issue over Antarctica, and we don't, but that is because we had re-downloaded the data files.
Our zarr stores include daily tasmax, tasmin, and pr data from 1995 - 2014, and if this is useful to others, I'd be happy to upload them to the pangeo bucket. These zarrs currently live on GCS and Azure. At this time the zarrs don't include data pre-1995 or post-2014, but we do have that data on GCS, so a future version of the zarrs could include it.
Hi everyone,
I'm a hydrological engineer who's been using xarray/dask for quite a while now, and I was very pleased when I stumbled upon the Pangeo project a couple of weeks ago!
I have downloaded about 25 variables from the ERA5 dataset from 1979 to 2019 and would be happy to upload them to the pangeo-data bucket in zarr format. I was wondering what the policy is regarding external contributions of datasets, as I'm not (yet) a contributor to this project.
Thank you!