Closed sol1105 closed 6 months ago
@sol1105 thanks for looking at the atlas issue :) That means we can support subsetting atlas when we add an "atlas-fix"?
@cehbrecht Yes, I think so. To be sure I would add further ATLAS test datasets (ATLAS CORDEX, CMIP5) and tests for subset, regrid (and if applicable also average) operators when we implement this fix.
I suggest a general fix in clisops: when reading in datasets, check for string/character variables and, if present, remove any deflation options as in my post above (unless we gain further insight into what causes this issue: the way xarray uses netCDF, netCDF itself, or compressing character variables in general).
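A minimal sketch of such a check, not the actual clisops implementation. It assumes the dataset behaves like an xarray.Dataset (each variable exposes a `dtype` and an `encoding` dict) and that the deflate settings live under the encoding keys `zlib`, `complevel` and `shuffle`. The function name and the key list are illustrative assumptions.

```python
# Encoding keys assumed to carry the deflate/compression settings.
COMPRESSION_KEYS = ("zlib", "complevel", "shuffle")


def strip_string_compression(ds, keys=COMPRESSION_KEYS):
    """Remove compression settings from string/character variables, in place.

    `ds` is expected to behave like an xarray.Dataset: `ds.variables` maps
    names to objects that have a `dtype` and an `encoding` dict.
    Returns the names of the variables whose encoding was changed.
    """
    changed = []
    for name, var in ds.variables.items():
        # NumPy dtype kinds: "S" bytes, "U" unicode, "O" object (decoded strings).
        if var.dtype.kind not in ("S", "U", "O"):
            continue
        # Drop every compression key that is present on this variable.
        removed = [key for key in keys if var.encoding.pop(key, None) is not None]
        if removed:
            changed.append(name)
    return changed
```

With xarray this could be called right after `xr.open_dataset(...)` and before `to_netcdf(...)`, so that numeric variables keep their compression while string variables are written uncompressed.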
Should we raise this as an issue for netcdf (or possibly xarray) as well? In a v2 of ATLAS maybe the fillvalue and deflate problems should be addressed. Can you inform them about these issues?
Edit: Also cdo cannot open these files without problems, since the member dimension is the first dimension, which is not supported by cdo (it expects time as the first dimension).
I found some more information on that problem in the netcdf4-python repo: https://github.com/Unidata/netcdf4-python/issues/1205
The problem has apparently been fixed in netcdf-c (main branch) and will be part of an upcoming 4.9.3 release: https://github.com/Unidata/netcdf-c/pull/2716
Hi, you are faced with two viable alternatives:
The ATLAS v1 dataset was written with netcdflib version 4.4.1.1 and hdf5lib version 1.10.1, a deliberate choice to keep the format as readable as possible with other library versions.
@cofinoa Thanks for your reply. The netcdf-c PR I referenced above suggests that string variables in the ATLAS dataset will not be readable in future netcdf-c releases:
The problem has apparently been fixed in netcdf-c (main branch) and will be part of an upcoming 4.9.3 release: https://github.com/Unidata/netcdf-c/pull/2716
So our planned xarray-fix becomes useless with future netcdf-c releases. The file metadata themselves have to be altered so they will remain fully readable, independent of the netcdf-c version.
@sol1105 the PR Unidata/netcdf-c#2716 only makes "unreadable" those VL datatype datasets/variables whose filters are non-optional.
Therefore, the ATLAS v1 dataset will remain readable with the next netcdf-c release, because the filters applied to its string variables are optional.
The netCDF library has a strong principle of keeping any data generated by previous library versions readable, for curation purposes.
The problematic change was Unidata/netcdf-c#2231, included in netcdf-c versions >=4.9. It "broke" code that writes VL datatypes with filters: previously the filters were silently ignored and not applied, but since that PR an error is raised, so such code now fails.
With PR Unidata/netcdf-c#2716, the next release will raise an error when writing VL datatype data only if the filter is non-optional. If the filter is optional, it is ignored and the user is only warned that it was not applied.
That said, I will test ATLAS v1 with the next netcdf-c release.
For the xarray-fix, a 3rd option would be to use the next netcdf-c release to write the subsetted data.
Update: just to confirm that there is no issue with the latest development version of netcdf-c 4.9.3:
netcdf library version 4.9.3-development of Jan 30 2024 17:53:03 $
We need to wait for the 4.9.3 release, but my conclusion is to avoid netcdf-c versions >=4.9 AND <4.9.3, because those versions break existing code that worked with previous versions (<4.9).
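The version range named here can be checked programmatically. The following is a rough sketch using only the standard library. The function name is hypothetical, and it assumes the version string starts with three dot-separated numbers, as in `4.9.2` or `4.9.3-development`.

```python
import re


def netcdf_c_version_is_affected(version):
    """Return True if `version` falls in the range >=4.9 and <4.9.3.

    Only the leading major.minor.patch numbers are compared, so suffixes such
    as "-development" or a fourth component (e.g. "4.4.1.1") are ignored.
    """
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Cannot parse netcdf-c version: {version!r}")
    parsed = tuple(int(part) for part in match.groups())
    return (4, 9, 0) <= parsed < (4, 9, 3)
```

When netCDF4-python is installed, the string to pass in should be available as `netCDF4.__netcdf4libversion__`, so a tool could warn or skip compression of string variables when the check returns True.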
I think this can be closed with #319 and https://github.com/roocs/roocs-utils/pull/111 / https://github.com/roocs/roocs-utils/pull/113 . More info on the introduced changes/fixes can also be found there.
In general, however, the following issues should be addressed for future versions of the ATLAS datasets, since they may also affect compatibility with other tools:
- [...] (xarray and ncview, for example)
- [...] 9, which greatly reduces the performance of subset operations (compression level 1 should be sufficient: https://github.com/PCMDI/cmor/issues/403)
- [...] (time not set as first dimension)
Description
The ATLAS datasets are aggregated CMIP5/6 or CORDEX datasets that have been remapped to a regular grid and contain data from multiple sources in a single data file (arranged along a new dimension member): Link1 Link2. It is planned that clisops supports the processing of these datasets in the future. First tests show the following problems:
- clisops filenamer has to be updated.
- filenamer simple cannot write processed output to disk. This is due to a netCDF error, caused by the deflate settings of string/character variables in the ATLAS datasets.
What I Did
This fails with:
It works when overwriting the encoding settings of the character/string variables introduced in the ATLAS datasets: