roocs / clisops

Climate Simulation Operations
https://clisops.readthedocs.io/en/latest/

Problems with ATLAS datasets #317

Closed sol1105 closed 6 months ago

sol1105 commented 7 months ago

Description

The ATLAS datasets are aggregated CMIP5/6 or CORDEX datasets that have been remapped to a regular grid and that combine data from multiple sources in a single data file (arranged along a new dimension, member): Link1 Link2. clisops is planned to support processing of these datasets in the future. First tests show the following problems:

What I Did

import xarray as xr

ds = xr.open_dataset("sst_CMIP6_ssp245_mon_201501-210012.nc")
ds.sst.encoding["_FillValue"] = None   # required, since multiple fill values are defined for the data variable
ds.to_netcdf("atlas.nc")

This fails with:

RuntimeError                              Traceback (most recent call last)
---->  ds.to_netcdf("atlas.nc")

File ~/anaconda3/envs/clisopsnew/lib/python3.10/site-packages/xarray/core/dataset.py:1911, in Dataset.to_netcdf(self, path, mode, format, group, engine, encoding, unlimited_dims, compute, invalid_netcdf)
   1908     encoding = {}
   1909 from xarray.backends.api import to_netcdf
-> 1911 return to_netcdf(  # type: ignore  # mypy cannot resolve the overloads:(
   1912     self,
   1913     path,
   1914     mode=mode,
   1915     format=format,
   1916     group=group,
   1917     engine=engine,
   1918     encoding=encoding,
   1919     unlimited_dims=unlimited_dims,
   1920     compute=compute,
   1921     multifile=False,
   1922     invalid_netcdf=invalid_netcdf,
   1923 )

File ~/anaconda3/envs/clisopsnew/lib/python3.10/site-packages/xarray/backends/api.py:1217, in to_netcdf(dataset, path_or_file, mode, format, group, engine, encoding, unlimited_dims, compute, multifile, invalid_netcdf)
   1212 # TODO: figure out how to refactor this logic (here and in save_mfdataset)
   1213 # to avoid this mess of conditionals
   1214 try:
   1215     # TODO: allow this work (setting up the file for writing array data)
   1216     # to be parallelized with dask
-> 1217     dump_to_store(
   1218         dataset, store, writer, encoding=encoding, unlimited_dims=unlimited_dims
   1219     )
   1220     if autoclose:
   1221         store.close()

File ~/anaconda3/envs/clisopsnew/lib/python3.10/site-packages/xarray/backends/api.py:1264, in dump_to_store(dataset, store, writer, encoder, encoding, unlimited_dims)
   1261 if encoder:
   1262     variables, attrs = encoder(variables, attrs)
-> 1264 store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)

File ~/anaconda3/envs/clisopsnew/lib/python3.10/site-packages/xarray/backends/common.py:271, in AbstractWritableDataStore.store(self, variables, attributes, check_encoding_set, writer, unlimited_dims)
    269 self.set_attributes(attributes)
    270 self.set_dimensions(variables, unlimited_dims=unlimited_dims)
--> 271 self.set_variables(
    272     variables, check_encoding_set, writer, unlimited_dims=unlimited_dims
    273 )

File ~/anaconda3/envs/clisopsnew/lib/python3.10/site-packages/xarray/backends/common.py:309, in AbstractWritableDataStore.set_variables(self, variables, check_encoding_set, writer, unlimited_dims)
    307 name = _encode_variable_name(vn)
    308 check = vn in check_encoding_set
--> 309 target, source = self.prepare_variable(
    310     name, v, check, unlimited_dims=unlimited_dims
    311 )
    313 writer.add(source, target)

File ~/anaconda3/envs/clisopsnew/lib/python3.10/site-packages/xarray/backends/netCDF4_.py:488, in NetCDF4DataStore.prepare_variable(self, name, variable, check_encoding, unlimited_dims)
    486     nc4_var = self.ds.variables[name]
    487 else:
--> 488     nc4_var = self.ds.createVariable(
    489         varname=name,
    490         datatype=datatype,
    491         dimensions=variable.dims,
    492         zlib=encoding.get("zlib", False),
    493         complevel=encoding.get("complevel", 4),
    494         shuffle=encoding.get("shuffle", True),
    495         fletcher32=encoding.get("fletcher32", False),
    496         contiguous=encoding.get("contiguous", False),
    497         chunksizes=encoding.get("chunksizes"),
    498         endian="native",
    499         least_significant_digit=encoding.get("least_significant_digit"),
    500         fill_value=fill_value,
    501     )
    503 nc4_var.setncatts(attrs)
    505 target = NetCDF4ArrayWrapper(name, self)

File src/netCDF4/_netCDF4.pyx:2962, in netCDF4._netCDF4.Dataset.createVariable()

File src/netCDF4/_netCDF4.pyx:4202, in netCDF4._netCDF4.Variable.__init__()

File src/netCDF4/_netCDF4.pyx:2029, in netCDF4._netCDF4._ensure_nc_success()

RuntimeError: NetCDF: Filter error: bad id or parameters or duplicate filter

It works when removing the compression-related encoding settings of the character/string variables introduced in the ATLAS datasets:

import xarray as xr

ds = xr.open_dataset("/sst_CMIP6_ssp245_mon_201501-210012.nc")
ds.sst.encoding["_FillValue"] = None
# drop the compression settings of the string/character variables
for cvar in ["member_id", "gcm_variant", "gcm_model", "gcm_institution"]:
    for en in ["zlib", "shuffle", "complevel"]:
        del ds[cvar].encoding[en]
ds.to_netcdf("atlas.nc")
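
For illustration, an equivalent approach (just a sketch, not tested on these files) would be to leave the stored encoding untouched and instead disable compression for the string variables at write time via the encoding argument of to_netcdf:

import xarray as xr

ds = xr.open_dataset("sst_CMIP6_ssp245_mon_201501-210012.nc")
ds.sst.encoding["_FillValue"] = None
# disable compression for the string variables only when writing
string_vars = ["member_id", "gcm_variant", "gcm_model", "gcm_institution"]
ds.to_netcdf("atlas.nc", encoding={v: {"zlib": False, "shuffle": False} for v in string_vars})
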
cehbrecht commented 7 months ago

@sol1105 thanks for looking at the atlas issue :) That means we can support subsetting atlas when we add an "atlas-fix"?

sol1105 commented 7 months ago

@cehbrecht Yes, I think so. To be sure, I would add further ATLAS test datasets (ATLAS CORDEX, CMIP5) and tests for the subset, regrid (and, if applicable, also average) operators when we implement this fix.
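
A rough sketch of what such a test could look like, assuming the ATLAS file from above is available locally and using the clisops subset operator (the exact keyword arguments are illustrative and may need adjusting):

import xarray as xr
from clisops.ops.subset import subset

def test_subset_atlas_cmip6():
    ds = xr.open_dataset("sst_CMIP6_ssp245_mon_201501-210012.nc")
    # subset in time and return xarray objects instead of writing files
    result = subset(ds=ds, time="2015-01-01/2020-12-31", output_type="xarray")
    # the member dimension of the ATLAS aggregation should survive the subsetting
    assert "member" in result[0].dims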

I suggest a general fix in clisops: when reading in datasets, check for string/character variables and, if present, remove any deflation options as in my post above (unless we gain further insight into what causes this issue - xarray's use of netcdf, netcdf itself, or compressing character variables in general).
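
A minimal sketch of what this could look like (the helper name and the exact list of encoding keys are illustrative, this is not an existing clisops function):

import xarray as xr

def strip_string_compression(ds: xr.Dataset) -> xr.Dataset:
    # drop deflation-related encoding from all character/string variables
    for var in ds.variables.values():
        if var.dtype.kind in ("S", "U", "O"):
            for key in ("zlib", "shuffle", "complevel"):
                var.encoding.pop(key, None)
    return ds

ds = strip_string_compression(xr.open_dataset("sst_CMIP6_ssp245_mon_201501-210012.nc"))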

Should we raise this as an issue for netcdf (or possibly xarray) as well? In a v2 of ATLAS maybe the fillvalue and deflate problems should be addressed. Can you inform them about these issues?

Edit: cdo also cannot open these files without problems, since the member dimension is the first dimension, which is not supported by cdo (it expects time as the first dimension).
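
Just as a sketch (not part of the thread's fix): one way to produce cdo-friendly copies with xarray would be to move time to the front before writing; the _FillValue/compression encoding fixes from above would still be needed before calling to_netcdf.

import xarray as xr

ds = xr.open_dataset("sst_CMIP6_ssp245_mon_201501-210012.nc")
ds = ds.transpose("time", ...)   # put time first, keep the relative order of the remaining dims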

sol1105 commented 7 months ago

I found some more information on that problem in the netcdf-python repo: https://github.com/Unidata/netcdf4-python/issues/1205. The problem has apparently been fixed in netcdf-c (main branch) and will be part of an upcoming 4.9.3 release: https://github.com/Unidata/netcdf-c/pull/2716

cofinoa commented 7 months ago

Hi, you are faced with two viable alternatives:

  1. Use netcdf-c version <4.9 for writing the new files containing the subsetted data. No error is raised, but note that the filters are then silently not applied.
  2. Use netcdf-c version >=4.9, but with the caveat that all filters have to be removed from String-datatype variables to avoid the error.

The ATLAS v1 dataset was created with netcdf library version 4.4.1.1 and hdf5 library version 1.10.1, a deliberate decision aimed at keeping the format readable with as many other library versions as possible.

sol1105 commented 7 months ago

@cofinoa Thanks for your reply. The netcdf-c PR I referenced above suggests that string variables in the ATLAS dataset will not be readable with future netcdf-c releases:

The problem has apparently been fixed in netcdf-c (main branch) and will be part of an upcoming 4.9.3 release: https://github.com/Unidata/netcdf-c/pull/2716

So our planned xarray-fix would become useless with future netcdf-c releases. The file metadata itself would have to be altered so that the files remain fully readable, independent of the netcdf-c version.

cofinoa commented 7 months ago

@sol1105 the PR Unidata/netcdf-c#2716 only makes datasets/variables with VL datatypes "unreadable" when their filters are non-optional.

Therefore the ATLAS v1 dataset will remain readable with the next release: the filters applied to the String variables are optional, so the files will still be readable with the next netcdf-c.

The netCDF library has a strong principle of keeping any data generated by previous library versions readable, for curation purposes.

What was problematic was Unidata/netcdf-c#2231 in netcdf-c version >=4.9. Before that PR, code writing VL datatypes with filters worked because the filters were silently ignored and not applied; since that PR an error is raised instead, which "broke" such code.

The PR Unidata/netcdf-c#2716, in the next release, will raise an error only if a non-optional filter is applied when VL-datatype data is written; if the filter is optional it is ignored and the user just gets a warning that the filter is not applied.

That said, I will test ATLAS v1 with the next netcdf-c release.

For the xarray-fix, a third option would be to use the next netcdf-c release to write the subsetted data.

cofinoa commented 7 months ago

Update: just to confirm that there is no issue with the latest development version of netcdf-c 4.9.3:

netcdf library version 4.9.3-development of Jan 30 2024 17:53:03 $

We need to wait for the 4.9.3 release, but my conclusion is to avoid netcdf-c versions >=4.9 AND <4.9.3, because those versions break existing code that worked with previous versions (<4.9).
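
A small sketch of how a tool could guard against that version range at runtime (assuming the packaging library is available; the helper name is illustrative):

import netCDF4
from packaging.version import Version

def netcdf_c_needs_string_filter_workaround() -> bool:
    # __netcdf4libversion__ reports the linked netcdf-c version, e.g. "4.9.2" or "4.9.3-development"
    libversion = Version(netCDF4.__netcdf4libversion__.split("-")[0])
    return Version("4.9.0") <= libversion < Version("4.9.3")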

sol1105 commented 6 months ago

I think this can be closed with #319 and https://github.com/roocs/roocs-utils/pull/111 / https://github.com/roocs/roocs-utils/pull/113. More info on the introduced changes/fixes can also be found there.

In general, however, the following issues should be addressed for future versions of the ATLAS datasets, since they may also affect compatibility with other tools: