roocs / clisops

Climate Simulation Operations
https://clisops.readthedocs.io/en/latest/

Note about using xarray open_mfdataset #88

Open agstephens opened 3 years ago

agstephens commented 3 years ago

DKRZ are loading CMIP6 into Zarr. Here are some of their experiences with xarray.open_mfdataset:

One problem with the following line:

    ds = xarray.open_mfdataset(catvar.df["path"].to_list(), use_cftime=True, combine="by_coords")

Xarray does not interpret the bounds attribute, so the corresponding lat and lon bounds are listed as data variables. That might not cause any problem by itself, but on top of that, xarray adds a time dimension to those variables:

    lat_bnds   (time, lat, bnds) float64 dask.array<chunksize=(1826, 192, 2), meta=np.ndarray>
    lon_bnds   (time, lon, bnds) float64 dask.array<chunksize=(1826, 384, 2), meta=np.ndarray>

DKRZ used:

    xarray.open_mfdataset(catvar.df["path"].to_list(),
                          decode_cf=True,
                          concat_dim="time",
                          data_vars='minimal',
                          coords='minimal',
                          compat='override')

These settings come from the xarray tutorial; with them, the bounds variables no longer get a time dimension. DKRZ did not include use_cftime, which might cause other problems, as I saw when converting the data back to netCDF.
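The effect of these keywords can be reproduced without any files. The sketch below is only an illustration under assumptions: it uses xarray.concat on small synthetic in-memory datasets rather than open_mfdataset on real CMIP6 files, and make_chunk, the variable names and the array sizes are made up for the example. It relies on open_mfdataset forwarding data_vars, coords and compat to the combine step.

```python
import numpy as np
import xarray as xr


def make_chunk(start, n_time=3, n_lat=4):
    """Build a small synthetic dataset standing in for one file of a time series.

    Hypothetical helper for illustration only; not part of clisops or xarray.
    """
    lat = np.linspace(-60.0, 60.0, n_lat)
    # Bounds depend only on lat, not on time, as in a typical CMIP6 file.
    lat_bnds = np.stack([lat - 20.0, lat + 20.0], axis=1)
    time = np.arange(start, start + n_time)
    return xr.Dataset(
        data_vars={
            "tas": (("time", "lat"), np.zeros((n_time, n_lat))),
            "lat_bnds": (("lat", "bnds"), lat_bnds),
        },
        coords={"time": time, "lat": lat},
    )


chunks = [make_chunk(0), make_chunk(3)]

# data_vars="all" concatenates every data variable along "time",
# so lat_bnds is broadcast and gains a time dimension.
broadcast = xr.concat(chunks, dim="time", data_vars="all")

# With the "minimal"/"override" settings, only variables that already have
# a time dimension are concatenated; lat_bnds is taken from the first dataset.
minimal = xr.concat(
    chunks,
    dim="time",
    data_vars="minimal",
    coords="minimal",
    compat="override",
)

print(broadcast["lat_bnds"].dims)
print(minimal["lat_bnds"].dims)
```

Note that compat="override" skips the equality check and takes non-concatenated variables from the first dataset, so it is only safe when the bounds really are identical across files.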

sol1105 commented 2 years ago

The problem with the added time dimension for bounds variables can be avoided by using the parameter decode_coords="all":

    ds = xarray.open_mfdataset("/path/to/files/*.nc", decode_coords="all")
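What decode_coords="all" does to a bounds variable can be checked on an in-memory dataset. This is a minimal sketch under assumptions: the dataset is synthetic, the variable names are chosen for the example, and xarray.decode_cf is used in place of open_mfdataset so that no files are needed.

```python
import numpy as np
import xarray as xr

lat = np.array([-45.0, 0.0, 45.0])

# Synthetic dataset: lat carries a CF "bounds" attribute pointing at lat_bnds,
# which starts out as an ordinary data variable.
raw = xr.Dataset(
    data_vars={
        "tas": (("lat",), np.zeros(3)),
        "lat_bnds": (("lat", "bnds"), np.stack([lat - 22.5, lat + 22.5], axis=1)),
    },
    coords={"lat": ("lat", lat, {"bounds": "lat_bnds"})},
)

# decode_coords="all" promotes variables referenced by CF attributes such as
# "bounds" to coordinates, so they are no longer treated as data variables.
decoded = xr.decode_cf(raw, decode_coords="all")

print(sorted(raw.data_vars))
print(sorted(decoded.data_vars))
print("lat_bnds" in decoded.coords)
```

Once lat_bnds is a coordinate rather than a data variable, it is no longer subject to the data_vars handling that broadcasts data variables along the concatenation dimension.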

However, there is another problem with xarray.open_mfdataset: the encoding dictionary gets lost somewhere while the datasets of the individual files are merged (https://github.com/pydata/xarray/issues/2436).

This leads to problems, for example, with cf-xarray when it tries to detect coordinates or bounds, and apparently also to problems with the time axis encoding (as seen in the linked issue). I managed to avoid at least the cf-xarray problems with bounds and coordinate detection by applying xarray's decode functionality only after the datasets have been read in (which, however, leaves the unnecessary time dimension in place ...):

    ds = xarray.open_mfdataset("/path/to/files/*.nc")
    ds = xarray.decode_cf(ds, decode_coords="all")