pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Can't remove coordinates attribute from DataArrays #5510

Open ellesmith88 opened 3 years ago

ellesmith88 commented 3 years ago

What happened: Coordinates added to some variables unexpectedly.

I noticed this after outputting to netCDF. What I have:

variables:
        double time(time) ;
                time:bounds = "time_bnds" ;
                time:axis = "T" ;
                time:long_name = "valid_time" ;
                time:standard_name = "time" ;
                time:units = "days since 1850-01-01" ;
                time:calendar = "gregorian" ;
        double time_bnds(time, bnds) ;
                time_bnds:_FillValue = NaN ;
                time_bnds:coordinates = "reftime leadtime height" ;
        double lat(lat) ;
                lat:bounds = "lat_bnds" ;
                lat:units = "degrees_north" ;
                lat:axis = "Y" ;
                lat:long_name = "latitude" ;
                lat:standard_name = "latitude" ;
        double lat_bnds(lat, bnds) ;
                lat_bnds:_FillValue = NaN ;
                lat_bnds:coordinates = "reftime height" ;
        double lon(lon) ;
                lon:bounds = "lon_bnds" ;
                lon:units = "degrees_east" ;
                lon:axis = "X" ;
                lon:long_name = "Longitude" ;
                lon:standard_name = "longitude" ;
        double lon_bnds(lon, bnds) ;
                lon_bnds:_FillValue = NaN ;
                lon_bnds:coordinates = "reftime height" ;
        double height ;
                height:_FillValue = NaN ;
                height:units = "m" ;
                height:axis = "Z" ;
                height:positive = "up" ;
                height:long_name = "height" ;
                height:standard_name = "height" ;
        float tas(time, lat, lon) ;
                tas:_FillValue = 1.e+20f ;
                tas:standard_name = "air_temperature" ;
                tas:long_name = "Near-Surface Air Temperature" ;
                tas:comment = "near-surface (usually, 2 meter) air temperature" ;
                tas:units = "K" ;
                tas:cell_methods = "area: time: mean" ;
                tas:cell_measures = "area: areacella" ;
                tas:history = "2019-05-11T15:53:32Z altered by CMOR: Treated scalar dimension: \'height\'. 2019-05-11T15:53:32Z altered by CMOR: Reordered dimensions, original order: lat lon time." ;
                tas:coordinates = "height reftime leadtime" ;
                tas:missing_value = 1.e+20f ;
        int reftime ;
                reftime:long_name = "Start date of the forecast" ;
                reftime:standard_name = "forecast_reference_time" ;
                reftime:units = "days since 1850-01-01" ;
                reftime:calendar = "gregorian" ;
        double leadtime(time) ;
                leadtime:_FillValue = NaN ;
                leadtime:long_name = "Time elapsed since the start of the forecast" ;
                leadtime:standard_name = "forecast_period" ;
                leadtime:units = "days" ;
        int realization ;
                realization:long_name = "realization" ;
                realization:comment = "For more information on the ripf, refer to the variant_label, initialization_description, physics_description and forcing_description global attributes" ;
                realization:coordinates = "reftime height" ;

On time_bnds, lon_bnds, lat_bnds and realization there is a coordinates attribute that I wouldn't expect to be there.

What you expected to happen: Looking only at the coordinates attribute, I expected my ncdump to show:

variables:
        int reftime ;
                reftime:long_name = "Start date of the forecast" ;
                reftime:standard_name = "forecast_reference_time" ;
                reftime:units = "days since 1850-01-01" ;
                reftime:calendar = "gregorian" ;
        double leadtime(time) ;
                leadtime:long_name = "Time elapsed since the start of the forecast" ;
                leadtime:standard_name = "forecast_period" ;
                leadtime:units = "days" ;
        int realization ;
                realization:long_name = "realization" ;
                realization:comment = "For more information on the ripf, refer to the variant_label, initialization_description, physics_description and forcing_description global attributes" ;
        double time(time) ;
                time:bounds = "time_bnds" ;
                time:axis = "T" ;
                time:standard_name = "time" ;
                time:units = "days since 1850-01-01" ;
                time:calendar = "gregorian" ;
                time:long_name = "valid_time" ;
        double time_bnds(time, bnds) ;
                time_bnds:units = "days since 1850-01-01" ;
        double lat(lat) ;
                lat:bounds = "lat_bnds" ;
                lat:units = "degrees_north" ;
                lat:axis = "Y" ;
                lat:long_name = "latitude" ;
                lat:standard_name = "latitude" ;
        double lat_bnds(lat, bnds) ;
        double lon(lon) ;
                lon:bounds = "lon_bnds" ;
                lon:units = "degrees_east" ;
                lon:axis = "X" ;
                lon:long_name = "Longitude" ;
                lon:standard_name = "longitude" ;
        double lon_bnds(lon, bnds) ;
        double height ;
                height:units = "m" ;
                height:axis = "Z" ;
                height:positive = "up" ;
                height:long_name = "height" ;
                height:standard_name = "height" ;
        float tas(time, lat, lon) ;
                tas:standard_name = "air_temperature" ;
                tas:long_name = "Near-Surface Air Temperature" ;
                tas:comment = "near-surface (usually, 2 meter) air temperature" ;
                tas:units = "K" ;
                tas:cell_methods = "area: time: mean" ;
                tas:cell_measures = "area: areacella" ;
                tas:history = "2019-05-11T15:53:32Z altered by CMOR: Treated scalar dimension: \'height\'. 2019-05-11T15:53:32Z altered by CMOR: Reordered dimensions, original order: lat lon time." ;
                tas:missing_value = 1.e+20f ;
                tas:_FillValue = 1.e+20f ;
                tas:coordinates = "height reftime leadtime" ;

I tried to remove these coordinates in the xarray dataset, but whatever I tried, they always ended up back in there:

>>> import xarray as xr
>>> ds = xr.open_dataset("file.nc", use_cftime=True)

# show coords on realization
>>> ds.realization
<xarray.DataArray 'realization' ()>
array(1, dtype=int32)
Coordinates:
    height   float64 ...
    reftime  object ...
Attributes:
    long_name:  realization
    comment:    For more information on the ripf, refer to the variant_label,...

# try reset_coords - removes the coords
>>> ds.realization.reset_coords(names=["height", "reftime"], drop=True)
<xarray.DataArray 'realization' ()>
array(1, dtype=int32)
Attributes:
    long_name:  realization
    comment:    For more information on the ripf, refer to the variant_label,...

# set realization with result of reset_coords
>>> ds["realization"] = ds.realization.reset_coords(names=["height", "reftime"], drop=True)

# coords back in
>>> ds.realization
<xarray.DataArray 'realization' ()>
array(1, dtype=int32)
Coordinates:
    height   float64 ...
    reftime  object ...
Attributes:
    long_name:  realization
    comment:    For more information on the ripf, refer to the variant_label,...

# try drop_vars - same thing happens
>>> ds.realization.drop_vars(("height", "reftime"))
<xarray.DataArray 'realization' ()>
array(1, dtype=int32)
Attributes:
    long_name:  realization
    comment:    For more information on the ripf, refer to the variant_label,...

>>> ds["realization"] = ds.realization.drop_vars(("height", "reftime"))

>>> ds.realization
<xarray.DataArray 'realization' ()>
array(1, dtype=int32)
Coordinates:
    height   float64 ...
    reftime  object ...
Attributes:
    long_name:  realization
    comment:    For more information on the ripf, refer to the variant_label,...

# tried creating a new variable to see if the same thing happens  - it does
>>> ds["test"] = ds.realization.drop_vars(("height", "reftime")) 
>>> ds.test
<xarray.DataArray 'test' ()>
array(1, dtype=int32)
Coordinates:
    height   float64 ...
    reftime  object ...
Attributes:
    long_name:  realization
    comment:    For more information on the ripf, refer to the variant_label,...

This seems like incorrect behaviour, but perhaps it is expected?

Minimal Complete Verifiable Example:

>>> import numpy as np, xarray as xr
>>> data = xr.DataArray(np.random.randn(2, 3), dims=("x", "y"), coords={"x": [10, 20]})
>>> ds = xr.Dataset({"foo": data, "bar": ("x", [1, 2]), "fake": 10})
>>> ds = ds.assign_coords({"reftime": np.array("2004-11-01T00:00:00", dtype=np.datetime64)})
>>> ds = ds.assign({"test": 1})

>>> ds.test
<xarray.DataArray 'test' ()>
array(1)
Coordinates:
    reftime  datetime64[ns] 2004-11-01

>>> ds.test.reset_coords(names=["reftime"], drop=True)
<xarray.DataArray 'test' ()>
array(1)

>>> ds["test"] = ds.test.reset_coords(names=["reftime"], drop=True)
>>> ds.test
<xarray.DataArray 'test' ()>
array(1)
Coordinates:
    reftime  datetime64[ns] 2004-11-01

>>> ds.to_netcdf("file.nc")
$ ncdump -h file.nc
netcdf file {
dimensions:
    x = 2 ;
    y = 3 ;
variables:
    int64 x(x) ;
    double foo(x, y) ;
        foo:_FillValue = NaN ;
        foo:coordinates = "reftime" ;
    int64 bar(x) ;
        bar:coordinates = "reftime" ;
    int64 fake ;
        fake:coordinates = "reftime" ;
    int64 reftime ;
        reftime:units = "days since 2004-11-01 00:00:00" ;
        reftime:calendar = "proleptic_gregorian" ;
    int64 test ;
        test:coordinates = "reftime" ;
}

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.3 (default, Mar 27 2019, 16:54:48) [Clang 4.0.1 (tags/RELEASE_401/final)]
python-bits: 64
OS: Darwin
OS-release: 18.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: ('en_GB', 'UTF-8')
libhdf5: 1.10.5
libnetcdf: 4.6.3

xarray: 0.18.2
pandas: 1.1.3
numpy: 1.19.2
scipy: None
netCDF4: 1.5.4
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.4.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2.30.0
distributed: 2.30.0
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 54.1.1
pip: 21.0.1
conda: None
pytest: 6.2.2
IPython: 7.21.0
sphinx: 1.8.1
max-sixty commented 3 years ago

Coords are on the dataset rather than the variable, so it requires something like ds = ds.reset_coords()
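A minimal sketch of that, on a toy dataset mirroring the MCVE above (names are illustrative): calling reset_coords on the Dataset, rather than on the extracted DataArray, removes the coordinate for good.

```python
import numpy as np
import xarray as xr

# toy dataset with a scalar coordinate, mirroring the MCVE above
data = xr.DataArray(np.random.randn(2, 3), dims=("x", "y"), coords={"x": [10, 20]})
ds = xr.Dataset({"foo": data, "test": 1})
ds = ds.assign_coords(reftime=np.datetime64("2004-11-01"))

# reset_coords on the *Dataset* actually takes effect; with drop=True the
# coordinate is discarded rather than demoted to a data variable
ds = ds.reset_coords("reftime", drop=True)
assert "reftime" not in ds.coords
```

Note this removes the coordinate from the whole dataset, not just from one variable.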

Thanks for the clear example!

dcherian commented 3 years ago

I am not sure there is a good solution here.

As @max-sixty says, xarray does not carry around information that links a specific variable to a subset of coordinate variables (like a coordinates attribute). So in effect, all variables are associated with all coordinate variables.
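This is easy to see directly: even after dropping the coordinate from a DataArray, assigning that DataArray back into the Dataset re-attaches the dataset-level coordinate, which is exactly the behaviour reported above. A minimal sketch with toy names:

```python
import xarray as xr

ds = xr.Dataset({"test": 1})
ds = ds.assign_coords(reftime=0)

# dropping the coordinate from the extracted DataArray works in isolation
da = ds["test"].drop_vars("reftime")
assert "reftime" not in da.coords

# but assigning it back into the Dataset merges the dataset's coords back in
ds["test"] = da
assert "reftime" in ds["test"].coords
```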

We could special-case bounds variables (we do this in some places in the encoding/decoding functions), but realization will always be a problem. I think you can override it by explicitly setting the coordinates="" attribute. Perhaps .attrs["coordinates"] = None will avoid writing the attribute; if not, we could make it work?

cc @DWesl

ellesmith88 commented 3 years ago

@dcherian thanks! I've tried overriding with .attrs["coordinates"] = None, .attrs["coordinates"] = "" and .attrs["coordinates"] = False but that doesn't seem to work either. Would be great if that worked and would solve my problem!

dcherian commented 3 years ago

This is the function to look at if you wanted to fix it: https://github.com/pydata/xarray/blob/1f5c63379e1a3e803515f467d9dcc0c9f7b1a0db/xarray/conventions.py#L707

DWesl commented 3 years ago

~encoding is where that information is stored between reading a dataset in from disk and saving it back out again.~ _encode_coordinates can take a default value from either of encoding or attrs, but a falsy value will be overwritten. Setting .attrs["coordinates"] = " " should work.

>>> import numpy as np, xarray as xr
>>> data = xr.DataArray(np.random.randn(2, 3), dims=("x", "y"), coords={"x": [10, 20]})
>>> ds = xr.Dataset({"foo": data, "bar": ("x", [1, 2]), "fake": 10})
>>> ds = ds.assign_coords({"reftime":np.array("2004-11-01T00:00:00", dtype=np.datetime64)})
>>> ds = ds.assign({"test": 1})
>>> ds.test.encoding["coordinates"] = " "
>>> ds.to_netcdf("file.nc")
$ ncdump -h file.nc
netcdf file {
dimensions:
        x = 2 ;
        y = 3 ;
variables:
        int64 x(x) ;
        double foo(x, y) ;
                foo:_FillValue = NaN ;
                foo:coordinates = "reftime" ;
        int64 bar(x) ;
                bar:coordinates = "reftime" ;
        int64 fake ;
                fake:coordinates = "reftime" ;
        int64 reftime ;
                reftime:units = "days since 2004-11-01 00:00:00" ;
                reftime:calendar = "proleptic_gregorian" ;
        int64 test ;
                test:coordinates = " " ;
}

As mentioned above, the XArray data model associates coordinates with dimensions rather than with variables, so any time you read the dataset back in again, the test variable will gain reftime as a coordinate, because the dimensions of reftime (()) are a subset of the dimensions of test (also ()).
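The dimension-subset rule can be illustrated with a toy example: a scalar coordinate has dims (), which is a subset of every variable's dims, so xarray attaches it to every variable on access.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"test": 1, "bar": ("x", [1, 2])})
ds = ds.assign_coords(reftime=np.datetime64("2004-11-01"))

# the scalar coord's dims () are a subset of every variable's dims
assert "reftime" in ds["test"].coords   # () is a subset of ()
assert "reftime" in ds["bar"].coords    # () is a subset of ("x",)
```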

Not producing a coordinates attribute for variables mentioned in another variable's bounds attribute (or a few other attributes, for that matter) would be entirely doable within the function linked above, and should be straightforward if you want to make a PR for that.

Making realization and the bounds show up in ds.coords rather than ds.data_vars may also skip setting the coordinates attribute, though I'm less sure of that. It would, however, add realization to the coordinates attributes of every other data_var unless you overrode that, which may not be what you want.
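For reference, that promotion is done with set_coords; a sketch on toy variables (names are illustrative):

```python
import xarray as xr

ds = xr.Dataset({"tas": ("time", [280.0, 281.5]), "realization": 1})

# promote "realization" from a data variable to a (non-dimension) coordinate
ds = ds.set_coords("realization")
assert "realization" in ds.coords
assert "realization" not in ds.data_vars
```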

claytharrison commented 1 year ago

Just wanted to mention that the PR solving this seems to only work when setting the encoding directly. Using the previous example:

>>> import numpy as np, xarray as xr
>>> data = xr.DataArray(np.random.randn(2, 3), dims=("x", "y"), coords={"x": [10, 20]})
>>> ds = xr.Dataset({"foo": data, "bar": ("x", [1, 2]), "fake": 10})
>>> ds = ds.assign({"test": 1})
>>> ds.test.encoding["coordinates"] = None
>>> ds.to_netcdf("file.nc")

This succeeds. However, trying to set the coordinates attribute to None:

>>> import numpy as np, xarray as xr
>>> data = xr.DataArray(np.random.randn(2, 3), dims=("x", "y"), coords={"x": [10, 20]})
>>> ds = xr.Dataset({"foo": data, "bar": ("x", [1, 2]), "fake": 10})
>>> ds = ds.assign({"test": 1})
>>> ds.test.attrs["coordinates"] = None
>>> ds.to_netcdf("file.nc")

throws TypeError: Invalid value for attr 'coordinates': None. For serialization to netCDF files, its value must be of one of the following types: str, Number, ndarray, number, list, tuple. This also occurs when using assign_attrs, as in ds["test"] = ds["test"].assign_attrs({"coordinates": None}).

Setting encoding using the encoding parameter of to_netcdf also doesn't work for me:

>>> import numpy as np, xarray as xr
>>> data = xr.DataArray(np.random.randn(2, 3), dims=("x", "y"), coords={"x": [10, 20]})
>>> ds = xr.Dataset({"foo": data, "bar": ("x", [1, 2]), "fake": 10})
>>> ds = ds.assign({"test": 1})
>>> ds.to_netcdf("file.nc", encoding={"test": {"coordinates": None}})

throws unexpected encoding parameters for 'netCDF4' backend: ['coordinates']. Valid encodings are: {'least_significant_digit', 'dtype', 'zlib', '_FillValue', 'contiguous', 'chunksizes', 'complevel', 'shuffle', 'fletcher32'}

This one makes sense I suppose, but it's a bit confusing when coordinates can be set to None via the data variable's encoding properties.

kmuehlbauer commented 1 year ago

@claytharrison This looks like a bug, because the error is raised in a check before we even enter conventions.py where #5514 has the supposed fix.

kmuehlbauer commented 1 year ago

I'm wondering if just removing coordinates from attrs/encoding would be a better way of getting rid of it.

kmuehlbauer commented 1 year ago

OK, I'm sure it's not a good idea to allow None as a valid value for serializing. But why not use False instead? A bool would be propagated down to the place where #5514 checks for None; we'd just need to check for False there as well. @claytharrison Would you mind opening a PR with the needed changes?

claytharrison commented 1 year ago

@kmuehlbauer Absolutely, should I have it check for either False or None, so as not to break anything, or just switch from None to False?

kmuehlbauer commented 1 year ago

@claytharrison The current test from #5514 just checks that the Dataset can be encoded/decoded without actually roundtripping via disk. I'm inclined to just switch from None to False and add a real roundtrip test in test_backend.py. But we can get other devs' insight over at your PR.
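For illustration, the kind of check being discussed could look like the following. This is a hypothetical sketch, not xarray's actual internals; the helper name is made up, and it only captures the proposed semantics (treat both None and False as an explicit opt-out of the coordinates attribute).

```python
def coordinates_suppressed(attrs, encoding):
    # hypothetical helper: the user opts out of the "coordinates" attribute
    # by setting it to None (current #5514 behaviour) or False (proposed);
    # encoding takes precedence over attrs, mirroring the discussion above
    for mapping in (encoding, attrs):
        if "coordinates" in mapping:
            return mapping["coordinates"] in (None, False)
    return False

assert coordinates_suppressed({}, {"coordinates": None})
assert coordinates_suppressed({"coordinates": False}, {})
assert not coordinates_suppressed({"coordinates": "reftime"}, {})
assert not coordinates_suppressed({}, {})
```

The advantage of False over None is that it survives the attribute-type validation that currently rejects None before conventions.py is ever reached.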