ellesmith88 opened this issue 3 years ago
Coords are on the dataset rather than the variable, so it requires something like ds = ds.reset_coords()
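As a small sketch of what that suggestion does (the dataset here is made up for illustration), `reset_coords()` with no arguments demotes all non-index coordinates back to data variables:

```python
import xarray as xr

# "realization" starts out as a scalar (non-index) coordinate
ds = xr.Dataset({"foo": ("x", [1, 2])}, coords={"realization": 0})

# reset_coords() with no arguments demotes all non-index coordinates
ds2 = ds.reset_coords()
print("realization" in ds2.data_vars)  # → True
```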
Thanks for the clear example!
I am not sure there is a good solution here.
As @max-sixty says, xarray does not carry around information that links a specific variable to a subset of coordinate variables (like a coordinates attribute). So in effect, all variables are associated with all coordinate variables.
We could special-case bounds variables (we do this in some places in the encoding/decoding functions), but `realization` will always be a problem. I think you can override it by explicitly setting the `coordinates=""` attribute. Perhaps `.attrs["coordinates"] = None` will avoid writing the attribute; if not, we could make it work?
cc @DWesl
@dcherian thanks! I've tried overriding with `.attrs["coordinates"] = None`, `.attrs["coordinates"] = ""` and `.attrs["coordinates"] = False`, but none of those seem to work either. It would be great if that worked, as it would solve my problem!
This is the function to look at if you wanted to fix it: https://github.com/pydata/xarray/blob/1f5c63379e1a3e803515f467d9dcc0c9f7b1a0db/xarray/conventions.py#L707

~`encoding` is where that information is stored between reading a dataset in from disk and saving it back out again.~ `_encode_coordinates` can take a default value from either of `encoding` or `attrs`, but a falsy value will be overwritten. Setting `.attrs["coordinates"] = " "` should work.
```python
>>> import numpy as np, xarray as xr
>>> data = xr.DataArray(np.random.randn(2, 3), dims=("x", "y"), coords={"x": [10, 20]})
>>> ds = xr.Dataset({"foo": data, "bar": ("x", [1, 2]), "fake": 10})
>>> ds = ds.assign_coords({"reftime": np.array("2004-11-01T00:00:00", dtype=np.datetime64)})
>>> ds = ds.assign({"test": 1})
>>> ds.test.encoding["coordinates"] = " "
>>> ds.to_netcdf("file.nc")
```
```
$ ncdump -h file.nc
netcdf file {
dimensions:
	x = 2 ;
	y = 3 ;
variables:
	int64 x(x) ;
	double foo(x, y) ;
		foo:_FillValue = NaN ;
		foo:coordinates = "reftime" ;
	int64 bar(x) ;
		bar:coordinates = "reftime" ;
	int64 fake ;
		fake:coordinates = "reftime" ;
	int64 reftime ;
		reftime:units = "days since 2004-11-01 00:00:00" ;
		reftime:calendar = "proleptic_gregorian" ;
	int64 test ;
		test:coordinates = " " ;
}
```
As mentioned above, the XArray data model associates coordinates with dimensions rather than with variables, so any time you read the dataset back in again, the `test` variable will gain `reftime` as a coordinate, because the dimensions of `reftime` (`()`) are a subset of the dimensions of `test` (also `()`).
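That dimension-subset rule can be seen without touching the disk at all; a minimal sketch (dataset made up for illustration):

```python
import numpy as np
import xarray as xr

# A scalar coordinate has dimensions (), which is a subset of every
# variable's dimensions, so it attaches to every variable in the dataset.
ds = xr.Dataset({"test": 1})
ds = ds.assign_coords(reftime=np.datetime64("2004-11-01"))
print("reftime" in ds["test"].coords)  # → True
```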
Not producing a `coordinates` attribute for variables mentioned in another variable's `bounds` attribute (or a few other attributes, for that matter) would be entirely doable within the function linked above, and should be straightforward if you want to make a PR for that.
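The idea behind such a PR could look roughly like this. This is a standalone sketch, not xarray's actual code; the function name and the exact list of "pointer" attributes are illustrative:

```python
def names_referenced_by_pointer_attrs(
    attrs_by_var, pointer_attrs=("bounds", "climatology", "grid_mapping")
):
    """Collect variable names mentioned in CF 'pointer' attributes.

    ``attrs_by_var`` maps variable name -> attrs dict.  The returned set is
    what an encoder could skip when deciding which variables should get a
    ``coordinates`` attribute.
    """
    referenced = set()
    for attrs in attrs_by_var.values():
        for attr in pointer_attrs:
            value = attrs.get(attr)
            if isinstance(value, str):
                referenced.update(value.split())
    return referenced


# e.g. "time_bnds" is pointed to by time's bounds attribute:
print(names_referenced_by_pointer_attrs(
    {"time": {"bounds": "time_bnds"}, "time_bnds": {}, "foo": {}}
))  # → {'time_bnds'}
```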
Making `realization` and the bounds show up in `ds.coords` rather than `ds.data_vars` may also skip setting the `coordinates` attribute, though I'm less sure of that. It would, however, add `realization` to the `coordinates` attributes of every other `data_var` unless you overrode that, which may not be what you want.
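For reference, promoting a variable with `set_coords` (a real xarray method; the dataset here is a made-up example) shows both effects: the variable lands in `ds.coords`, and as a scalar coordinate it then attaches to the other data variables:

```python
import xarray as xr

ds = xr.Dataset({"foo": ("x", [1, 2]), "realization": 0})

# Promote "realization" from a data variable to a coordinate
ds2 = ds.set_coords("realization")

print("realization" in ds2.coords)         # → True
print("realization" in ds2["foo"].coords)  # → True
```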
Just wanted to mention that the PR solving this seems to only work when setting the encoding directly. Using the previous example:
```python
>>> import numpy as np, xarray as xr
>>> data = xr.DataArray(np.random.randn(2, 3), dims=("x", "y"), coords={"x": [10, 20]})
>>> ds = xr.Dataset({"foo": data, "bar": ("x", [1, 2]), "fake": 10})
>>> ds = ds.assign({"test": 1})
>>> ds.test.encoding["coordinates"] = None
>>> ds.to_netcdf("file.nc")
```

is successful. However, trying to set the coordinates *attribute* to `None`:
```python
>>> import numpy as np, xarray as xr
>>> data = xr.DataArray(np.random.randn(2, 3), dims=("x", "y"), coords={"x": [10, 20]})
>>> ds = xr.Dataset({"foo": data, "bar": ("x", [1, 2]), "fake": 10})
>>> ds = ds.assign({"test": 1})
>>> ds.test.attrs["coordinates"] = None
>>> ds.to_netcdf("file.nc")
```

throws `TypeError: Invalid value for attr 'coordinates': None. For serialization to netCDF files, its value must be of one of the following types: str, Number, ndarray, number, list, tuple`. This also occurs when using `assign_attrs`, as in `ds["test"] = ds["test"].assign_attrs({"coordinates": None})`.
Setting encoding using the `encoding` parameter of `to_netcdf` also doesn't work for me:

```python
>>> import numpy as np, xarray as xr
>>> data = xr.DataArray(np.random.randn(2, 3), dims=("x", "y"), coords={"x": [10, 20]})
>>> ds = xr.Dataset({"foo": data, "bar": ("x", [1, 2]), "fake": 10})
>>> ds = ds.assign({"test": 1})
>>> ds.to_netcdf("file.nc", encoding={"test": {"coordinates": None}})
```

throws `unexpected encoding parameters for 'netCDF4' backend: ['coordinates']. Valid encodings are: {'least_significant_digit', 'dtype', 'zlib', '_FillValue', 'contiguous', 'chunksizes', 'complevel', 'shuffle', 'fletcher32'}`. This one makes sense, I suppose, but it's a bit confusing when `coordinates` can be set to `None` via the data variable's `encoding` properties.
@claytharrison This looks like a bug, because the error is raised in a check before we even enter `conventions.py`, where #5514 has the supposed fix.

I'm wondering if just removing `coordinates` from `attrs`/`encoding` would be a better way of getting rid of it.
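A sketch of that removal approach, assuming the stale value is sitting in `attrs` or `encoding` (the dataset and attribute value here are made up). Note this only deletes the stored hint; it does not by itself stop xarray from generating a fresh `coordinates` attribute on write:

```python
import xarray as xr

ds = xr.Dataset({"test": 1})
ds["test"].attrs["coordinates"] = "reftime"  # pretend this came from decoding

# dict.pop with a default never raises, whether or not the key exists
ds["test"].attrs.pop("coordinates", None)
ds["test"].encoding.pop("coordinates", None)

print("coordinates" in ds["test"].attrs)  # → False
```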
OK, I'm sure it's not a good idea to allow `None` as valid for serializing. But why not use `False` instead? A bool would be propagated down to the place where #5514 checks for `None`; we would just need to check for `False` there. @claytharrison Would you mind opening a PR with the needed changes?
@kmuehlbauer Absolutely, should I have it check for either `False` or `None`, so as not to break anything, or just switch from `None` to `False`?
@claytharrison The current test from #5514 just checks whether the `Dataset` can be encoded/decoded without actually roundtripping via disk. I'm inclined to just switch from `None` to `False` and add a real roundtrip test in `test_backend.py`. But we can get other devs' insight over at your PR.
What happened: Coordinates added to some variables unexpectedly.
I noticed this after outputting to netCDF. What I have:
On `time_bnds`, `lon_bnds`, `lat_bnds` and `realization` there is a `coordinates` attribute that I wouldn't expect to be there.

What you expected to happen: Looking only at the coordinates attribute, I expected my ncdump to show:
I tried to remove this in the xarray dataset, but whatever I tried they always ended up back in there:
This seems like incorrect behaviour, but perhaps it is expected?
Minimal Complete Verifiable Example:
Environment:
Output of `xr.show_versions()`:

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.7.3 (default, Mar 27 2019, 16:54:48) [Clang 4.0.1 (tags/RELEASE_401/final)]
python-bits: 64
OS: Darwin
OS-release: 18.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: ('en_GB', 'UTF-8')
libhdf5: 1.10.5
libnetcdf: 4.6.3
xarray: 0.18.2
pandas: 1.1.3
numpy: 1.19.2
scipy: None
netCDF4: 1.5.4
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.4.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2.30.0
distributed: 2.30.0
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 54.1.1
pip: 21.0.1
conda: None
pytest: 6.2.2
IPython: 7.21.0
sphinx: 1.8.1
```