pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

losing shallowness of ds.copy() on ds from xr.open_dataset #3718

Open klindsay28 opened 4 years ago

klindsay28 commented 4 years ago

MCVE Code Sample

import numpy as np
import xarray as xr

xlen = 4
x = xr.DataArray(np.linspace(0.0, 1.0, xlen), dims=('x'))

varname = 'foo'
xr.Dataset(
    {varname: xr.DataArray(np.arange(xlen, dtype='float64'),
                           dims=('x'), coords={'x': x})}
).to_netcdf('ds.nc')

with xr.open_dataset('ds.nc') as ds:
    ds2 = ds.copy()
    ds2[varname][0] = 11.0
    print(f'ds.equals = {ds.equals(ds2)}')

with xr.open_dataset('ds.nc') as ds:
    ds2 = ds.copy()
    print(f'ds.equals = {ds.equals(ds2)}')
    ds2[varname][0] = 11.0
    print(f'ds.equals = {ds.equals(ds2)}')

Expected Output

I expect the code to write out

ds.equals = True
ds.equals = True
ds.equals = True

However, when I run it, the last line is

ds.equals = False

Problem Description

The code above writes a small xr.Dataset to a netCDF file. Two context manager blocks then open the netCDF file as ds, and both start by setting ds2 to a shallow copy of ds.

In the first context manager block, a value in ds2 is modified, and ds2 is compared to ds. The Datasets are still equal, confirming that the copy is shallow.

The second context manager block is the same as the first, except that ds2 is compared to ds prior to changing the value in ds2. When this is done, the Datasets are no longer equal, indicating that ds2 is no longer a shallow copy of ds.

I don't understand how evaluating ds.equals(ds2), prior to changing a value in ds2, could decouple ds2 from ds.

I only observe this behavior when ds is set via xr.open_dataset. I don't see it when I create ds directly using xr.Dataset.
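For comparison, here is a minimal sketch of the in-memory case, where the shallow-copy semantics hold as expected (the variable name foo matches the MCVE above):

```python
import numpy as np
import xarray as xr

# Dataset created directly in memory; no file round-trip involved.
ds = xr.Dataset({'foo': ('x', np.arange(4.0))})

ds2 = ds.copy()        # shallow copy: shares the underlying numpy array
ds2['foo'][0] = 11.0   # the mutation is visible through both datasets

print(ds.equals(ds2))  # True
```

Here ds and ds2 stay equal no matter where ds.equals(ds2) is evaluated, because both wrap the same numpy array from the start.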

I'm rather perplexed by this.

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.6 | packaged by conda-forge | (default, Jan 7 2020, 22:33:48) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-693.21.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.5
libnetcdf: 4.7.3

xarray: 0.14.1
pandas: 0.25.3
numpy: 1.17.5
scipy: 1.4.1
netCDF4: 1.5.3
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.0.3.4
nc_time_axis: 1.2.0
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.1
dask: 2.9.2
distributed: 2.9.3
matplotlib: 3.1.2
cartopy: None
seaborn: None
numbagg: None
setuptools: 45.1.0.post20200119
pip: 19.3.1
conda: None
pytest: 5.3.4
IPython: 7.11.1
sphinx: None
keewis commented 4 years ago

I don't know too much about that area, but I believe that's because directly after reading a netCDF file you're not working with numpy arrays but with lazy wrapper objects. Only once some form of np.array (such as np.asarray or np.asanyarray) is called on them do they become numpy.ndarray objects.

I'm not sure if the behaviour you encountered is a bug (it probably is), but to fix your example for now, I think you should explicitly load() the dataset after opening it:

with xr.open_dataset('ds.nc').load() as ds:
    ds2 = ds.copy()
    print(f'ds.equals = {ds.equals(ds2)}')
    ds2[varname][0] = 11.0
    print(f'ds.equals = {ds.equals(ds2)}')
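Conversely, if the goal is a copy that is guaranteed to be independent of ds regardless of the backend, a deep copy sidesteps the question entirely. A minimal in-memory sketch (the variable name foo matches the MCVE):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({'foo': ('x', np.arange(4.0))})

# deep=True duplicates the underlying arrays, so edits to the
# copy never propagate back to the original dataset
ds3 = ds.copy(deep=True)
ds3['foo'][0] = 11.0

print(ds.equals(ds3))  # False: ds is unaffected
```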