pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

`polyfit` with weights alters the DataArray in place #5644

Closed scottstanie closed 1 year ago

scottstanie commented 3 years ago

What happened:

After running `da.polyfit` with weights on a DataArray, the original data has been overwritten.

What you expected to happen:

I didn't see this documented anywhere, but I did not expect that computing a polyfit dataset would clobber the original data I'm fitting to. The data isn't altered for unweighted fits, only for weighted ones.

Minimal Complete Verifiable Example:

In [2]: import xarray as xr; import numpy as np
In [3]: nz, ny, nx = (10, 20, 30)
In [4]: da = xr.DataArray(np.random.rand(nz, ny, nx), dims=['z', 'y', 'x'])
In [6]: da.mean(), da.max()
Out[6]:
(<xarray.DataArray ()>
 array(0.4963857),
 <xarray.DataArray ()>
 array(0.99996494))

In [7]: pf = da.polyfit("z", deg=2) # This will not alter the data
In [9]: da.mean(), da.max()
Out[9]:
(<xarray.DataArray ()>
 array(0.4963857),
 <xarray.DataArray ()>
 array(0.99996494))

# Passing a `w` argument alters the data
In [11]: pf = da.polyfit("z", deg=2, w=np.arange(nz))

In [12]: da.mean(), da.max()
Out[12]:
(<xarray.DataArray ()>
 array(2.24317611),
 <xarray.DataArray ()>
 array(8.95963569))

Anything else we need to know?:

I assume it's happening here: https://github.com/pydata/xarray/blob/da99a5664df4f5013c2f6b0e758394bec5e0bc80/xarray/core/dataset.py#L6805

My question is whether this is intentional, to avoid copies, or accidental?
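
To illustrate the suspicion: the post-fit mean reported above is consistent with each z-slice having been multiplied in place by its weight (original mean times the mean weight, 0.496 × 4.5 ≈ 2.23, close to the observed 2.243). Here's a minimal standalone NumPy sketch of how that kind of bug can arise, assuming the weighted path does an in-place multiply on a reshaped view of the original array; the variable names below are mine, not xarray's:

import numpy as np

data = np.random.rand(10, 20, 30)  # stands in for the DataArray's buffer
w = np.arange(10.0)

rhs = data.reshape(10, -1)         # reshaping a contiguous array returns a view
rhs *= w[:, np.newaxis]            # in-place multiply writes through the view

# data.mean() is now roughly 0.5 * w.mean() = 0.5 * 4.5 = 2.25,
# matching the ~2.243 observed above
print(data.mean())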

Environment:

Output of xr.show_versions():

INSTALLED VERSIONS
------------------
commit: None
python: 3.8.6 | packaged by conda-forge | (default, Oct 7 2020, 18:42:56) [Clang 10.0.1]
python-bits: 64
OS: Darwin
OS-release: 18.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.7.4
xarray: 0.19.0
pandas: 1.1.2
numpy: 1.20.2
scipy: 1.5.2
netCDF4: 1.5.4
pydap: None
h5netcdf: None
h5py: 3.2.1
Nio: None
zarr: 2.6.1
cftime: 1.2.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.2.1
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2.14.0
distributed: 2.20.0
matplotlib: 3.3.0
cartopy: 0.18.0
seaborn: 0.10.1
numbagg: None
pint: 0.16.1
setuptools: 49.6.0.post20200814
pip: 21.1.2
conda: 4.8.4
pytest: 6.2.4
IPython: 7.18.1
sphinx: 3.5.1
scottstanie commented 3 years ago

As a temporary workaround for my case, I'm just going to do

In [3]: pf = da.copy(deep=True).polyfit("z", deg=2, w=np.arange(nz))
In [5]: da.max(), da.mean()
Out[5]:
(<xarray.DataArray ()>
 array(0.99878237),
 <xarray.DataArray ()>
 array(0.50869358))

The thing I don't understand is that `_to_temp_dataset` seems to be requesting a deep copy (based on the argument names):

https://github.com/pydata/xarray/blob/da99a5664df4f5013c2f6b0e758394bec5e0bc80/xarray/core/dataarray.py#L490-L491

But it is acting like a shallow copy:

In [6]: pf = da.copy(deep=False).polyfit("z", deg=2, w=np.arange(nz))
In [7]: da.max(), da.mean()
Out[7]:
(<xarray.DataArray ()>
 array(8.92217147),
 <xarray.DataArray ()>
 array(2.29014397))
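
For reference, a quick standalone demonstration of the shallow-vs-deep copy semantics at play, using only public xarray API (a sketch, separate from whatever polyfit does internally):

import numpy as np
import xarray as xr

da = xr.DataArray(np.zeros(3))

shallow = da.copy(deep=False)  # new object, but it shares the underlying buffer
shallow.values[:] = 1
print(da.values)               # [1. 1. 1.]: the write went through to da

deep = da.copy(deep=True)      # owns an independent copy of the data
deep.values[:] = 2
print(da.values)               # still [1. 1. 1.]: da untouched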
dcherian commented 3 years ago

Yikes, thanks for the nice report!

malmans2 commented 1 year ago

Hi! I was about to open a new issue about this, but it looks like it's a known issue and there's a stale PR... Let me know if I can help get this fixed!

headtr1ck commented 1 year ago

You can always reach out to the author of the original PR and comment there.

It looks like that particular PR reached a dead end, though, and should be completely rewritten. In any case, the reviewers left helpful remarks on how to proceed.

dcherian commented 1 year ago

Do you know where the in-place modification is happening? We could just copy there and fix this particular issue.
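
For illustration, the kind of one-line change that fixes this class of bug is replacing an in-place multiply with one that allocates. A minimal sketch, assuming the culprit is an in-place `*=` on a view of the caller's data (the function names here are hypothetical, not xarray's):

import numpy as np

def scale_rows_inplace(rhs, w):
    rhs *= w[:, np.newaxis]        # buggy: writes through if rhs is a view
    return rhs

def scale_rows_copying(rhs, w):
    return rhs * w[:, np.newaxis]  # fixed: allocates a fresh array

data = np.ones((4, 3))
w = np.arange(4.0)

scale_rows_copying(data.reshape(4, -1), w)
print(data.sum())                  # 12.0: original intact

scale_rows_inplace(data.reshape(4, -1), w)
print(data.sum())                  # 18.0: rows scaled by 0..3 in place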

malmans2 commented 1 year ago

> Do you know where the in-place modification is happening? We could just copy there and fix this particular issue.

Not sure, but I'll take a look!