pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Unexpected NaNs in broadcast #7385

Open dopplershift opened 1 year ago

dopplershift commented 1 year ago

What happened?

When running the broadcast in the sample code, I end up with NaN in the output even though there are none in the original source array. While I know the construction is really odd (this came from user-submitted code), I'm shocked that it resulted in NaNs in the resulting broadcast data, and I honestly assumed MetPy's code was doing something dumb for quite a while. I would have expected (regardless of the nature of the coordinates) the result for broad_a to be [[1, 2], [1, 2]].

What did you expect to happen?

No response

Minimal Complete Verifiable Example

import numpy as np
import xarray as xr

# Two single-variable datasets that share the same 'lev' coordinate
levs = np.array([100000, 85000])
a = xr.Dataset({'a': (('lev',), [1, 2])}, coords={'lev': levs}).to_array()
b = xr.Dataset({'b': (('lev',), [3, 4])}, coords={'lev': levs}).to_array()

broad_a, broad_b = xr.broadcast(a, b)
print(broad_a)

Relevant log output

<xarray.DataArray (variable: 2, lev: 2)>
array([[ 1.,  2.],
       [nan, nan]])
Coordinates:
  * lev       (lev) int64 100000 85000
  * variable  (variable) object 'a' 'b'

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:31:57) [Clang 14.0.6 ]
python-bits: 64
OS: Darwin
OS-release: 21.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.8.1
xarray: 2022.12.0
pandas: 1.5.2
numpy: 1.23.5
scipy: 1.9.3
netCDF4: 1.6.2
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.13.3
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: 0.9.10.3
iris: None
bottleneck: 1.3.5
dask: 2022.6.1
distributed: 2022.6.1
matplotlib: 3.6.2
cartopy: 0.21.0
seaborn: None
numbagg: None
fsspec: 2022.11.0
cupy: None
pint: 0.20.1
sparse: None
flox: None
numpy_groupies: None
setuptools: 65.5.1
pip: 22.3.1
conda: None
pytest: 7.2.0
mypy: 0.991
IPython: 8.7.0
sphinx: 5.3.0

dcherian commented 1 year ago

to_array adds a new dimension named variable, whose coordinate value is 'a' for the first array and 'b' for the second.

When you align these, the variable coordinates do not match, so NaNs are inserted. I would add a squeeze after to_array().
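A minimal sketch of the suggested workaround, reusing the arrays from the example in the report. It assumes that squeezing the length-1 `variable` dimension is acceptable for the caller's use case; after the squeeze, `variable` is only a scalar coordinate and is no longer a dimension to be aligned.

```python
import numpy as np
import xarray as xr

levs = np.array([100000, 85000])

# Squeeze the length-1 'variable' dimension that to_array() adds,
# so the two arrays only share the 'lev' dimension.
a = xr.Dataset({"a": (("lev",), [1, 2])}, coords={"lev": levs}).to_array().squeeze("variable")
b = xr.Dataset({"b": (("lev",), [3, 4])}, coords={"lev": levs}).to_array().squeeze("variable")

broad_a, broad_b = xr.broadcast(a, b)

# Both results are 1-D along 'lev' and keep their original values.
print(broad_a.dims, broad_a.values)
print(broad_b.dims, broad_b.values)
```

With this change the only shared dimension is `lev`, whose labels are identical in both arrays, so alignment has nothing to fill.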

headtr1ck commented 1 year ago

@dopplershift does this answer fix your problem?

dopplershift commented 1 year ago

@dcherian Is this behavior (filling with fill_value -> inserting NaNs) because they share common dimensionality in terms of name, but have different coordinate values? My expectation was something that operated more like numpy broadcasting (repeating values, not filling with anything else).

I can understand how xarray's data model yields this behavior, but in that case it might be good to improve the docs for xarray.broadcast, because they say nothing about this behavior, which (it seems to me) mimics xarray.align.
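For contrast, here is a small sketch of the case where `xr.broadcast` does behave like numpy broadcasting. The arrays `x` and `y` and the dimension name `time` are made up for illustration; because the two inputs share no dimension, there are no labels to reconcile and values are simply repeated.

```python
import numpy as np
import xarray as xr

# Hypothetical inputs with disjoint dimensions: nothing needs aligning.
x = xr.DataArray([1, 2], dims="lev")
y = xr.DataArray([10, 20, 30], dims="time")

x2, y2 = xr.broadcast(x, y)

# Values are repeated along the missing dimension, as in numpy.
print(x2.dims, x2.shape)
print(x2.values)
print(y2.values)
```

The NaNs in the report only appear when the inputs share a dimension name but carry different labels along it.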

dcherian commented 1 year ago

Is this behavior (filling with fill_value -> inserting NaNs) because they share common dimensionality in terms of name, but have different coordinate values?

Yes, broadcasting does alignment with an outer join by default; see https://github.com/pydata/xarray/issues/6304. This is conceptually pretty confusing.
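The outer-join behavior can be seen by calling `xr.align` directly on the arrays from the report. This is only an illustrative sketch: it reproduces the NaN pattern with an explicit `join="outer"` and compares it with what `xr.broadcast` returns.

```python
import numpy as np
import xarray as xr

levs = np.array([100000, 85000])
a = xr.Dataset({"a": (("lev",), [1, 2])}, coords={"lev": levs}).to_array()
b = xr.Dataset({"b": (("lev",), [3, 4])}, coords={"lev": levs}).to_array()

# Outer join: 'variable' becomes the union of the labels 'a' and 'b',
# and labels missing from an input are filled with NaN.
aligned_a, aligned_b = xr.align(a, b, join="outer")
broad_a, broad_b = xr.broadcast(a, b)

print(aligned_a.values)
print(np.array_equal(aligned_a.values, broad_a.values, equal_nan=True))
```

Here `a` has no data for the label 'b', so that row is filled with NaN, which also forces the integer data to float.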

I agree we should document this.