pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Numpy 2 incompatibility with xr.concat on DataArrays with scalar np.str_ coord #9600

Closed Zyantist closed 1 month ago

Zyantist commented 1 month ago

What happened?

Upgrading numpy from 1.26.4 to 2.1.2 breaks my code. I went through several pages of issues looking for "concat", but none seemed to fit.

xr.concat no longer seems to work on a list of DataArrays that are to be concatenated along a scalar coordinate. When the DataArrays were created, xarray used to convert a scalar coord of type np.str_ to a numpy array with dtype <U... . This conversion seems to be gone and, without it, my code no longer works.

Instead, a rather cryptic error message appears (full traceback below; here is the last part):

File ~/Projects/temp/xarray_numpy_bug/.env/lib64/python3.12/site-packages/xarray/core/variable.py:1387, in Variable.set_dims(self, dim, shape)
   1385 else:
   1386     indexer = (None,) * (len(expanded_dims) - self.ndim) + (...,)
-> 1387     expanded_data = self.data[indexer]
   1389 expanded_var = Variable(
   1390     expanded_dims, expanded_data, self._attrs, self._encoding, fastpath=True
   1391 )
   1392 return expanded_var.transpose(*dim)

TypeError: string indices must be integers, not 'tuple'

self.data with latest numpy is just a string version of a UUID (was formerly converted to a numpy array) and the indexer is (None, Ellipsis).
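
The failure described above can be reproduced with numpy alone. The following is a minimal sketch, not xarray's actual code path: it assumes only what the traceback shows, namely that the coordinate data is a bare np.str_ scalar and that the indexer is (None, Ellipsis). The variable names are illustrative.

```python
import uuid

import numpy as np

# A numpy string scalar, like the one passed as the scalar coordinate.
scalar = np.str_(str(uuid.uuid4()))

# The indexer from the traceback: add one leading axis, keep the rest.
indexer = (None, ...)

# np.str_ is a subclass of Python's str, so indexing it with a tuple
# is handled by str and rejected.
try:
    scalar[indexer]
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# Wrapped in a 0-d numpy array, the same indexer simply prepends
# a length-1 axis.
as_array = np.asarray(scalar)
expanded = as_array[indexer]
print(as_array.dtype, as_array.shape, expanded.shape)
```

This suggests the indexing step only works when the scalar has already been wrapped in a 0-d array, which matches the numpy 1.26.4 output shown in this report.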

What did you expect to happen?

In contrast to the output posted below in the "Minimal Complete Verifiable Example" and "Relevant log output", I expected the output that I get with numpy version 1.26.4:

xr.concat([xarr, xarr2], dim=("scalar_coord"))
<xarray.DataArray (scalar_coord: 2, abc: 3)> Size: 48B
array([[1., 1., 1.],
       [1., 1., 1.]])
Coordinates:
  * abc           (abc) <U1 12B 'a' 'b' 'c'
  * scalar_coord  (scalar_coord) <U36 288B '90ff719e-6e3b-434a-b4f1-facfa168b...

where `xarr.coords["scalar_coord"]` looks like this (an array created from a scalar):

<xarray.DataArray 'scalar_coord' ()> Size: 144B
array('90ff719e-6e3b-434a-b4f1-facfa168b2e1', dtype='<U36')
Coordinates:
    scalar_coord  <U36 144B '90ff719e-6e3b-434a-b4f1-facfa168b2e1'

Minimal Complete Verifiable Example

# Python 3.12.3 (main, Apr 17 2024, 00:00:00) [GCC 14.0.1 20240411 (Red Hat 14.0.1-0)]
# Type 'copyright', 'credits' or 'license' for more information
# IPython 8.28.0 -- An enhanced Interactive Python. Type '?' for help.

import xarray as xr
import numpy as np
import uuid
xr.__version__
# '2024.9.0'
np.__version__
# '2.1.2'
xarr = xr.DataArray(np.array([1.0,] * 3, dtype=np.float64), dims=("abc"), coords=dict(abc=np.array(list("abc"), dtype="<U1"), scalar_coord=np.str_(uuid.uuid4())))

xarr.coords["scalar_coord"]
# <xarray.DataArray 'scalar_coord' ()> Size: 144B
# np.str_('56382178-7f7d-4ec8-a4c1-8ebee96ec8df')
# Coordinates:
#     scalar_coord  <U36 144B ...
# 
xarr.coords["scalar_coord"].data
# np.str_('56382178-7f7d-4ec8-a4c1-8ebee96ec8df')
xarr2 = xr.DataArray(np.array([1.0,] * 3, dtype=np.float64), dims=("abc"), coords=dict(abc=np.array(list("abc"), dtype="<U1"), scalar_coord=np.str_(uuid.uuid4())))
xr.concat([xarr, xarr2], dim=("scalar_coord"))
# see error in "Relevant log output"

Relevant log output

In [10]: xr.concat([xarr, xarr2], dim=("scalar_coord"))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[10], line 1
----> 1 xr.concat([xarr, xarr2], dim=("scalar_coord"))

File ~/Projects/temp/xarray_numpy_bug/.env/lib64/python3.12/site-packages/xarray/core/concat.py:264, in concat(objs, dim, data_vars, coords, compat, positions, fill_value, join, combine_attrs, create_index_for_new_dim)
    259     raise ValueError(
    260         f"compat={compat!r} invalid: must be 'broadcast_equals', 'equals', 'identical', 'no_conflicts' or 'override'"
    261     )
    263 if isinstance(first_obj, DataArray):
--> 264     return _dataarray_concat(
    265         objs,
    266         dim=dim,
    267         data_vars=data_vars,
    268         coords=coords,
    269         compat=compat,
    270         positions=positions,
    271         fill_value=fill_value,
    272         join=join,
    273         combine_attrs=combine_attrs,
    274         create_index_for_new_dim=create_index_for_new_dim,
    275     )
    276 elif isinstance(first_obj, Dataset):
    277     return _dataset_concat(
    278         objs,
    279         dim=dim,
   (...)
    287         create_index_for_new_dim=create_index_for_new_dim,
    288     )

File ~/Projects/temp/xarray_numpy_bug/.env/lib64/python3.12/site-packages/xarray/core/concat.py:755, in _dataarray_concat(arrays, dim, data_vars, coords, compat, positions, fill_value, join, combine_attrs, create_index_for_new_dim)
    752             arr = arr.rename(name)
    753     datasets.append(arr._to_temp_dataset())
--> 755 ds = _dataset_concat(
    756     datasets,
    757     dim,
    758     data_vars,
    759     coords,
    760     compat,
    761     positions,
    762     fill_value=fill_value,
    763     join=join,
    764     combine_attrs=combine_attrs,
    765     create_index_for_new_dim=create_index_for_new_dim,
    766 )
    768 merged_attrs = merge_attrs([da.attrs for da in arrays], combine_attrs)
    770 result = arrays[0]._from_temp_dataset(ds, name)

File ~/Projects/temp/xarray_numpy_bug/.env/lib64/python3.12/site-packages/xarray/core/concat.py:540, in _dataset_concat(datasets, dim, data_vars, coords, compat, positions, fill_value, join, combine_attrs, create_index_for_new_dim)
    535 # case where concat dimension is a coordinate or data_var but not a dimension
    536 if (
    537     dim_name in coord_names or dim_name in data_names
    538 ) and dim_name not in dim_names:
    539     datasets = [
--> 540         ds.expand_dims(dim_name, create_index_for_new_dim=create_index_for_new_dim)
    541         for ds in datasets
    542     ]
    544 # determine which variables to concatenate
    545 concat_over, equals, concat_dim_lengths = _calc_concat_over(
    546     datasets, dim_name, dim_names, data_vars, coords, compat
    547 )

File ~/Projects/temp/xarray_numpy_bug/.env/lib64/python3.12/site-packages/xarray/core/dataset.py:4797, in Dataset.expand_dims(self, dim, axis, create_index_for_new_dim, **dim_kwargs)
   4793 if k not in variables:
   4794     if k in coord_names and create_index_for_new_dim:
   4795         # If dims includes a label of a non-dimension coordinate,
   4796         # it will be promoted to a 1D coordinate with a single value.
-> 4797         index, index_vars = create_default_index_implicit(v.set_dims(k))
   4798         indexes[k] = index
   4799         variables.update(index_vars)

File ~/Projects/temp/xarray_numpy_bug/.env/lib64/python3.12/site-packages/xarray/util/deprecation_helpers.py:143, in deprecate_dims.<locals>.wrapper(*args, **kwargs)
    135     emit_user_level_warning(
    136         f"The `{old_name}` argument has been renamed to `dim`, and will be removed "
    137         "in the future. This renaming is taking place throughout xarray over the "
   (...)
    140         PendingDeprecationWarning,
    141     )
    142     kwargs["dim"] = kwargs.pop(old_name)
--> 143 return func(*args, **kwargs)

File ~/Projects/temp/xarray_numpy_bug/.env/lib64/python3.12/site-packages/xarray/core/variable.py:1387, in Variable.set_dims(self, dim, shape)
   1385 else:
   1386     indexer = (None,) * (len(expanded_dims) - self.ndim) + (...,)
-> 1387     expanded_data = self.data[indexer]
   1389 expanded_var = Variable(
   1390     expanded_dims, expanded_data, self._attrs, self._encoding, fastpath=True
   1391 )
   1392 return expanded_var.transpose(*dim)

TypeError: string indices must be integers, not 'tuple'

Anything else we need to know?

I created a fresh fedora container and created two new virtual environments in which I executed the exact same code to ensure this really has just to do with xarray and numpy versions.

I went through all 3 pages of open issues on "concat" and read those that appeared to possibly be relevant, but none seemed to match my case. Truly sorry if I overlooked something!

$  toolbox create -i fedora-toolbox:40 xarray_fedora40
$  toolbox enter xarray_fedora40
$  cd Projects/temp
$  mkdir xarray_numpy_bug
$  cd xarray_numpy_bug/
$  python --version
Python 3.12.3
$  python -m venv .env
$  . .env/bin/activate
$  pip --isolated install --upgrade pip ipython setuptools
$  pip --isolated install xarray
$  ipython
# broken code with latest numpy
$  deactivate
$  python -m venv .env_old_numpy
$  . .env_old_numpy/bin/activate
$  pip --isolated install --upgrade pip ipython setuptools
$  pip --isolated install xarray
$  pip --isolated install numpy==1.26.4
$  ipython
# same code with "old" numpy, works as before

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.12.3 (main, Apr 17 2024, 00:00:00) [GCC 14.0.1 20240411 (Red Hat 14.0.1-0)]
python-bits: 64
OS: Linux
OS-release: 6.10.12-200.fc40.x86_64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: ('en_GB', 'UTF-8')
libhdf5: None
libnetcdf: None
xarray: 2024.9.0
pandas: 2.2.3
numpy: 2.1.2
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
zarr: None
cftime: None
nc_time_axis: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 75.1.0
pip: 24.2
conda: None
pytest: None
mypy: None
IPython: 8.28.0
sphinx: None

welcome[bot] commented 1 month ago

Thanks for opening your first issue here at xarray! Be sure to follow the issue template! If you have an idea for a solution, we would really welcome a Pull Request with proposed changes. See the Contributing Guide for more. It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better. Thank you!

keewis commented 1 month ago

thanks for the detailed report.

I believe this is the same as #9399, which was fixed by #9403 (and I can't reproduce on main with numpy>=2.1). We're only waiting on a release now, which should happen soon(-ish).

Zyantist commented 1 month ago

Thank you for the quick reply. I just tested it with the main branch and it does indeed work.