pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Dataset.weighted along a dimension not on weights errors #8679

Open mathause opened 5 months ago

mathause commented 5 months ago

What happened?

ds.weighted(weights).mean(dims) errors when reducing over a dimension that is neither on the weights nor on the variable.

What did you expect to happen?

This used to work and was "broken" by #8606. However, we may want to fix this by ignoring (?) those data vars instead (#7027).
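
For illustration, a user-level sketch of what ignoring those data vars could look like (a hedged workaround, not the proposed fix; the variable selection below is only illustrative, and it should still work on current main because every selected variable carries the "y" dimension):

import xarray as xr

ds = xr.Dataset({"a": (("y", "x"), [[1, 2]]), "scalar": 1})
weights = xr.DataArray([1, 2], dims="x")

# reduce only the variables that actually have the "y" dimension,
# then merge the untouched variables back in
with_dim = [name for name, var in ds.data_vars.items() if "y" in var.dims]
reduced = ds[with_dim].weighted(weights).mean("y")
result = xr.merge([reduced, ds.drop_vars(with_dim)])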

Minimal Complete Verifiable Example

import xarray as xr

ds = xr.Dataset({"a": (("y", "x"), [[1, 2]]), "scalar": 1})
weights = xr.DataArray([1, 2], dims="x")

ds.weighted(weights).mean("y")

MVCE confirmation

Relevant log output

ValueError                                Traceback (most recent call last)
Cell In[1], line 6
      3 ds = xr.Dataset({"a": (("y", "x"), [[1, 2]]), "scalar": 1})
      4 weights = xr.DataArray([1, 2], dims="x")
----> 6 ds.weighted(weights).mean("y")

File ~/code/xarray/xarray/util/deprecation_helpers.py:115, in _deprecate_positional_args.<locals>._decorator.<locals>.inner(*args, **kwargs)
    111     kwargs.update({name: arg for name, arg in zip_args})
    113     return func(*args[:-n_extra_args], **kwargs)
--> 115 return func(*args, **kwargs)

File ~/code/xarray/xarray/core/weighted.py:497, in Weighted.mean(self, dim, skipna, keep_attrs)
    489 @_deprecate_positional_args("v2023.10.0")
    490 def mean(
    491     self,
   (...)
    495     keep_attrs: bool | None = None,
    496 ) -> T_Xarray:
--> 497     return self._implementation(
    498         self._weighted_mean, dim=dim, skipna=skipna, keep_attrs=keep_attrs
    499     )

File ~/code/xarray/xarray/core/weighted.py:558, in DatasetWeighted._implementation(self, func, dim, **kwargs)
    555 def _implementation(self, func, dim, **kwargs) -> Dataset:
    556     self._check_dim(dim)
--> 558     return self.obj.map(func, dim=dim, **kwargs)

File ~/code/xarray/xarray/core/dataset.py:6924, in Dataset.map(self, func, keep_attrs, args, **kwargs)
   6922 if keep_attrs is None:
   6923     keep_attrs = _get_keep_attrs(default=False)
-> 6924 variables = {
   6925     k: maybe_wrap_array(v, func(v, *args, **kwargs))
   6926     for k, v in self.data_vars.items()
   6927 }
   6928 if keep_attrs:
   6929     for k, v in variables.items():

File ~/code/xarray/xarray/core/dataset.py:6925, in <dictcomp>(.0)
   6922 if keep_attrs is None:
   6923     keep_attrs = _get_keep_attrs(default=False)
   6924 variables = {
-> 6925     k: maybe_wrap_array(v, func(v, *args, **kwargs))
   6926     for k, v in self.data_vars.items()
   6927 }
   6928 if keep_attrs:
   6929     for k, v in variables.items():

File ~/code/xarray/xarray/core/weighted.py:286, in Weighted._weighted_mean(self, da, dim, skipna)
    278 def _weighted_mean(
    279     self,
    280     da: T_DataArray,
    281     dim: Dims = None,
    282     skipna: bool | None = None,
    283 ) -> T_DataArray:
    284     """Reduce a DataArray by a weighted ``mean`` along some dimension(s)."""
--> 286     weighted_sum = self._weighted_sum(da, dim=dim, skipna=skipna)
    288     sum_of_weights = self._sum_of_weights(da, dim=dim)
    290     return weighted_sum / sum_of_weights

File ~/code/xarray/xarray/core/weighted.py:276, in Weighted._weighted_sum(self, da, dim, skipna)
    268 def _weighted_sum(
    269     self,
    270     da: T_DataArray,
    271     dim: Dims = None,
    272     skipna: bool | None = None,
    273 ) -> T_DataArray:
    274     """Reduce a DataArray by a weighted ``sum`` along some dimension(s)."""
--> 276     return self._reduce(da, self.weights, dim=dim, skipna=skipna)

File ~/code/xarray/xarray/core/weighted.py:231, in Weighted._reduce(da, weights, dim, skipna)
    227     da = da.fillna(0.0)
    229 # `dot` does not broadcast arrays, so this avoids creating a large
    230 # DataArray (if `weights` has additional dimensions)
--> 231 return dot(da, weights, dim=dim)

File ~/code/xarray/xarray/util/deprecation_helpers.py:140, in deprecate_dims.<locals>.wrapper(*args, **kwargs)
    132     emit_user_level_warning(
    133         "The `dims` argument has been renamed to `dim`, and will be removed "
    134         "in the future. This renaming is taking place throughout xarray over the "
   (...)
    137         PendingDeprecationWarning,
    138     )
    139     kwargs["dim"] = kwargs.pop("dims")
--> 140 return func(*args, **kwargs)

File ~/code/xarray/xarray/core/computation.py:1885, in dot(dim, *arrays, **kwargs)
   1883     dim = tuple(d for d, c in dim_counts.items() if c > 1)
   1884 else:
-> 1885     dim = parse_dims(dim, all_dims=tuple(all_dims))
   1887 dot_dims: set[Hashable] = set(dim)
   1889 # dimensions to be parallelized

File ~/code/xarray/xarray/core/utils.py:1046, in parse_dims(dim, all_dims, check_exists, replace_none)
   1044     dim = (dim,)
   1045 if check_exists:
-> 1046     _check_dims(set(dim), set(all_dims))
   1047 return tuple(dim)

File ~/code/xarray/xarray/core/utils.py:1131, in _check_dims(dim, all_dims)
   1129 if wrong_dims:
   1130     wrong_dims_str = ", ".join(f"'{d!s}'" for d in wrong_dims)
-> 1131     raise ValueError(
   1132         f"Dimension(s) {wrong_dims_str} do not exist. Expected one or more of {all_dims}"
   1133     )

ValueError: Dimension(s) 'y' do not exist. Expected one or more of {'x'}

Anything else we need to know?

No response

Environment

Newest main (i.e. 2024.01)

etienneschalk commented 5 months ago

Related issues:

etienneschalk commented 5 months ago

Hello,

I started digging into this but kind of got lost in the code, so I changed my approach. I assume that the weighted reductions' behaviour should be aligned as much as possible with the regular, non-weighted reductions, which is currently not the case. So the logic has to be aligned somehow, and potentially refactored away. It seems this family of bugs (or, better said, unexpected behaviours) is rooted in diverging implementations and a lack of delegation to the existing default logic. Of course, this is easier said than done!

I have written a test describing common scenarios, comparing non-weighted (regular) reductions with weighted reductions. You can find the code of the test (test_non_weighted_vs_weighted_behaviour) here for more details: https://github.com/etienneschalk/xarray/blob/ab0488caf3e81e37e02beeb709cf0c46f72a7c83/xarray/tests/test_weighted.py#L916 (:warning: this is a work in progress; the code is not clean, I am experimenting with breaking things!)

Here is the behaviour matrix summarizing the testing approach:

The "Weighted (expect)" column is identical to "Non-weighted (reference)", I put it to emphasizes that the behaviours should match, but maybe not 100% in case subtle unavoidable variations are discovered? For now I made the assumption that it should be a perfect match.

| Scenario | Non-weighted (reference) | Weighted (expected) | Weighted (current) | Is current aligned with reference? |
| --- | --- | --- | --- | --- |
| (1) DataArray: xr.DataArray(1) | :x: r"'x' not found in array dimensions ()" (handled on DataArray level) | :x: r"'x' not found in array dimensions ()" (handled on DataArray level) | :x: r"'x' not found in array dimensions ()" (handled on DataArray level) | :white_check_mark: |
| (2) Dataset without reduction dim: xr.Dataset({"var": 1}) | :x: r"Dimensions ('x',) not found in data dimensions ()" (handled on Dataset level) | :x: r"Dimensions ('x',) not found in data dimensions ()" (handled on Dataset level) | :x: r"'x' not found in array dimensions ()" (handled on DataArray level) | :x: |
| (3) Dataset with reduction dim: xr.Dataset({"var": 1, "x_dependant": ("x", [2, 4])}) | :white_check_mark: (handled on Dataset level) | :white_check_mark: (handled on Dataset level) | :x: r"'x' not found in array dimensions ()" (handled on DataArray level) | :x: |

So, according to this table, the issue is that the weighted behaviour for Dataset is not as specialized as the non-weighted one. Indeed, the non-weighted Dataset reductions allow ignoring the reduction for a variable that does not depend on the reducing dimension; the only check seems to be that the reduced Dataset contains at least one variable depending on the reducing dimension. This "lenient" approach is more user-friendly, as it is common to have Datasets with variables of various dimensions, so ignoring some of these variables simplifies the user experience. Reductions over DataArrays, in contrast, are stricter, since the user is expected to know the dimensions of the single DataArray being reduced (as opposed to the many variables in a Dataset).
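
To make the reference behaviour concrete, here is a small script reproducing the three non-weighted scenarios (the error strings in the comments are the ones from the table above):

import xarray as xr

# (3) Dataset with the reduction dim on at least one variable:
# "var" has no "x" dimension and is simply carried through unchanged
ds = xr.Dataset({"var": 1, "x_dependant": ("x", [2, 4])})
print(ds.mean("x"))

# (2) Dataset where no variable has the reduction dim: handled on Dataset level
try:
    xr.Dataset({"var": 1}).mean("x")
except ValueError as err:
    print(err)  # Dimensions ('x',) not found in data dimensions ()

# (1) DataArray reductions are strict: handled on DataArray level
try:
    xr.DataArray(1).mean("x")
except ValueError as err:
    print(err)  # 'x' not found in array dimensions ()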

The question now would be: how to align weighted reductions with their non-weighted counterparts, so that these "UX choices" are made only once (in the non-weighted code, which serves as the reference)?
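
Purely as a sketch of what the aligned semantics could look like from user code (a hypothetical helper, not a patch to xarray and not necessarily the preferred design): reduce each variable only over the requested dims it actually has, and carry the other variables through unchanged, mirroring the non-weighted Dataset reductions.

import xarray as xr

def lenient_weighted_mean(ds: xr.Dataset, weights: xr.DataArray, dim) -> xr.Dataset:
    # hypothetical helper illustrating the aligned per-variable behaviour;
    # note that it skips the Dataset-level check that the dim exists on at
    # least one variable
    dims = [dim] if isinstance(dim, str) else list(dim)

    def per_variable(da: xr.DataArray) -> xr.DataArray:
        present = [d for d in dims if d in da.dims]
        if not present:
            # variable does not depend on the reduction dim(s): carry it
            # through unchanged, like the non-weighted Dataset.mean does
            return da
        return da.weighted(weights).mean(present)

    return ds.map(per_variable)

ds = xr.Dataset({"a": (("y", "x"), [[1, 2]]), "scalar": 1})
weights = xr.DataArray([1, 2], dims="x")
print(lenient_weighted_mean(ds, weights, "y"))  # no error; "scalar" is kept as-is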