pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

DataArray.mean drops coordinates #9168

derhintze opened 6 days ago

derhintze commented 6 days ago

What happened?

Averaging the data variables along some dimension drops coordinates that also have that dimension.

What did you expect to happen?

I would expect the coordinates not to be dropped, but to be averaged along that dimension as well.

Minimal Complete Verifiable Example

import numpy as np
import xarray as xr

data = xr.DataArray(
    np.ones((3, 2)),
    dims=["dim0", "dim1"],
    coords={"foo": (("dim0", "dim1"), np.zeros((3, 2)))},
)

print(data.mean(dim="dim0"))

Relevant log output

<xarray.DataArray (dim1: 2)> Size: 16B
array([1., 1.])
Dimensions without coordinates: dim1

Anything else we need to know?

I had a look at #1470 and #3510, but those appear unrelated?

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.9.7 (default, Jan 16 2024, 12:46:10) [GCC 4.8.5 20150623 (Red Hat 4.8.5-44)]
python-bits: 64
OS: Linux
OS-release: 3.10.0-1160.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.9.3-development

xarray: 2024.5.0
pandas: 2.2.2
numpy: 1.26.2
scipy: 1.13.1
netCDF4: 1.6.4
pydap: None
h5netcdf: None
h5py: None
zarr: None
cftime: 1.6.3
nc_time_axis: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: 3.9.0
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 57.4.0
pip: 21.2.3
conda: None
pytest: 8.2.2
mypy: 1.10.0
IPython: 8.16.1
sphinx: None
derhintze commented 6 days ago

Can confirm that the output is the same with xarray 2024.6.0

keewis commented 6 days ago

I believe this may be intentional (though I may be wrong): reducing the coordinates with the same operation as the data is often not useful, so xarray drops them instead.

If you really need this, you can convert them to data variables first using .reset_coords(names), do the reduction, then use .set_coords(names).
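
A minimal sketch of that round trip on the MVCE above. One assumption: reset_coords with drop=False needs a named array, so the array gets the placeholder name "data" first:

import numpy as np
import xarray as xr

data = xr.DataArray(
    np.ones((3, 2)),
    dims=["dim0", "dim1"],
    coords={"foo": (("dim0", "dim1"), np.zeros((3, 2)))},
)

result = (
    data.rename("data")      # reset_coords needs a named array
    .reset_coords("foo")     # Dataset with variables "data" and "foo"
    .mean(dim="dim0")        # both variables are averaged over dim0
    .set_coords("foo")       # demote "foo" back to a coordinate
    ["data"]                 # recover the DataArray, coord intact
)
print(result.coords)  # "foo" survives, averaged over dim0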

derhintze commented 6 days ago

@keewis Thanks! I'm not sure it's "often" not so useful, tho ;) I can't come up with a reasonable example from our field (2D sensor data processing), but I get the point. I did what you suggested as a work-around, but I had hoped for a better solution. A bit tedious. The thing is, coarsen does average coords by default. So some contraption like

data.coarsen({"dim0": data.sizes["dim0"]}).mean().squeeze("dim0")

would work. But reading that, imho, suggests that data.mean(dim="dim0") should do the same... but well, that's subjective ;)
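
For reference, a runnable version of that contraption on the MVCE data; coarsen's default coord_func="mean" is what keeps "foo" alive:

import numpy as np
import xarray as xr

data = xr.DataArray(
    np.ones((3, 2)),
    dims=["dim0", "dim1"],
    coords={"foo": (("dim0", "dim1"), np.zeros((3, 2)))},
)

# One window spanning all of dim0, then drop the resulting size-1 dim.
out = data.coarsen({"dim0": data.sizes["dim0"]}).mean().squeeze("dim0")
print(out.coords)  # "foo" is still present, averaged over dim0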

max-sixty commented 6 days ago

This is indeed intentional: the role of coordinates is to hold things that aren't computed along. That's particularly the case when doing something like .shift (we don't want the coords shifting), but it's also the case with a reduction.

Are there times when xarray is inconsistent there? Is there an example of something that "should" be a coordinate but should also be reduced over?
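
To illustrate that separation, a toy sketch (using .shift as an example of a transformation that should leave labels alone):

import numpy as np
import xarray as xr

da = xr.DataArray(
    np.arange(4.0),
    dims="time",
    coords={"time": [0, 1, 2, 3]},
)

# The values shift by one step, but the "time" labels stay put:
# coordinates describe where the data lives, not the data itself.
print(da.shift(time=1))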

headtr1ck commented 5 days ago

Maybe we could add an option to the reductions that allows changing this behavior? Something like data.mean(dim="dim0", coords="mean"), with a default value of "drop".

But the workaround could be sufficient here.
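
To make the proposal concrete, here is a user-side sketch of those semantics; the coords= keyword and the helper itself are hypothetical, not part of xarray's API:

import xarray as xr

def mean_with_coords(obj: xr.DataArray, dim: str, coords: str = "drop") -> xr.DataArray:
    # Hypothetical semantics: "drop" matches current behavior, while
    # "mean" also averages every coordinate spanning the reduced dim.
    reduced = obj.mean(dim=dim)
    if coords == "mean":
        for name, coord in obj.coords.items():
            if dim in coord.dims:
                reduced = reduced.assign_coords({name: coord.mean(dim=dim)})
    return reduced

With the MVCE data, mean_with_coords(data, "dim0", coords="mean") returns the averaged array with "foo" kept as a coordinate.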

derhintze commented 5 days ago

@max-sixty

Are there times when xarray is inconsistent there?

Well, apart from the coarsen behaviour I described above, where coarsen does reduce over coords, no, none that I'm aware of. To be fair, though, it's documented that coarsen averages coords by default.

Is there an example of something that "should" be a coordinate but should also be reduced over?

That's a hard question, since it depends on conventions around what people put into coords. We have time series of 2D sensor images as data variables that we want to operate on, and we add coordinates containing metadata: temperatures, time stamps, measurement-specific inputs like light-source wavelength or power. In all of those cases, when averaging over the time series of 2D sensor data, we'd like to average the coordinates, too.
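
A toy version of that layout (all names and values illustrative):

import numpy as np
import xarray as xr

# A time series of 2D sensor frames with per-frame metadata as coords.
frames = xr.DataArray(
    np.random.default_rng(0).random((10, 4, 4)),
    dims=("time", "y", "x"),
    coords={"temperature": ("time", np.linspace(20.0, 25.0, 10))},
)

# Averaging over time drops "temperature", although its mean would be
# meaningful metadata for the averaged image.
print(frames.mean(dim="time").coords)  # "temperature" is gone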

Granted, given there are work-arounds, and we can implement our own wrapping for this sort of stuff, it's not a big deal.

max-sixty commented 5 days ago

Yes, very reasonable, @derhintze!

Good point around coarsen. I do think that's somewhat specific to coarsen, where it's applying a transformation to coords / labels. I agree it makes the separation a bit fuzzier.

I would vote to retain the behavior around coords — data.mean(dim="dim0", coords="mean") seems not much simpler than moving coords to vars and introduces more surface area to the API...