pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0
3.63k stars 1.09k forks source link

Why can't `map_blocks()` just infer the coordinate values of the output Dataset/DataArray, instead of requiring them in `template` ? #9635

Closed v-morello closed 2 weeks ago

v-morello commented 1 month ago

What is your issue?

Hello,

I am not sure whether to treat this as a bug or as a feature request yet. The title says it all, and I've attached a minimum example below. Let's say I want to average some data by blocks, but also average its attached time coordinate:

import numpy as np
import xarray
import dask.array as da

xds = xarray.Dataset(
    data_vars={"data": (("time",), np.linspace(0.0, 1.0, 10, endpoint=False))},
    coords={"time": np.arange(10)},
)
xds = xds.chunk({"time": 5})  # want to process two separate time intervals

def my_average(dataset):
    mean_time = dataset.time.mean()
    return dataset.mean(dim="time", keepdims=True).assign_coords({"time": [mean_time]})

### Hoping that map_blocks fills the output time coordinate, error
template = xarray.Dataset(
    data_vars={
        "data": (("time",), da.zeros(shape=(2,)))
    },
    coords={
        "time": da.zeros(shape=(2,)),
    },
)
template = template.chunk({"time": 1})  # align output chunks with input chunks

xarray.map_blocks(my_average, xds, template=template).compute()  # ValueError

The error is:

ValueError: Expected index 'time' to be PandasIndex(Index([0.0], dtype='float64', name='time')). Received PandasIndex(Index([7.0], dtype='float64', name='time')) instead.

Everything works as expected if I provide the correct time coordinate values in the template argument. In fact, the example shown in the function documentation does precisely this, but does not explain why it is necessary.

### Correct output time coordinate provided in advance, works
template = xarray.Dataset(
    data_vars={"data": (("time",), da.zeros(shape=(2,)))},
    coords={"time": [2.0, 7.0]},
)
template = template.chunk({"time": 1}) 

result = xarray.map_blocks(my_average, xds, template=template).compute()  # OK

I understand why "template" must specify the length of every output dimension and be chunked properly. However, is there a fundamental reason why coordinate values are treated differently from data variables here? I am wondering if that limitation could be removed in principle, because that would help our actual use case (and we might be able to help with a hypothetical fix).

welcome[bot] commented 1 month ago

Thanks for opening your first issue here at xarray! Be sure to follow the issue template! If you have an idea for a solution, we would really welcome a Pull Request with proposed changes. See the Contributing Guide for more. It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better. Thank you!

dcherian commented 1 month ago

Because to infer these values, we'd have to execute the whole computation. The model here is to stay lazy, which requires that the user specify a lot of information in the template.

More generally, since time is an "indexed coordinate" here, we must know its values eagerly at least for now.

What can we add to the docstring to make this clearer?

v-morello commented 1 month ago

@dcherian Thanks for your reply!

The model here is to stay lazy.

OK, I get it now. If we require map_blocks() to be lazy, then for example .sel() still needs to work on the result without requiring execution of the map_blocks() call, and for that the coordinate value must already be present.

What can we add to the docstring to make this clearer?

I leave whether that is necessary to your better judgement (no one seems to have asked that question in the few years of existence of the map_blocks() function). Something along the lines of what I wrote just above would do the trick IMHO.