pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Unclear error message when combine_by_coords doesn't find an index #9201

Closed TomNicholas closed 1 month ago

TomNicholas commented 3 months ago

What is your issue?

The error you get from inside xr.combine_by_coords when a 1D dimension coordinate is not backed by an index uses outdated wording. That's because the message predates the indexes refactor, and this failure case wasn't anticipated at the time of writing.

The reproducer below uses the VirtualiZarr package, but only as a shortcut to generate a dataset that has 1D coordinates not backed by indexes. You could construct a pure-xarray reproducer.

In [1]: from virtualizarr import open_virtual_dataset

In [2]: import xarray as xr

In [3]: ds1 = open_virtual_dataset('air1.nc', indexes={})

In [4]: ds2 = open_virtual_dataset('air2.nc', indexes={})

In [5]: xr.combine_by_coords([ds1, ds2], coords='minimal', compat='override')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[5], line 1
----> 1 xr.combine_by_coords([ds1, ds2], coords='minimal', compat='override')

File ~/miniconda3/envs/numpy2.0_released/lib/python3.11/site-packages/xarray/core/combine.py:958, in combine_by_coords(data_objects, compat, data_vars, coords, fill_value, join, combine_attrs)
    954     grouped_by_vars = itertools.groupby(sorted_datasets, key=vars_as_keys)
    956     # Perform the multidimensional combine on each group of data variables
    957     # before merging back together
--> 958     concatenated_grouped_by_data_vars = tuple(
    959         _combine_single_variable_hypercube(
    960             tuple(datasets_with_same_vars),
    961             fill_value=fill_value,
    962             data_vars=data_vars,
    963             coords=coords,
    964             compat=compat,
    965             join=join,
    966             combine_attrs=combine_attrs,
    967         )
    968         for vars, datasets_with_same_vars in grouped_by_vars
    969     )
    971 return merge(
    972     concatenated_grouped_by_data_vars,
    973     compat=compat,
   (...)
    976     combine_attrs=combine_attrs,
    977 )

File ~/miniconda3/envs/numpy2.0_released/lib/python3.11/site-packages/xarray/core/combine.py:959, in <genexpr>(.0)
    954     grouped_by_vars = itertools.groupby(sorted_datasets, key=vars_as_keys)
    956     # Perform the multidimensional combine on each group of data variables
    957     # before merging back together
    958     concatenated_grouped_by_data_vars = tuple(
--> 959         _combine_single_variable_hypercube(
    960             tuple(datasets_with_same_vars),
    961             fill_value=fill_value,
    962             data_vars=data_vars,
    963             coords=coords,
    964             compat=compat,
    965             join=join,
    966             combine_attrs=combine_attrs,
    967         )
    968         for vars, datasets_with_same_vars in grouped_by_vars
    969     )
    971 return merge(
    972     concatenated_grouped_by_data_vars,
    973     compat=compat,
   (...)
    976     combine_attrs=combine_attrs,
    977 )

File ~/miniconda3/envs/numpy2.0_released/lib/python3.11/site-packages/xarray/core/combine.py:619, in _combine_single_variable_hypercube(datasets, fill_value, data_vars, coords, compat, join, combine_attrs)
    613 if len(datasets) == 0:
    614     raise ValueError(
    615         "At least one Dataset is required to resolve variable names "
    616         "for combined hypercube."
    617     )
--> 619 combined_ids, concat_dims = _infer_concat_order_from_coords(list(datasets))
    621 if fill_value is None:
    622     # check that datasets form complete hypercube
    623     _check_shape_tile_ids(combined_ids)

File ~/miniconda3/envs/numpy2.0_released/lib/python3.11/site-packages/xarray/core/combine.py:92, in _infer_concat_order_from_coords(datasets)
     90 indexes = [ds._indexes.get(dim) for ds in datasets]
     91 if any(index is None for index in indexes):
---> 92     raise ValueError(
     93         "Every dimension needs a coordinate for "
     94         "inferring concatenation order"
     95     )
     97 # TODO (benbovy, flexible indexes): support flexible indexes?
     98 indexes = [index.to_pandas_index() for index in indexes]

ValueError: Every dimension needs a coordinate for inferring concatenation order

In this reproducer the dimension time has a coordinate, it just doesn't have an index backing that coordinate. The error message also doesn't say which dimension is the problem.
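For reference, the coordinate-without-an-index state can be observed directly by comparing a dataset's `coords` with its public `xindexes` mapping. A minimal sketch, using `drop_indexes` in place of the VirtualiZarr shortcut:

```python
import xarray as xr

# Build a dataset with a 1D "time" coordinate, then drop the index
# that xarray created for it. The coordinate variable remains.
ds = xr.Dataset(coords={"time": [1, 2, 3]}).drop_indexes("time")

print("time" in ds.coords)    # the coordinate is present
print("time" in ds.xindexes)  # ...but no index backs it
```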

This error message should say something more like

"ValueError: Every dimension requires a corresponding 1D coordinate and index for inferring concatenation order but the coordinate 'time' has no corresponding index"

One could even argue that the name combine_by_coords should really be combine_using_indexes ...
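As a rough sketch of what the improved check could look like, here is a standalone function mirroring the logic in `_infer_concat_order_from_coords` (the function name and its exact wiring into xarray are illustrative, not xarray's actual internals; it reuses the private `ds._indexes` accessor seen in the traceback above):

```python
import xarray as xr

def check_dim_indexes(datasets, dim):
    """Hypothetical sketch: raise a ValueError that names the offending
    dimension when its 1D coordinate has no backing index."""
    indexes = [ds._indexes.get(dim) for ds in datasets]
    if any(index is None for index in indexes):
        raise ValueError(
            "Every dimension requires a corresponding 1D coordinate "
            "and index for inferring concatenation order, but the "
            f"coordinate {dim!r} has no corresponding index"
        )
    return indexes
```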

Shripad1020 commented 3 months ago

Hi @TomNicholas,

I want to work on this issue. I'm new to the open-source world and eager to contribute. I'm currently going through the "How to Contribute" documentation. Could you please assign this issue to me?

Thank you!

TomNicholas commented 3 months ago

Hi @Shripad1020! Thanks for your interest. We don't generally assign issues to people but you're welcome to submit a pull request.

I think the trickiest part of this issue for a newcomer would be making a pure-xarray reproducer, but I can help you with that.

Shripad1020 commented 3 months ago

Thank you for your response!

I appreciate your offer to help with making a pure-xarray reproducer. I have gone through the xarray documentation and have a brief idea of what it is and how it works. Could you please explain more about what you mean by making a pure-xarray reproducer?

TomNicholas commented 3 months ago

I have gone through the xarray documentation and have a brief idea of what it is and how it works.

Have you used xarray before?

Could you please explain more about what you mean by making a pure-xarray reproducer?

Generally when raising an issue it is encouraged to post a snippet of code that reproduces the bug, where that code snippet is as simple as possible whilst still exposing the bug. In my example above I was a little lazy, and posted a snippet which uses another library which wraps xarray to reproduce a bug that's inside xarray. If my example had not used any other library it would be a "pure xarray reproducer" of the bug.

One reason it's encouraged to post a pure xarray reproducer is that any pull request to fix a bug should include a new unit test verifying that the bug has been fixed. These unit tests need to be written into xarray's test suite, and so cannot rely on external libraries. The pure xarray reproducer (also known as an MCVE) is usually a good starting point for this unit test.
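For illustration, a unit test built from such a reproducer might look roughly like the following (the test name is made up, and the `match` string is just the phrase shared by the current and proposed messages):

```python
import pytest
import xarray as xr

def test_combine_by_coords_missing_index_error():
    # Two objects whose 1D "x" coordinates have no backing index
    da1 = xr.DataArray([1, 2, 3], dims="x", coords={"x": [1, 2, 3]}).drop_indexes("x")
    da2 = xr.DataArray([4, 5, 6], dims="x", coords={"x": [4, 5, 6]}).drop_indexes("x")
    with pytest.raises(ValueError, match="inferring concatenation order"):
        xr.combine_by_coords([da1, da2])
```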

nicrie commented 1 month ago

I guess the following works as a pure xarray reproducer:

import xarray as xr

da1 = xr.DataArray([1, 2, 3], dims="x", coords={"x": [1, 2, 3]})
da2 = xr.DataArray([1, 2, 3], dims="x", coords={"x": [4, 5, 6]})
da1 = da1.drop_indexes("x")
da2 = da2.drop_indexes("x")
xr.combine_by_coords([da1, da2])

resulting in

ValueError: Every dimension needs a coordinate for inferring concatenation order