pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

fillna removes non-indexed coordinates in some cases #9124

Open JGuetschow opened 3 weeks ago

JGuetschow commented 3 weeks ago

What happened?

fillna (and other functions that require alignment) removes non-indexed extra coordinates that lie along a dimension which also has an indexed coordinate.

Below I have an example with fillna, but I have experienced similar behavior with combine_first and assume it will happen whenever alignment is needed. More on what I think is happening further down.

What did you expect to happen?

I expected the extra coordinate to remain in place as it was defined consistently in both datasets. Alternatively an error message would also help to understand what's going on. Currently it's dropped silently.

Minimal Complete Verifiable Example

import xarray as xr
import numpy as np

# create a dataset with a single dimension and constant data
area_iso3 = np.array(["COL", "ARG", "MEX", "BOL"])

test_ds = xr.Dataset(
    { "CO2": 
          xr.DataArray(data=np.ones(len(area_iso3)),
                       coords={
                           "area (ISO3)": area_iso3,
                       },
                       dims=["area (ISO3)"])
    }
) 

# attach an additional coordinate to the existing dimension
country_names = ["Colombia", "Argentina", "Mexico", "Bolivia"]
test_ds = test_ds.assign_coords(country_name=("area (ISO3)", country_names))

# select a subset with .loc along the dimension that carries the additional coordinate
test_ds_loc = test_ds.loc[{'area (ISO3)': ['COL', 'ARG']}]
print(f"test_ds_loc coords: {test_ds_loc.coords}")

# set some values to nan to fill later
test_ds["CO2"].loc[{'area (ISO3)': ['COL', 'ARG']}] = np.nan

# fill
test_ds = test_ds.fillna(test_ds_loc)
print(f"test_ds coords after fillna: {test_ds.coords}\n")
# additional coordinate gone

Relevant log output

No error is raised and no log output is generated. The output of the example code is:

test_ds_loc coords: Coordinates:
  * area (ISO3)   (area (ISO3)) <U3 24B 'COL' 'ARG'
    country_name  (area (ISO3)) <U9 72B 'Colombia' 'Argentina'
test_ds coords after fillna: Coordinates:
  * area (ISO3)  (area (ISO3)) <U3 48B 'COL' 'ARG' 'MEX' 'BOL'

Anything else we need to know?

The problem only occurs when the .loc selection is made along the dimension that carries the additional coordinate. I think the reason for the problem is the following:

align fills the additional coordinate of the smaller dataset (the same happens for a DataArray) with np.nan to expand it to the size of the larger dataset. Later in the process the aligned coordinates are passed to merge_coordinates_without_align, which removes the additional coordinate because it contains np.nan in one dataset where the other dataset has values.

So the non-index coordinates are neither combined like the indexed coordinates, nor filled like the data variables.

Below is a short example involving only align and merge_coordinates_without_align.

import xarray as xr
from xarray.core.merge import merge_coordinates_without_align
import numpy as np

# create a dataset with a single dimension and constant data
area_iso3 = np.array(["COL", "ARG", "MEX", "BOL"])

test_ds = xr.Dataset(
    { "CO2":
          xr.DataArray(data=np.ones(4),
                       coords={
                           "area (ISO3)": area_iso3,
                       },
                       dims=["area (ISO3)"])
      }
)

# attach an additional coordinate to one of the existing dimensions
country_names = ["Colombia", "Argentina", "Mexico", "Bolivia"]
test_ds = test_ds.assign_coords(country_name=("area (ISO3)", country_names))

test_ds_loc = test_ds.loc[{'area (ISO3)': ['COL', 'ARG']}]
print(f"test_ds_loc coords: {test_ds_loc.coords}")

aligned = xr.align(test_ds,test_ds_loc,join='outer')
merged = merge_coordinates_without_align(aligned)
print(f"merged coords: {merged[0]}")
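
The first half of the explanation (align padding the extra coordinate with missing values) can also be checked on its own. The following is a minimal sketch, assuming the same data as in the examples; the names `dim`, `ds`, `subset`, `full` and `expanded` are only illustrative.

```python
import numpy as np
import xarray as xr

dim = "area (ISO3)"
ds = xr.Dataset(
    {"CO2": (dim, np.ones(4))},
    coords={
        dim: ["COL", "ARG", "MEX", "BOL"],
        # non-indexed coordinate along the same dimension
        "country_name": (dim, ["Colombia", "Argentina", "Mexico", "Bolivia"]),
    },
)
subset = ds.loc[{dim: ["COL", "ARG"]}]

# an outer join expands the subset to the union of both indexes
full, expanded = xr.align(ds, subset, join="outer")

# labels that were absent from the subset have no country name after alignment
missing = expanded["country_name"].isnull()
print(expanded["country_name"].values)
print(f"missing country names after align: {int(missing.sum())}")
```

If the count printed here is non-zero, the two aligned objects no longer agree on country_name, which is consistent with the coordinate being treated as conflicting in the later merge step.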

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
python-bits: 64
OS: Linux
OS-release: 6.5.0-35-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.2
libnetcdf: None
xarray: 2024.6.0
pandas: 2.2.2
numpy: 1.26.4
scipy: 1.13.1
netCDF4: None
pydap: None
h5netcdf: 1.3.0
h5py: 3.11.0
zarr: None
cftime: None
nc_time_axis: None
iris: None
bottleneck: 1.3.8
dask: 2023.12.1
distributed: None
matplotlib: 3.9.0
cartopy: None
seaborn: None
numbagg: None
fsspec: 2024.6.0
cupy: None
pint: 0.24
sparse: None
flox: None
numpy_groupies: None
setuptools: 70.0.0
pip: 24.0
conda: None
pytest: 7.4.4
mypy: 1.10.0
IPython: 8.25.0
sphinx: 5.3.0

max-sixty commented 1 week ago

This does seem like confusing behavior. We'd def welcome a fix.

JGuetschow commented 1 week ago

So far I just have a workaround for our use case which merges the additional coordinates independently (treating them like variables). I looked into the xarray code, but without sufficient time to understand it, I think chances are high that I would break more than I fix. I probably won't have time to dig deeper before November, but if it's still open then I'll take a look. An easy workaround is actually to use merge where possible, as I could not reproduce the problem with merge (but I only tested the use cases in our code, so I might have missed something).
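
For readers who need something in the meantime, here is a minimal sketch of a different and simpler stopgap than the ones described above. It re-attaches any non-indexed coordinate that went missing after fillna. `fillna_keep_coords` is a hypothetical helper, not part of xarray. It assumes the dataset being filled already defines the coordinate for every label, which holds here because fillna keeps the index of the object it is called on.

```python
import numpy as np
import xarray as xr


def fillna_keep_coords(ds, other):
    """Hypothetical helper: run fillna, then restore coordinates it dropped."""
    filled = ds.fillna(other)
    # collect coordinates of the original dataset that are absent from the result
    dropped = {
        name: ds.coords[name].variable
        for name in ds.coords
        if name not in filled.coords
    }
    return filled.assign_coords(dropped)


dim = "area (ISO3)"
ds = xr.Dataset(
    {"CO2": (dim, np.ones(4))},
    coords={
        dim: ["COL", "ARG", "MEX", "BOL"],
        "country_name": (dim, ["Colombia", "Argentina", "Mexico", "Bolivia"]),
    },
)
subset = ds.loc[{dim: ["COL", "ARG"]}]
ds["CO2"].loc[{dim: ["COL", "ARG"]}] = np.nan

result = fillna_keep_coords(ds, subset)
print(result.coords)
```

The helper does nothing extra when no coordinate is dropped, so it should stay harmless if the underlying behavior changes. It does not cover combine_first or other outer-join operations, where the result can contain labels that the original dataset has no coordinate values for.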