Strings in coordinates may be truncated when saving concatenated rasters to zarr

ikding commented 1 month ago

What happened?

I would like to concatenate two dataarrays or two datasets, save them to zarr, read them into xarray again at a later time, and have all the coordinates intact.

However, if I concatenate two dataarrays or datasets along a string-based coordinate, if the first dataarray in the concatenation has a coordinate with shorter maximum string length than the second dataarray, saving the concatenated dataset to zarr can truncate the string.

This problem only arises when:

First dataarray's string coordinate has an encoding["dtype"] attribute
First dataarray's string coordinate length is shorter than the second dataarray's.

What did you expect to happen?

I expect the xarray dataset that was read from the zarr to have the string coordinate the same as the input dataset that I wrote in, with no truncation.

Minimal Complete Verifiable Example

import logging
import tempfile
from pathlib import Path

import dask.array as da
import numpy as np
import xarray as xr

logger = logging.getLogger(__name__)

def mb_raster_with_custom_band_names(bands: list[str]) -> xr.DataArray:
    """
    Generates a multi-band raster with custom band names.

    This function creates a multi-band raster with a specified list of band names. The raster is generated with random
    data and has a shape of (len(bands), 100, 100). The raster is assigned the EPSG:32733 coordinate reference system.

    :param bands: A list of strings specifying the band names for the raster.
    :return: A multi-band raster with custom band names as an xarray DataArray.
    """
    n_x = 100
    n_y = 100

    data_with_custom_band_names = xr.DataArray(
        (da.random.random_sample(size=(len(bands), n_x, n_y)) * 255).astype("float32"),
        coords={"band": bands, "y": np.arange(n_y), "x": np.arange(n_x)},
    )
    return data_with_custom_band_names

def zarr_round_trip(raster: xr.DataArray | xr.Dataset) -> xr.DataArray:
    """Simple function to test the round trip of a raster to zarr and back."""
    with tempfile.TemporaryDirectory() as tmpdirname:
        tmp_zarr_path = Path(tmpdirname) / "tmp_zarr.zarr"
        raster.to_zarr(tmp_zarr_path, mode="w")

        round_trip_raster = xr.open_dataset(tmp_zarr_path, engine="zarr")

    return round_trip_raster

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)

    # Make a multi-band raster with short band names.
    rgb_raster = mb_raster_with_custom_band_names(["blue", "green", "red"])
    # Force the band encoding dtype to be equal to the band value dtype. This is what was triggering the band name
    # truncation issue when saving to zarr.
    rgb_raster.band.encoding["dtype"] = rgb_raster.band.dtype

    # Make another multi-band raster with long band names.
    long_rgb_raster = mb_raster_with_custom_band_names(["the_sky_is_blue", "the_grass_is_green", "the_rose_is_red"])
    long_rgb_raster.band.encoding["dtype"] = long_rgb_raster.band.dtype

    # concat two rasters with xr_concat (short band name raster first)
    concat_raster_short_first = xr.concat([rgb_raster, long_rgb_raster], dim="band")

    logger.info(f"{concat_raster_short_first.band.values=}")
    # array(['blue', 'green', 'red', 'the_sky_is_blue', 'the_grass_is_green', 'the_rose_is_red'], dtype='<U18')

    logger.info(f"{concat_raster_short_first.band.dtype=}")
    # dtype('<U18')

    logger.info(f"{concat_raster_short_first.band.encoding['dtype']=}")
    # dtype('<U5')

    concat_raster_short_first_reread = zarr_round_trip(concat_raster_short_first)
    logger.info(f"{concat_raster_short_first_reread.band.values=}")
    # array(['blue', 'green', 'red', 'the_s', 'the_g', 'the_r'], dtype='<U5')

    logger.info(f"{concat_raster_short_first_reread.band.dtype=}")
    # dtype('<U5')

    logger.info(f"{concat_raster_short_first_reread.band.encoding['dtype']=}")
    # dtype('<U5')

MVCE confirmation

[X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
[X] Complete example — the example is self-contained, including all data and the text of any traceback.
[X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
[X] New issue — a search of GitHub Issues suggests this is not a duplicate.
[X] Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

INFO:__main__:concat_raster_short_first.band.values=array(['blue', 'green', 'red', 'the_sky_is_blue', 'the_grass_is_green',
       'the_rose_is_red'], dtype='<U18')
INFO:__main__:concat_raster_short_first.band.dtype=dtype('<U18')
INFO:__main__:concat_raster_short_first.band.encoding['dtype']=dtype('<U5')
INFO:__main__:concat_raster_short_first_reread.band.values=array(['blue', 'green', 'red', 'the_s', 'the_g', 'the_r'], dtype='<U5')
INFO:__main__:concat_raster_short_first_reread.band.dtype=dtype('<U5')
INFO:__main__:concat_raster_short_first_reread.band.encoding['dtype']=dtype('<U5')

Anything else we need to know?

The order of the dataarray in xr.concat mattered. If you run xr.concat([rgb_raster, long_rgb_raster], dim="band"), the truncation will happen; but if you run xr.concat([long_rgb_raster, rgb_raster], dim="band") , it will not.

I attempted to unwrap the call chain when we save a dataset or dataarray to zarr:

In the xarray.DataArray.to_zarr() function, xarray.DataArray is implicitly converted to xarray.Dataset before we save it to zarr: code.
Because we’re calling xarray.DataArray.to_dataset() directly with no additional argument, it calls xarray.DataArray._to_dataset_whole().
In xarray.DataArray._to_dataset_whole(), the coordinates attribute (a private attr called ._coords) of the dataarray is implicitly copied to the variable of the dataset. So whatever was in our dataset variables, it came from the ._coords private attr of the datarray.
The next clue as to how the dataset variables get saved to zarr is in xarray.backends.ZarrStore. Specifically, the dataset variable go through this encode_zarr_variable function. That function in turn calls xarray.conventions.encode_cf_variable. In there, the function just uses whatever the variable encoding was in the dataset, which came from the coordinate encoding from the data array.

Looks like a similar issue was handled in 2014 (https://github.com/pydata/xarray/issues/217); this bug appears to a corner case of when the encoding attr of a particular coordinate was left intact during concatenation.

Environment

INSTALLED VERSIONS ------------------ commit: None python: 3.11.8 (main, Mar 12 2024, 11:52:02) [GCC 12.2.0] python-bits: 64 OS: Linux OS-release: 6.6.26-linuxkit machine: x86_64 processor: byteorder: little LC_ALL: C.UTF-8 LANG: C.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.12.2 libnetcdf: 4.9.1 xarray: 2024.5.0 pandas: 2.1.4 numpy: 1.23.5 scipy: 1.10.1 netCDF4: 1.6.3 pydap: None h5netcdf: 1.1.0 h5py: 3.8.0 zarr: 2.16.1 cftime: 1.6.2 nc_time_axis: None iris: None bottleneck: 1.3.7 dask: 2024.4.2 distributed: 2024.4.2 matplotlib: 3.7.1 cartopy: 0.22.0 seaborn: 0.12.2 numbagg: 0.2.2 fsspec: 2024.2.0 cupy: None pint: None sparse: 0.13.0 flox: 0.6.10 numpy_groupies: 0.9.20 setuptools: 69.0.3 pip: 23.3.2 conda: None pytest: 7.4.3 mypy: 1.8.0 IPython: 8.12.0 sphinx: 7.2.6

welcome[bot] commented 1 month ago

Thanks for opening your first issue here at xarray! Be sure to follow the issue template! If you have an idea for a solution, we would really welcome a Pull Request with proposed changes. See the Contributing Guide for more. It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better. Thank you!

max-sixty commented 1 month ago

That does look confusing & frustrating.

A workaround would be .drop_encoding(). (When I'm using xarray for my own work, I generally always do this, but not sure that's necessarily correct...)

@pydata/xarray do we have a view / plan for how to handle encoding? We've discussed removing it in the past. I think it's one of those issues that experienced people can get around, but trip up new users, which means it's generally underweight in how core team & contributors (who are more experienced than average) prioritize issues.

Is there a case for defaulting to dropping encoding by default on read?

pydata / xarray