pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Change default for concat_characters to False in open_* functions #4452

Open eric-czech opened 3 years ago

eric-czech commented 3 years ago

I wanted to propose that concat_characters default to False for `open_{dataset,zarr,dataarray}`. I'm not sure how often this affects anyone, since working with individual character arrays is probably rare, but it's a particularly bad default in genetics. We often represent individual variations as single characters, and the concatenation is destructive: we can't invert it when one of the characters is an empty string (which often corresponds to a deletion at a base pair location), and the order of the characters matters.
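To make the invertibility problem concrete, here is a minimal pure-Python sketch. It assumes the concatenation behaves like a plain join of the single characters along the last dimension; the helper name `join_chars` and the sample allele data are hypothetical, and xarray's actual decoding may differ in detail.

```python
# Hypothetical stand-in for character concatenation: join the
# single-character entries of one row into one byte string.
def join_chars(row):
    return b"".join(row)


# Two different variant records, each with a deletion (empty string)
# at a different position.
deletion_in_middle = [b"A", b"", b"T"]
deletion_at_end = [b"A", b"T", b""]

joined_middle = join_chars(deletion_in_middle)
joined_end = join_chars(deletion_at_end)

# Both rows collapse to the same joined value, so the position of the
# deletion cannot be recovered from the joined string alone.
print(joined_middle, joined_end)
print(joined_middle == joined_end)
```

Under that assumption the two distinct inputs become indistinguishable after joining, which is the loss of information described above.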

I also find it to be confusing behavior (e.g. https://github.com/pydata/xarray/issues/4405) since no other arrays are automatically transformed like this when deserialized.

If I submit a PR for this, would anybody object?

shoyer commented 3 years ago

I agree that there is no good reason to use concat_characters for zarr, which supports normal fixed-width string datatypes.

For netCDF, we do need concat_characters for the "NC_CHAR" dtype, which is used to store strings in lieu of a true fixed-width string dtype. It's ugly, but otherwise we won't be able to round-trip string dtype arrays from xarray into netCDF3 files. This note from NetCDF.jl does a nice job of explaining.

dcherian commented 3 years ago

we could make the default None in open_data* and set True/False appropriately for netCDF/Zarr backends?

technically we would need to warn for a couple of releases before changing the default in open_zarr but maybe no one cares too much?
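A rough sketch of how a None default could be resolved per backend, with an optional warning during a deprecation period. Everything here is an assumption for illustration: the helper `resolve_concat_characters`, the `engine` strings, and the `in_deprecation` flag are hypothetical and not part of xarray's API.

```python
import warnings


def resolve_concat_characters(concat_characters=None, engine="netcdf4",
                              in_deprecation=False):
    """Hypothetical helper: pick a backend-appropriate default.

    An explicit True/False from the caller always wins. None means
    "let the backend decide".
    """
    if concat_characters is not None:
        return bool(concat_characters)
    if engine == "zarr":
        if in_deprecation:
            # Keep the old behavior for now, but tell users it will change.
            warnings.warn(
                "The default for concat_characters with zarr will change "
                "to False in a future release; pass it explicitly to "
                "silence this warning.",
                FutureWarning,
                stacklevel=2,
            )
            return True
        return False
    # netCDF backends still need concatenation to round-trip NC_CHAR data.
    return True


print(resolve_concat_characters(engine="zarr"))
print(resolve_concat_characters(engine="netcdf4"))
print(resolve_concat_characters(True, engine="zarr"))
```

Callers who pass the argument explicitly would see no change in behavior; only those relying on the default would be affected, and only for zarr.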