Error when writing string coordinate variables to zarr

jsadler2 commented 4 years ago

I saved an xarray dataset to zarr using to_zarr. I then later tried to read that dataset from the original zarr, re-chunk it, and then write to a new zarr. When I did that I get a strange error. I attached a zip of minimal version of the zarr dataset that I am using. test_sm_zarr.zip

MCVE Code Sample

import xarray as xr

sm_from_zarr = xr.open_zarr('test_sm_zarr')
sm_from_zarr.to_zarr('test_sm_zarr_from', mode='w')

Expected Output

No error

Problem Description

I get this error:

C:\Users\jsadler\AppData\Local\Continuum\anaconda3\envs\nwm\lib\site-packages\xarray\core\merge.py:18: FutureWarning: The Panel class is removed from pandas. Accessing it from the top-level namespace will also be removed in the next version
  PANDAS_TYPES = (pd.Series, pd.DataFrame, pd.Panel)
C:\Users\jsadler\AppData\Local\Continuum\anaconda3\envs\nwm\lib\site-packages\xarray\core\dataarray.py:1829: FutureWarning: The Panel class is removed from pandas. Accessing it from the top-level namespace will also be removed in the next version
  'DataArray', pd.Series, pd.DataFrame, pd.Panel]:
Traceback (most recent call last):
  File "rechunk_test.py", line 38, in <module>
    sm_from_zarr.to_zarr('test_sm_zarr_from', mode='w')
  File "C:\Users\jsadler\AppData\Local\Continuum\anaconda3\envs\nwm\lib\site-packages\xarray\core\dataset.py", line 1414, in to_zarr
    consolidated=consolidated, append_dim=append_dim)
  File "C:\Users\jsadler\AppData\Local\Continuum\anaconda3\envs\nwm\lib\site-packages\xarray\backends\api.py", line 1101, in to_zarr
    dump_to_store(dataset, zstore, writer, encoding=encoding)
  File "C:\Users\jsadler\AppData\Local\Continuum\anaconda3\envs\nwm\lib\site-packages\xarray\backends\api.py", line 929, in dump_to_store
    unlimited_dims=unlimited_dims)
  File "C:\Users\jsadler\AppData\Local\Continuum\anaconda3\envs\nwm\lib\site-packages\xarray\backends\zarr.py", line 366, in store
    unlimited_dims=unlimited_dims)
  File "C:\Users\jsadler\AppData\Local\Continuum\anaconda3\envs\nwm\lib\site-packages\xarray\backends\zarr.py", line 432, in set_variables
    writer.add(v.data, zarr_array)
  File "C:\Users\jsadler\AppData\Local\Continuum\anaconda3\envs\nwm\lib\site-packages\xarray\backends\common.py", line 173, in add
    target[...] = source
  File "C:\Users\jsadler\AppData\Local\Continuum\anaconda3\envs\nwm\lib\site-packages\zarr\core.py", line 1115, in __setitem__
    self.set_basic_selection(selection, value, fields=fields)
  File "C:\Users\jsadler\AppData\Local\Continuum\anaconda3\envs\nwm\lib\site-packages\zarr\core.py", line 1210, in set_basic_selection
    return self._set_basic_selection_nd(selection, value, fields=fields)
  File "C:\Users\jsadler\AppData\Local\Continuum\anaconda3\envs\nwm\lib\site-packages\zarr\core.py", line 1501, in _set_basic_selection_nd
    self._set_selection(indexer, value, fields=fields)
  File "C:\Users\jsadler\AppData\Local\Continuum\anaconda3\envs\nwm\lib\site-packages\zarr\core.py", line 1550, in _set_selection
    self._chunk_setitem(chunk_coords, chunk_selection, chunk_value, fields=fields)
  File "C:\Users\jsadler\AppData\Local\Continuum\anaconda3\envs\nwm\lib\site-packages\zarr\core.py", line 1659, in _chunk_setitem
    fields=fields)
  File "C:\Users\jsadler\AppData\Local\Continuum\anaconda3\envs\nwm\lib\site-packages\zarr\core.py", line 1723, in _chunk_setitem_nosync
    cdata = self._encode_chunk(chunk)
  File "C:\Users\jsadler\AppData\Local\Continuum\anaconda3\envs\nwm\lib\site-packages\zarr\core.py", line 1769, in _encode_chunk
    chunk = f.encode(chunk)
  File "numcodecs/vlen.pyx", line 108, in numcodecs.vlen.VLenUTF8.encode
TypeError: expected unicode string, found 20

BUT I think it has something to do with the datatype of one of my coordinates, site_code. Because, if it do this I get no error:

import xarray as xr

sm_from_zarr = xr.open_zarr('test_sm_zarr')
sm_from_zarr['site_code'] = sm_from_zarr.site_code.astype('str')
sm_from_zarr.to_zarr('test_sm_zarr_from', mode='w')

Before converting the datatype of the site_code coordinate is object

Output of `xr.show_versions()`

INSTALLED VERSIONS ------------------ commit: None python: 3.7.4 (default, Aug 9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)] python-bits: 64 OS: Windows OS-release: 10 machine: AMD64 processor: Intel64 Family 6 Model 158 Stepping 10, GenuineIntel byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: None.None libhdf5: 1.10.4 libnetcdf: 4.6.2 xarray: 0.12.2 pandas: 0.25.1 numpy: 1.17.1 scipy: 1.3.1 netCDF4: 1.5.1.2 pydap: installed h5netcdf: None h5py: None Nio: None zarr: 2.3.2 cftime: 1.0.3.4 nc_time_axis: None PseudonetCDF: None rasterio: None cfgrib: None iris: None bottleneck: None dask: 2.3.0 distributed: 2.5.1 matplotlib: 3.1.1 cartopy: None seaborn: None numbagg: None setuptools: 41.2.0 pip: 19.2.3 conda: None pytest: 5.1.2 IPython: 7.8.0 sphinx: None

jhamman commented 4 years ago

Hi @jsadler2 - Can you show us what sm_from_zarr looks like (print(sm_from_zarr)) will do.

jsadler2 commented 4 years ago

Sure

In [5]: print(ds)
<xarray.Dataset>
Dimensions:     (datetime: 20, site_code: 20)
Coordinates:
  * datetime    (datetime) datetime64[ns] 1970-01-01 ... 1970-01-01T19:00:00
  * site_code   (site_code) object '01302020' '01303000' ... '01315000'
Data variables:
    streamflow  (datetime, site_code) float64 dask.array<shape=(20, 20), chunksize=(20, 20)>

jhamman commented 4 years ago

Thanks @jsadler2 - I think this is a bug in xarray. We should be able to round trip the site_coordinate variable.

dnerini commented 4 years ago

I'm experiencing the same issue, which seems to be also related to one of my coordinates having object as datatype. Luckily, the workaround proposed by @jsadler2 works in my case, too.

borispf commented 4 years ago

I ran into the same issue. It seems like zarr is inserting VLenUTF8 as a filter, but the loaded data array already has that as a filter so it's trying to double encode . So another workaround is to delete the filters on site_code:

import xarray as xr

sm_from_zarr = xr.open_zarr('tmp/test_sm_zarr')
del sm_from_zarr.site_code.encoding["filters"]
sm_from_zarr.to_zarr('tmp/test_sm_zarr_from', mode='w')

andersy005 commented 3 years ago

I ran into the same issue. It seems like zarr is inserting VLenUTF8 as a filter, but the loaded data array already has that as a filter so it's trying to double encode

Indeed. And it appears that these lines are the culprits:

https://github.com/pydata/xarray/blob/83706af66c9cb42032dbc5536b30be1da38100c0/xarray/backends/zarr.py#L468-L471

ipdb> v
<xarray.Variable (x: 3)>
array(['a', 'b', 'c'], dtype=object)
ipdb> check
False
ipdb> vn
'x'
ipdb> encoding = extract_zarr_variable_encoding(v, raise_on_invalid=check, name=vn)
ipdb> encoding
{'chunks': (3,), 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': [VLenUTF8()]}

https://github.com/pydata/xarray/blob/83706af66c9cb42032dbc5536b30be1da38100c0/xarray/backends/zarr.py#L480-L482

Zarr appears to be ignoring the filter information from xarray. Zarr proceeds to extracting its own filter. As a result, we end up with two filters:

ipdb> zarr_array = self.ds.create(name, shape=shape, dtype=dtype, fill_value=fill_value, **encoding)
ipdb> self.ds['x']._meta
{'zarr_format': 2, 'shape': (3,), 'chunks': (3,), 'dtype': dtype('O'), 'compressor': {'blocksize': 0, 'clevel': 5, 'cname': 'lz4', 'id': 'blosc', 'shuffle': 1}, 'fill_value': None, 'order': 'C', 'filters': [{'id': 'vlen-utf8'}, {'id': 'vlen-utf8'}]}

ipdb> self.ds['x']._meta['filters']
[{'id': 'vlen-utf8'}, {'id': 'vlen-utf8'}]

As @borispf and @jsadler2 suggested, clearing the filters from encoding before initiating the zarr store creation works:

ipdb> enc # without filters
{'chunks': (3,), 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': []}
ipdb> zarr_array = self.ds.create(name, shape=shape, dtype=dtype, fill_value=fill_value, **enc)
ipdb> self.ds['x']._meta['filters']
[{'id': 'vlen-utf8'}]

@jhamman since you are more familiar with the internals of zarr + xarray, should we default to ignoring filter information from xarray and let zarr take care of the extraction of filter information?

jhamman commented 3 years ago

Thanks @andersy005 for the write up and for digging into this. Doesn't this seem like it could be a bug in Zarr's create method?

Hoeze commented 3 years ago

Hi, I also keep running into this issue all the time. Right now, there is no way of round-tripping xr.open_zarr().to_zarr(), also because of https://github.com/pydata/xarray/issues/5219.

The only workaround that seems to help is the following:

to_store = xrds.copy()
for var in to_store.variables:
    to_store[var].encoding.clear()

bluetyson commented 2 years ago

Thanks for that, looks like I just came across this.

RichardScottOZ commented 2 years ago

Think it worked with one variable - @Hoeze workaround was necessary for more than one.

delgadom commented 2 years ago

This has been happening a lot to me lately when writing to zarr. Thanks to @bolliger32 for the tip - this usually works like a charm:

for v in list(ds.coords.keys()):
    if ds.coords[v].dtype == object:
        ds.coords[v] = ds.coords[v].astype("unicode")

for v in list(ds.variables.keys()):
    if ds[v].dtype == object:
        ds[v] = ds[v].astype("unicode")

For whatever reason, clearing the encoding and/or using .astype(str) doesn't seem to work as reliably. I don't have a good MRE for this but hope it helps others with the same issue.

note the flag raised by @FlorisCalkoen below - don't just throw this at all your writes! there are other object types (e.g. CFTime) which you probably don't want to convert to string. This is just a patch to get around this issue for dataarrays with string coords/variables.

FlorisCalkoen commented 2 years ago

@delgadom I just noticed that your proposed solution has the side effect of also converting cftime objects (e.g., below) to unicode strings.

xarray.DataArray 'time' (time: 1) array([cftime.DatetimeNoLeap(2007, 7, 2, 12, 0, 0, 0, has_year_zero=True)], dtype=object)

I updated your lines using @Hoeze' clear function and that seems to work for now.

    for v in list(ds.coords.keys()):
        if ds.coords[v].dtype == object:
            ds[v].encoding.clear()

    for v in list(ds.variables.keys()):
        if ds[v].dtype == object:
            ds[v].encoding.clear()

delgadom commented 2 years ago

ha - yeah that's a good flag. I definitely didn't intend for that to be a universally applied patch! so probably should have included a buyer beware. but we did find that clearing the encoding doesn't always do the trick for string arrays. So a comprehensive patch will probably need to be more nuanced.

saschahofmann commented 1 year ago

My impression is that keeping the zarr encoding leads to a bunch of issues (see my issue above) or the current one. There also seems to be an issue with chunking being preserved because the array encodings arent overwritten, but I cant find that issue right now. Since all of these issues are resolved by popping the zarr encoding I am wondering what are the downsides of this and whether it'd be easier to not keep that encoding at all?

max-sixty commented 11 months ago

I find this annoying too!

I think we can fold it into https://github.com/pydata/xarray/issues/6323, though it's also common enough that maybe we leave it open for future travellers

cisaacstern commented 8 months ago

I hit this issue this week, and the workaround in https://github.com/pydata/xarray/issues/3476#issuecomment-1115045538 worked for me.

DahnJ commented 4 months ago

I run into this issue fairly commonly (usually during playing copying of a dataset xr.open_zarr(..).to_zarr(..)) and the fix in https://github.com/pydata/xarray/issues/3476#issuecomment-1115045538 works for me.

pydata / xarray