pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Uncompressed Zarr arrays can no longer be written to Zarr #4681

Open · forman opened this issue 3 years ago

forman commented 3 years ago

What happened:

We create xarray.Dataset instances using xr.open_zarr(store) with custom chunk store instances that lazily fetch data chunks for data variables from the Sentinel Hub API. For the coordinate variables lon, lat, and time we use "static" store entries: uncompressed numpy arrays serialized to bytes.
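For illustration, a static uncompressed coordinate entry of this kind might look roughly as follows (a minimal sketch only; the key layout follows the Zarr v2 spec, and the values shown are invented, not taken from our actual store):

```python
import json
import numpy as np

# Sketch of a "static" store entry for an uncompressed coordinate (Zarr v2).
# The .zarray metadata declares neither compressor nor filters, and the
# single chunk is the raw ndarray buffer, serialized to bytes up front.
lon = np.arange(53.0005, 55.0485, 0.001)
store = {}
store["lon/.zarray"] = json.dumps({
    "zarr_format": 2,
    "shape": [lon.size],
    "chunks": [lon.size],
    "dtype": lon.dtype.str,
    "order": "C",
    "compressor": None,   # uncompressed
    "filters": None,
    "fill_value": None,
}).encode()
store["lon/.zattrs"] = json.dumps({"_ARRAY_DIMENSIONS": ["lon"]}).encode()
store["lon/0"] = lon.tobytes()  # numpy array serialized to bytes
```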

Since xarray 0.16.2 and Zarr 2.6.1, this approach no longer works. When we write datasets opened from such a store using xr.Dataset.to_zarr(dst_store), e.g. with dst_store=s3fs.S3Map(), we get encoding errors. For example, for a coordinate array lon we get the following from botocore:

Invalid type for parameter Body, value: [55.0475 55.0465 55.0455 ... 53.0025 53.0015 53.0005], type: <class 'numpy.ndarray'>, valid types: <class 'bytes'>, <class 'bytearray'>, file-like object

(Full traceback is below.) It seems that our static numpy arrays won't be encoded at all, because they are uncompressed. If we use a compressor, it works again. (That's our current workaround.)
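Concretely, the workaround is to force a compressor via the encoding argument (a sketch, with ds and dst_store standing for the dataset and target store from above; the Zlib codec and level are arbitrary choices):

```python
import numcodecs

# Workaround sketch: force a compressor per variable so that Zarr
# encodes each chunk to bytes before it reaches the chunk store.
ds.to_zarr(
    dst_store,
    encoding={name: {"compressor": numcodecs.Zlib(level=1)}
              for name in ds.variables},
)
```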

What you expected to happen:

Before data is written into a Zarr chunk store, it must be encoded from numpy arrays to bytes. This does not seem to happen if uncompressed data is written, that is, if the Zarr encoding's compressor and filters are both None.

Minimal Complete Verifiable Example:

A minimal, self-contained example is the entire test module test_reprod_27.py of the xcube Sentinel Hub plugin xcube-sh.

Original issue in the Sentinel Hub xcube plugin is xcube-sh #27.

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.8.6 | packaged by conda-forge | (default, Nov 27 2020, 18:58:29) [MSC v.1916 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 26 Stepping 5, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: de_DE.cp1252
libhdf5: 1.10.6
libnetcdf: 4.7.4
xarray: 0.16.2
pandas: 1.1.5
numpy: 1.19.4
scipy: 1.5.3
netCDF4: 1.5.5
pydap: installed
h5netcdf: None
h5py: None
Nio: None
zarr: 2.6.1
cftime: 1.3.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.1.5
cfgrib: None
iris: None
bottleneck: None
dask: 2.30.0
distributed: 2.30.1
matplotlib: 3.3.3
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 49.6.0.post20201009
pip: 20.3.1
conda: None
pytest: 6.1.2
IPython: 7.19.0
sphinx: 3.3.1

Traceback:

File "D:\Projects\xcube\xcube\cli\_gen2\write.py", line 47, in write_cube
data_id = writer.write_data(cube,
File "D:\Projects\xcube\xcube\core\store\stores\s3.py", line 213, in write_data
self._new_s3_writer(writer_id).write_data(data, data_id=path, replace=replace, **write_params)
File "D:\Projects\xcube\xcube\core\store\accessors\dataset.py", line 313, in write_data
data.to_zarr(s3fs.S3Map(root=f'{bucket_name}/{data_id}' if bucket_name else data_id,
File "D:\Miniconda3\envs\xcube\lib\site-packages\xarray\core\dataset.py", line 1745, in to_zarr
return to_zarr(
File "D:\Miniconda3\envs\xcube\lib\site-packages\xarray\backends\api.py", line 1481, in to_zarr
dump_to_store(dataset, zstore, writer, encoding=encoding)
File "D:\Miniconda3\envs\xcube\lib\site-packages\xarray\backends\api.py", line 1158, in dump_to_store
store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
File "D:\Miniconda3\envs\xcube\lib\site-packages\xarray\backends\zarr.py", line 473, in store
self.set_variables(
File "D:\Miniconda3\envs\xcube\lib\site-packages\xarray\backends\zarr.py", line 549, in set_variables
writer.add(v.data, zarr_array, region)
File "D:\Miniconda3\envs\xcube\lib\site-packages\xarray\backends\common.py", line 143, in add
target[region] = source
File "D:\Miniconda3\envs\xcube\lib\site-packages\zarr\core.py", line 1122, in _setitem_
self.set_basic_selection(selection, value, fields=fields)
File "D:\Miniconda3\envs\xcube\lib\site-packages\zarr\core.py", line 1217, in set_basic_selection
return self._set_basic_selection_nd(selection, value, fields=fields)
File "D:\Miniconda3\envs\xcube\lib\site-packages\zarr\core.py", line 1508, in _set_basic_selection_nd
self._set_selection(indexer, value, fields=fields)
File "D:\Miniconda3\envs\xcube\lib\site-packages\zarr\core.py", line 1580, in _set_selection
self._chunk_setitems(lchunk_coords, lchunk_selection, chunk_values,
File "D:\Miniconda3\envs\xcube\lib\site-packages\zarr\core.py", line 1709, in _chunk_setitems
self.chunk_store.setitems({k: v for k, v in zip(ckeys, cdatas)})
File "D:\Miniconda3\envs\xcube\lib\site-packages\fsspec\mapping.py", line 110, in setitems
self.fs.pipe(values)
File "D:\Miniconda3\envs\xcube\lib\site-packages\fsspec\asyn.py", line 121, in wrapper
return maybe_sync(func, self, *args, **kwargs)
File "D:\Miniconda3\envs\xcube\lib\site-packages\fsspec\asyn.py", line 100, in maybe_sync
return sync(loop, func, *args, **kwargs)
File "D:\Miniconda3\envs\xcube\lib\site-packages\fsspec\asyn.py", line 71, in sync
raise exc.with_traceback(tb)
File "D:\Miniconda3\envs\xcube\lib\site-packages\fsspec\asyn.py", line 55, in f
result[0] = await future
File "D:\Miniconda3\envs\xcube\lib\site-packages\fsspec\asyn.py", line 211, in _pipe
await asyncio.gather(
File "D:\Miniconda3\envs\xcube\lib\site-packages\s3fs\core.py", line 608, in _pipe_file
return await self._call_s3(
File "D:\Miniconda3\envs\xcube\lib\site-packages\s3fs\core.py", line 225, in _call_s3
raise translate_boto_error(err) from err
File "D:\Miniconda3\envs\xcube\lib\site-packages\s3fs\core.py", line 207, in _call_s3
return await method(**additional_kwargs)
File "D:\Miniconda3\envs\xcube\lib\site-packages\aiobotocore\client.py", line 123, in _make_api_call
request_dict = await self._convert_to_request_dict(
File "D:\Miniconda3\envs\xcube\lib\site-packages\aiobotocore\client.py", line 171, in _convert_to_request_dict
request_dict = self._serializer.serialize_to_request(
File "D:\Miniconda3\envs\xcube\lib\site-packages\botocore\validate.py", line 297, in serialize_to_request
raise ParamValidationError(report=report.generate_report())

Invalid type for parameter Body, value: [55.0475 55.0465 55.0455 ... 53.0025 53.0015 53.0005], type: <class 'numpy.ndarray'>, valid types: <class 'bytes'>, <class 'bytearray'>, file-like object
forman commented 3 years ago

After debugging, we found that zarr.core.Array._encode_chunk() does not encode chunks if both compressor and filters are missing.
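Schematically, the relevant logic looks like this (a simplified paraphrase of zarr 2.6.x, not a verbatim copy of the method):

```python
# Paraphrase of zarr.core.Array._encode_chunk() in zarr 2.6.x: when both
# filters and compressor are None, the chunk falls through unencoded,
# so the store receives a numpy ndarray instead of bytes.
def _encode_chunk(self, chunk):
    if self._filters:
        for f in self._filters:
            chunk = f.encode(chunk)
    if self._compressor:
        cdata = self._compressor.encode(chunk)
    else:
        cdata = chunk  # ndarray passed through as-is
    return cdata
```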

However, I could not reproduce the problem with Zarr open/save alone. It seems to occur only when using xr.open_zarr() and xr.Dataset.to_zarr(). Therefore it seems to be an xarray issue rather than a Zarr one.
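For reference, the xarray path we use looks like this (an untested sketch, with an invented bucket name; it is meant to show the round trip, not to be a verified reproducer):

```python
import numpy as np
import s3fs
import xarray as xr

# Untested sketch: write an uncompressed coordinate, reopen with xarray,
# then write to an S3 mapper, which requires bytes-like chunk values.
ds = xr.Dataset(coords={"lon": ("lon", np.arange(53.0, 55.0, 0.001))})
src_store = {}
ds.to_zarr(src_store, encoding={"lon": {"compressor": None}})

dst_store = s3fs.S3Map(root="my-bucket/repro.zarr", s3=s3fs.S3FileSystem())
xr.open_zarr(src_store).to_zarr(dst_store)  # raises ParamValidationError
```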

max-sixty commented 11 months ago

This is from a while ago now, sorry it didn't get much attention originally.

To the extent this is still an issue — does passing kwargs to the store allow this to work? This is a new-ish feature of .to_zarr:


chunkmanager_store_kwargs : dict, optional
    Additional keyword arguments passed on to the `ChunkManager.store` method used to store
    chunked arrays. For example for a dask array additional kwargs will be passed eventually to
    :py:func:`dask.array.store()`. Experimental API that should not be relied upon.
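Usage would look something like this (a sketch; lock=False is just an illustrative dask.array.store keyword, not a suggested fix):

```python
# Sketch: forward extra kwargs to the chunk manager's store call
# (for dask-backed variables these end up in dask.array.store).
ds.to_zarr(
    dst_store,
    chunkmanager_store_kwargs={"lock": False},
)
```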

I think xarray should plausibly allow this level of customization by letting folks pass args through to the underlying library, even if xarray doesn't support it natively.