pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

S3 read errors are sometimes ignored #5910

Open lnicola opened 2 years ago

lnicola commented 2 years ago

What happened:

I have a Zarr dataset on S3 that probably has wrong ACLs set on its time data chunks. zarr correctly complains about not being able to read it, but xarray tries to convert uninitialized memory to dates.

This seems specific to S3; I couldn't reproduce it with a local dataset. Sorry for the missing MCVE, but S3 makes this non-trivial.

Anything else we need to know?:

>>> ds = xr.open_zarr(store)
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/xarray/coding/times.py", line 236, in decode_cf_datetime
    dates = _decode_datetime_with_pandas(flat_num_dates, units, calendar)
  File "/usr/lib/python3.9/site-packages/xarray/coding/times.py", line 192, in _decode_datetime_with_pandas
    pd.to_timedelta(flat_num_dates.min(), delta) + ref_date
  File "/usr/lib/python3.9/site-packages/pandas/core/tools/timedeltas.py", line 142, in to_timedelta
    return _coerce_scalar_to_timedelta_type(arg, unit=unit, errors=errors)
  File "/usr/lib/python3.9/site-packages/pandas/core/tools/timedeltas.py", line 150, in _coerce_scalar_to_timedelta_type
    result = Timedelta(r, unit)
  File "pandas/_libs/tslibs/timedeltas.pyx", line 1311, in pandas._libs.tslibs.timedeltas.Timedelta.__new__
  File "pandas/_libs/tslibs/timedeltas.pyx", line 288, in pandas._libs.tslibs.timedeltas.convert_to_timedelta64
  File "pandas/_libs/tslibs/conversion.pyx", line 125, in pandas._libs.tslibs.conversion.cast_from_unit
OverflowError: Python int too large to convert to C long

>>> ds = xr.open_zarr(store, decode_times=False)
>>> ds.time
<xarray.DataArray 'time' (time: 328)>
array([     140573299617952,      140573299617952,       94122424261888, ...,
            140572816875120, -7159070105542061283,      140572816872816])
Coordinates:
  * time     (time) int64 140573299617952 140573299617952 ... 140572816872816
Attributes:
    calendar:  proleptic_gregorian
    units:     seconds since 2000-01-01
>>> ds = xr.open_zarr(store, decode_times=False)
>>> ds.time
<xarray.DataArray 'time' (time: 328)>
array([140573299617808, 140573299617808,  94122424160064, ..., 140572848003120,
       140572848003248, 140572848003376])
Coordinates:
  * time     (time) int64 140573299617808 140573299617808 ... 140572848003376
Attributes:
    calendar:  proleptic_gregorian
    units:     seconds since 2000-01-01

Notice the different time values:

 int64 140573299617952 140573299617952 ... 140572816872816
 int64 140573299617808 140573299617808 ... 140572848003376
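
As a rough check on why decoding fails with an OverflowError, the sketch below takes a few of the values printed above and tests whether each would fit in int64 nanoseconds once interpreted as "seconds since 2000-01-01". It assumes the time offset is stored as an int64 nanosecond count (which matches the `cast_from_unit` frame in the traceback); the helper name `fits_in_int64_ns` is made up for illustration. The hex column is only there to show that the values look more like memory addresses than timestamps, which would be consistent with uninitialized memory.

```python
# Values copied from the two `ds.time` printouts above.
first_read = [140573299617952, 94122424261888, -7159070105542061283]
second_read = [140573299617808, 94122424160064, 140572848003376]

INT64_MAX = 2**63 - 1
NS_PER_SECOND = 10**9  # the units attribute is "seconds since 2000-01-01"


def fits_in_int64_ns(seconds):
    """Hypothetical helper: True if `seconds` fits in int64 nanoseconds."""
    return abs(seconds * NS_PER_SECOND) <= INT64_MAX


for value in first_read + second_read:
    # Mask to 64 bits so negative values print as their two's-complement form.
    as_hex = f"{value & (2**64 - 1):#018x}"
    print(f"{value:>22}  {as_hex}  fits_ns={fits_in_int64_ns(value)}")

# Largest whole-second offset that an int64 nanosecond count can hold.
max_seconds = INT64_MAX // NS_PER_SECOND
print("max offset:", max_seconds, "s, roughly", max_seconds // (365 * 86400), "years")
```

None of the sampled values fits, so any of them reaching the time decoder would be expected to overflow.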

zarr reports the error correctly:

>>> ds = zarr.open(store)
>>> ds.time
Traceback (most recent call last):
  File "/home/me/.local/lib/python3.9/site-packages/s3fs/core.py", line 248, in _call_s3
    out = await method(**additional_kwargs)
  File "/home/me/.local/lib/python3.9/site-packages/aiobotocore/client.py", line 155, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the GetObject operation: Access Denied

Environment:

Output of xr.show_versions()

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.9.7 (default, Oct 10 2021, 15:13:22)
[GCC 11.1.0]
python-bits: 64
OS: Linux
OS-release: 5.14.14-arch1-1
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: ('en_GB', 'UTF-8')
libhdf5: 1.12.1
libnetcdf: None
xarray: 0.19.0
pandas: 1.3.3
numpy: 1.21.2
scipy: 1.7.1
netCDF4: None
pydap: None
h5netcdf: None
h5py: 3.5.0
Nio: None
zarr: 2.10.2
cftime: 1.5.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2021.09.0
distributed: None
matplotlib: 3.4.3
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 57.4.0
pip: 20.3.4
conda: None
pytest: None
IPython: 7.28.0
sphinx: None
```

lnicola commented 2 years ago

I traced this as far as ZarrBackendEntrypoint. StoreBackendEntrypoint seems to report Access Denied, but if I call ZarrBackendEntrypoint.open_dataset() I get the conversion error. It might be related to the ZarrStore.open_groups parameters.
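
One possible way to get closer to a local MCVE without S3 is a store wrapper that serves metadata keys normally but fails on chunk reads. The sketch below is untested against zarr or xarray. The class name `DenyChunkReads` is hypothetical, and it assumes Zarr v2 key naming, where metadata keys end in names such as `.zarray` and `.zattrs`. Whether a `PermissionError` raised here takes the same code path as the S3 `AccessDenied` error (which goes through s3fs) is also an assumption.

```python
from collections.abc import MutableMapping


class DenyChunkReads(MutableMapping):
    """Hypothetical wrapper: metadata reads pass through, chunk reads under `prefix` fail."""

    def __init__(self, inner, prefix):
        self.inner = inner    # any dict-like store
        self.prefix = prefix  # e.g. "time/"

    def _is_chunk(self, key):
        # Assumes Zarr v2 naming: metadata keys end in ".zarray", ".zattrs", etc.
        name = key.rsplit("/", 1)[-1]
        return key.startswith(self.prefix) and not name.startswith(".z")

    def __getitem__(self, key):
        if self._is_chunk(key):
            raise PermissionError(f"Access Denied: {key}")
        return self.inner[key]

    def __contains__(self, key):
        # Existence checks succeed even for unreadable chunks.
        return key in self.inner

    def __setitem__(self, key, value):
        self.inner[key] = value

    def __delitem__(self, key):
        del self.inner[key]

    def __iter__(self):
        return iter(self.inner)

    def __len__(self):
        return len(self.inner)


# Stand-in contents; a real test would wrap an actual Zarr store mapping.
demo = DenyChunkReads({"time/.zarray": b"{}", "time/0": b"\x00" * 8}, prefix="time/")
print(demo["time/.zarray"])
try:
    demo["time/0"]
except PermissionError as err:
    print("chunk read failed:", err)
```

If this does reproduce the problem, the wrapped store could be passed wherever `store` is used in the sessions above, so that both libraries can be compared on the same failing chunks.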