
Reading zarr gives unspecific PermissionError: Access Denied when public data has been consolidated after being written to S3 #5918


adair-kovac commented 2 years ago

This is a minor issue with an invalid data setup that's easy to get into; I'm reporting it here more for documentation than because I expect a fix.

Quick summary: if you consolidate metadata on a public zarr dataset in S3, the .zmetadata files end up permission-restricted. Reading the dataset with xarray 0.19 then fails with an unclear error message, while reading with xarray <=0.18.x works fine. Contrast this with the clear warning you get in 0.19 when .zmetadata simply doesn't exist.


How this happens

People who wrote data without consolidated=True in past versions of xarray might run into this issue, but in our case we actually did have that parameter originally.

I've previously reported a few issues where xarray writes arbitrary zarr hierarchies if variable names contain slashes, and then can't read them back properly. One consequence is that data written with consolidated=True still doesn't have .zmetadata files where xarray.open_mfdataset needs them.
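
One quick way to tell whether the .zmetadata a reader needs both exists and is actually readable is to probe it with s3fs directly. A minimal sketch, where the store root is a hypothetical placeholder:

import s3fs

# Probe the consolidated metadata key as an anonymous (public) client.
# "my-bucket/path/to/store.zarr" is a hypothetical store root.
fs = s3fs.S3FileSystem(anon=True)
key = "my-bucket/path/to/store.zarr/.zmetadata"
print(fs.exists(key))     # does the key exist at all?
print(fs.cat(key)[:80])   # raises PermissionError if the ACL is restricted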

If you try to add the .zmetadata by running

import s3fs
import zarr

fs = s3fs.S3FileSystem()
# s3_path is the root of the zarr store in the bucket
zarr.consolidate_metadata(s3fs.S3Map(s3_path, s3=fs))

directly on the cloud bucket in S3, it writes the .zmetadata files... but their permissions are restricted to the user who uploaded them, even if the bucket is public. (By default, S3 objects are private to the uploading account; a public bucket doesn't automatically make newly written objects readable.)
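
If the bucket accepts object ACLs, one way around this is to ask s3fs to apply a public-read ACL when it writes. A minimal sketch, where the store path is hypothetical and s3_additional_kwargs is forwarded by s3fs to the underlying S3 calls:

import s3fs
import zarr

# Apply a public-read ACL to objects written through this filesystem,
# so the new .zmetadata is readable by anonymous clients.
# "my-bucket/path/to/store.zarr" is a hypothetical store root.
fs = s3fs.S3FileSystem(s3_additional_kwargs={"ACL": "public-read"})
zarr.consolidate_metadata(s3fs.S3Map("my-bucket/path/to/store.zarr", s3=fs))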


Why it's a problem

It would be nice if, when xarray reads this data and finds it can access the data but not any usable .zmetadata, it emitted the same warning it does when .zmetadata doesn't exist. Instead it fails with an uncaught PermissionError: Access Denied, and nothing in the output tells the user that this is just a .zmetadata problem and that the data is still reachable by passing consolidated=False.
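
A fix would presumably mirror the existing KeyError branch visible in the traceback below. A minimal sketch of the idea, written as a standalone helper rather than xarray's actual code:

import warnings
import zarr

def open_zarr_group_with_fallback(store, **open_kwargs):
    # Hedged sketch, not xarray's implementation: treat an unreadable
    # .zmetadata (PermissionError) like a missing one (KeyError) and
    # fall back to non-consolidated metadata with a warning.
    try:
        return zarr.open_consolidated(store, **open_kwargs)
    except (KeyError, PermissionError):
        warnings.warn(
            "Failed to open Zarr store with consolidated metadata; "
            "falling back to non-consolidated metadata.",
            RuntimeWarning,
            stacklevel=2,
        )
        return zarr.open_group(store, **open_kwargs)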

Another problem with this situation is that data that reads just fine in xarray 0.18.x, without even a warning, suddenly gives Access Denied from the same code once you update to xarray 0.19.


Workaround

If you're trying to read a dataset that has this issue, you can get the same behavior as in previous versions of xarray like so:

import xarray as xr

# s3_lookups is the same list of S3 store mappings as in the stack trace below
test_data = xr.open_mfdataset(s3_lookups, engine="zarr", consolidated=False)

The data loads fine, just more slowly than if the .zmetadata were accessible.
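
If you don't know in advance whether a given store has this problem, you can fall back only when the consolidated read fails. A short sketch reusing the same s3_lookups:

import xarray as xr

# Prefer consolidated metadata; fall back when .zmetadata exists but is
# unreadable (the PermissionError case described above).
try:
    test_data = xr.open_mfdataset(s3_lookups, engine="zarr")
except PermissionError:
    test_data = xr.open_mfdataset(s3_lookups, engine="zarr", consolidated=False)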


Stacktrace

---------------------------------------------------------------------------
ClientError                               Traceback (most recent call last)
/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/s3fs/core.py in _call_s3(self, method, *akwarglist, **kwargs)
    245             try:
--> 246                 out = await method(**additional_kwargs)
    247                 return out

/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/aiobotocore/client.py in _make_api_call(self, operation_name, api_params)
    154             error_class = self.exceptions.from_code(error_code)
--> 155             raise error_class(parsed_response, operation_name)
    156         else:

ClientError: An error occurred (AccessDenied) when calling the GetObject operation: Access Denied

The above exception was the direct cause of the following exception:

PermissionError                           Traceback (most recent call last)
/var/folders/xf/xwjm3rj52ls9780rvrbbb9tm0000gn/T/ipykernel_84970/645651303.py in <module>
     16 
     17 # Look up the data
---> 18 test_data = xr.open_mfdataset(s3_lookups, engine="zarr")
     19 test_data

/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/xarray/backends/api.py in open_mfdataset(paths, chunks, concat_dim, compat, preprocess, engine, data_vars, coords, combine, parallel, join, attrs_file, combine_attrs, **kwargs)
    911         getattr_ = getattr
    912 
--> 913     datasets = [open_(p, **open_kwargs) for p in paths]
    914     closers = [getattr_(ds, "_close") for ds in datasets]
    915     if preprocess is not None:

/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/xarray/backends/api.py in <listcomp>(.0)
    911         getattr_ = getattr
    912 
--> 913     datasets = [open_(p, **open_kwargs) for p in paths]
    914     closers = [getattr_(ds, "_close") for ds in datasets]
    915     if preprocess is not None:

/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/xarray/backends/api.py in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, backend_kwargs, *args, **kwargs)
    495 
    496     overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 497     backend_ds = backend.open_dataset(
    498         filename_or_obj,
    499         drop_variables=drop_variables,

/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/xarray/backends/zarr.py in open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta, group, mode, synchronizer, consolidated, chunk_store, storage_options, stacklevel, lock)
    824 
    825         filename_or_obj = _normalize_path(filename_or_obj)
--> 826         store = ZarrStore.open_group(
    827             filename_or_obj,
    828             group=group,

/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/xarray/backends/zarr.py in open_group(cls, store, mode, synchronizer, group, consolidated, consolidate_on_close, chunk_store, storage_options, append_dim, write_region, safe_chunks, stacklevel)
    367         if consolidated is None:
    368             try:
--> 369                 zarr_group = zarr.open_consolidated(store, **open_kwargs)
    370             except KeyError:
    371                 warnings.warn(

/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/zarr/convenience.py in open_consolidated(store, metadata_key, mode, **kwargs)
   1176 
   1177     # setup metadata store
-> 1178     meta_store = ConsolidatedMetadataStore(store, metadata_key=metadata_key)
   1179 
   1180     # pass through

/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/zarr/storage.py in __init__(self, store, metadata_key)
   2767 
   2768         # retrieve consolidated metadata
-> 2769         meta = json_loads(store[metadata_key])
   2770 
   2771         # check format of consolidated metadata

/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/fsspec/mapping.py in __getitem__(self, key, default)
    131         k = self._key_to_str(key)
    132         try:
--> 133             result = self.fs.cat(k)
    134         except self.missing_exceptions:
    135             if default is not None:

/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/fsspec/asyn.py in wrapper(*args, **kwargs)
     86     def wrapper(*args, **kwargs):
     87         self = obj or args[0]
---> 88         return sync(self.loop, func, *args, **kwargs)
     89 
     90     return wrapper

/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/fsspec/asyn.py in sync(loop, func, timeout, *args, **kwargs)
     67         raise FSTimeoutError
     68     if isinstance(result[0], BaseException):
---> 69         raise result[0]
     70     return result[0]
     71 

/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/fsspec/asyn.py in _runner(event, coro, result, timeout)
     23         coro = asyncio.wait_for(coro, timeout=timeout)
     24     try:
---> 25         result[0] = await coro
     26     except Exception as ex:
     27         result[0] = ex

/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/fsspec/asyn.py in _cat(self, path, recursive, on_error, **kwargs)
    342             ex = next(filter(is_exception, out), False)
    343             if ex:
--> 344                 raise ex
    345         if (
    346             len(paths) > 1

/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/s3fs/core.py in _cat_file(self, path, version_id, start, end)
    849         else:
    850             head = {}
--> 851         resp = await self._call_s3(
    852             "get_object",
    853             Bucket=bucket,

/opt/anaconda3/envs/gefs/lib/python3.9/site-packages/s3fs/core.py in _call_s3(self, method, *akwarglist, **kwargs)
    263             except Exception as e:
    264                 err = e
--> 265         raise translate_boto_error(err)
    266 
    267     call_s3 = sync_wrapper(_call_s3)

PermissionError: Access Denied

rabernat commented 2 years ago

Maybe @martindurant has some insights?

martindurant commented 2 years ago

Some thoughts: