ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Datasets] [Bug] Access error when reading public data from S3 if no local AWS credentials are configured #19799

Closed: robertnishihara closed this issue 2 years ago

robertnishihara commented 3 years ago

Ray Component

Others

What happened + What you expected to happen

I ran the example on this page https://www.ray.io/ray-datasets

In particular

import ray

# read parquet from S3
parquet_path = "s3://ursa-labs-taxi-data/2019/06/data.parquet"
ds = ray.data.read_parquet(parquet_path)

It failed with

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-2-db174e281e4a> in <module>
      1 parquet_path = "s3://ursa-labs-taxi-data/2019/06/data.parquet"
----> 2 ds = ray.data.read_parquet(parquet_path)

~/opt/anaconda3/lib/python3.7/site-packages/ray/data/read_api.py in read_parquet(paths, filesystem, columns, parallelism, ray_remote_args, **arrow_parquet_args)
    219         columns=columns,
    220         ray_remote_args=ray_remote_args,
--> 221         **arrow_parquet_args)
    222 
    223 

~/opt/anaconda3/lib/python3.7/site-packages/ray/data/read_api.py in read_datasource(datasource, parallelism, ray_remote_args, **read_args)
    149     """
    150 
--> 151     read_tasks = datasource.prepare_read(parallelism, **read_args)
    152 
    153     def remote_read(task: ReadTask) -> Block:

~/opt/anaconda3/lib/python3.7/site-packages/ray/data/datasource/parquet_datasource.py in prepare_read(self, parallelism, paths, filesystem, columns, schema, **reader_args)
     38 
     39         paths, file_infos, filesystem = _resolve_paths_and_filesystem(
---> 40             paths, filesystem)
     41         file_sizes = [file_info.size for file_info in file_infos]
     42 

~/opt/anaconda3/lib/python3.7/site-packages/ray/data/datasource/file_based_datasource.py in _resolve_paths_and_filesystem(paths, filesystem)
    195     file_infos = []
    196     for path in resolved_paths:
--> 197         file_info = filesystem.get_file_info(path)
    198         if file_info.type == FileType.Directory:
    199             paths, file_infos_ = _expand_directory(path, filesystem)

~/opt/anaconda3/lib/python3.7/site-packages/pyarrow/_fs.pyx in pyarrow._fs.FileSystem.get_file_info()

~/opt/anaconda3/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

~/opt/anaconda3/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

OSError: When getting information for key '2019/06/data.parquet' in bucket 'ursa-labs-taxi-data': AWS Error [code 15]: No response body.

Versions / Dependencies

Ray: 1.6.0, PyArrow: 4.0.1, Python: 3.7.4, OS: macOS 10.15.7

Reproduction script

Included above

Anything else

No response

clarkzinzow commented 2 years ago

Tried this with the latest master and it worked without any issues; closing for now.

clarkzinzow commented 2 years ago

Reopening since it looks like this can get triggered if no local AWS credentials are configured, even though this is a public bucket. cc @dmatrix

dmatrix commented 2 years ago

It seems to get triggered in two scenarios: 1) no local AWS credentials exist, or 2) the AWS credentials are expired and need to be renewed.
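
A minimal sketch for checking which of the two scenarios applies locally (this assumes boto3 is installed; the STS call is just one way to surface expired credentials):

# Sketch: check whether local AWS credentials exist and are usable.
import boto3

creds = boto3.Session().get_credentials()
if creds is None:
    print("No local AWS credentials configured (scenario 1)")
else:
    try:
        # An expired SSO/STS session fails on any real API call.
        boto3.client("sts").get_caller_identity()
        print("Credentials look valid")
    except Exception as exc:
        print(f"Credentials present but unusable, possibly expired (scenario 2): {exc}")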

clarkzinzow commented 2 years ago

Very strange given that this is a public bucket that should support anonymous access!

dmatrix commented 2 years ago

Yes, indeed. That's what baffles me too.

richardliaw commented 2 years ago

Wonder if this is also fixed with #25644 / https://github.com/ray-project/ray/pull/25673

richardliaw commented 2 years ago

Does it also happen with pandas?

dmatrix commented 2 years ago

Pandas works after installing boto and s3fs (pip install boto s3fs):

import pandas as pd  # s3fs (installed above) provides the s3:// filesystem for pandas

if __name__ == "__main__":
    df = pd.read_parquet("s3://ursa-labs-taxi-data/2009/01/data.parquet")
    print(df.count())
vendor_id             14092413
pickup_at             14092413
dropoff_at            14092413
passenger_count       14092413
trip_distance         14092413
pickup_longitude      14092413
pickup_latitude       14092413
rate_code_id                 0
store_and_fwd_flag        1224
dropoff_longitude     14092413
dropoff_latitude      14092413
payment_type          14092413
fare_amount           14092413
extra                 14092413
mta_tax                      0
tip_amount            14092413
tolls_amount          14092413
total_amount          14092413
dtype: int64
dmatrix commented 2 years ago

The same code with Ray Data fails; it only works after running aws sso login. That defeats the purpose of a public bucket, which should be readable anonymously. One shouldn't have to authenticate.

import ray

if __name__ == "__main__":
    ray.init()

    ny_taxi_ds = ray.data.read_parquet("s3://ursa-labs-taxi-data/2009/01/data.parquet")
    print(ny_taxi_ds.count())
/usr/local/anaconda3/envs/ray-summit-training/bin/python /Users/jules/git-repos/misc-code/py/ray/ray_datasets/nyc_taxi.py
2022-06-17 17:42:08,550 INFO services.py:1470 -- View the Ray dashboard at http://127.0.0.1:8265
Traceback (most recent call last):
  File "/Users/jules/git-repos/misc-code/py/ray/ray_datasets/nyc_taxi.py", line 6, in <module>
    ny_taxi_ds = ray.data.read_parquet("s3://ursa-labs-taxi-data/2009/01/data.parquet")
  File "/usr/local/anaconda3/envs/ray-summit-training/lib/python3.8/site-packages/ray/data/read_api.py", line 324, in read_parquet
    return read_datasource(
  File "/usr/local/anaconda3/envs/ray-summit-training/lib/python3.8/site-packages/ray/data/read_api.py", line 240, in read_datasource
    read_tasks = ray.get(
  File "/usr/local/anaconda3/envs/ray-summit-training/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/anaconda3/envs/ray-summit-training/lib/python3.8/site-packages/ray/worker.py", line 1831, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(OSError): ray::_prepare_read() (pid=97014, ip=127.0.0.1)
  File "/usr/local/anaconda3/envs/ray-summit-training/lib/python3.8/site-packages/ray/data/read_api.py", line 1041, in _prepare_read
    return ds.prepare_read(parallelism, **kwargs)
  File "/usr/local/anaconda3/envs/ray-summit-training/lib/python3.8/site-packages/ray/data/datasource/parquet_datasource.py", line 102, in prepare_read
    pq_ds = pq.ParquetDataset(
  File "/usr/local/anaconda3/envs/ray-summit-training/lib/python3.8/site-packages/pyarrow/parquet.py", line 1309, in __new__
    return _ParquetDatasetV2(
  File "/usr/local/anaconda3/envs/ray-summit-training/lib/python3.8/site-packages/pyarrow/parquet.py", line 1698, in __init__
    if filesystem.get_file_info(path_or_paths).is_file:
  File "pyarrow/_fs.pyx", line 439, in pyarrow._fs.FileSystem.get_file_info
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
OSError: When getting information for key '2009/01/data.parquet' in bucket 'ursa-labs-taxi-data': AWS Error [code 100]: No response body.
c21 commented 2 years ago

From my understanding, the S3 file s3://ursa-labs-taxi-data/2019/06/data.parquet is set to public, and the following code should now work:

>>> import ray
>>> ds = ray.data.read_parquet("s3://ursa-labs-taxi-data/2019/06/data.parquet")
2022-07-11 13:21:18,352 WARNING read_api.py:260 -- The number of blocks in this dataset (1) limits its parallelism to 1 concurrent tasks. This is much less than the number of available CPU slots in the cluster. Use `.repartition(n)` to increase the number of dataset blocks.
>>> ds.count()
6941024
pcmoritz commented 2 years ago

It looks like this now works without credentials for read_parquet (possibly; I'm not 100% sure -- I tried to be as careful as possible to remove all credentials, but I can't be sure none are left).

This does not, however, currently work for ray.data.read_binary_files, it seems.

EDIT: This was wrong -- ray.data.read_binary_files works on the taxi dataset without credentials too. It doesn't work on one of our own datasets even though it is publicly accessible (e.g. via https). Some bucket policy might be configured incorrectly.

EDIT: We figured it out: the bucket also needs to allow the action "s3:ListBucket" for the principal "*" -- before, it only had "s3:GetObject" and "s3:GetObjectVersion". After the change, access works without credentials.
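
For reference, a minimal sketch of a bucket policy granting that anonymous access, applied with boto3 (the bucket name here is a placeholder, and this assumes you administer the bucket):

# Sketch: allow anonymous listing + reads on a public bucket.
# "my-public-bucket" is a placeholder bucket name.
import json
import boto3

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": "*",
        "Action": ["s3:GetObject", "s3:GetObjectVersion", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::my-public-bucket",    # s3:ListBucket targets the bucket itself
            "arn:aws:s3:::my-public-bucket/*",  # the Get* actions target the objects
        ],
    }],
}
boto3.client("s3").put_bucket_policy(Bucket="my-public-bucket", Policy=json.dumps(policy))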

pcmoritz commented 2 years ago

Btw, do we understand what the difference is between what @c21 and I are observing and what @dmatrix observed before? Was there a code change on our side? Different pyarrow versions?

clarkzinzow commented 2 years ago

@pcmoritz Thank you for making that change to the bucket policy!

Btw, do we understand what the difference is between what @c21 and I are observing and what @dmatrix observed before? Was there a code change on our side? Different pyarrow versions?

I don't think we understand that discrepancy yet, right @c21?

c21 commented 2 years ago

I don't think we understand that discrepancy yet, right @c21?

@clarkzinzow - I don't know, given that @pcmoritz and I are seeing the same behavior.

~Ping again - @dmatrix could you help us try the example (e.g. https://github.com/ray-project/ray/issues/19799#issuecomment-1180830683) again in your environment? Thanks.~

pcmoritz commented 2 years ago

I followed up with @dmatrix offline and currently everything is working in his environment, but he will let us know if the error happens again.

Right now, I can only repro the problem in one setting, which is the CI (https://github.com/ray-project/ray/pull/26482). Will keep digging to see if I can find out more.

pcmoritz commented 2 years ago

I also found this, which seems to be the same issue: https://github.com/ray-project/ray/issues/18102

clarkzinzow commented 2 years ago

@richardliaw What's the current CI failure, do you have a traceback?

There's a code 15 error (access denied) and a code 100 error (unknown error), and we've at least fixed the non-CI cases of the latter.

pcmoritz commented 2 years ago

I think I now fully understand what is going on here -- after logging into our CI machine and running

(base) root@a050ee0b6e70:/ray# aws s3 ls s3://air-example-data/

An error occurred (AccessDenied) when calling the ListObjectsV2 operation: Access Denied

Even though the bucket is public, the role of the CI machine does not grant access to S3 (no access at all). So since pyarrow picks up the credentials from that role, it will not allow us to access even that public bucket.

I think the follow-up items to remove this class of problem for users are: (a) improve the error message (it should say explicitly that there is a permission error instead of "No response body" -- ideally it would also say which method it tried to call on the bucket for which the permission error happened); (b) document that we pick up S3 credentials and how an anonymous user can be used; (c) document which privileges people need to give to their buckets so they can read their data with our library. Users will appreciate this since it is a little tricky to get right. A sketch of the anonymous-user approach from (b) follows.
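
As a sketch of (b): pyarrow can be told to use anonymous S3 access explicitly, and the resulting filesystem can be passed to the Ray read APIs (this assumes the bucket policy allows anonymous s3:ListBucket and s3:GetObject, as described above):

# Sketch: force anonymous S3 access regardless of local credentials or roles.
import ray
from pyarrow import fs

anon_fs = fs.S3FileSystem(anonymous=True)
# Note: with an explicit filesystem, the path is given as bucket/key.
ds = ray.data.read_parquet("ursa-labs-taxi-data/2019/06/data.parquet", filesystem=anon_fs)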

pcmoritz commented 2 years ago

For comparison, pandas gives a much better error message here -- it even tells us which operation failed so we can debug the permissions (An error occurred (AccessDenied) when calling the ListObjectsV2 operation: Access Denied):

In [3]: import pandas as pd

In [4]: df = pd.read_parquet("s3://air-example-data/ocr_tiny_dataset")
---------------------------------------------------------------------------
ClientError                               Traceback (most recent call last)
/opt/miniconda/lib/python3.7/site-packages/s3fs/core.py in _call_s3(self, method, *akwarglist, **kwargs)
    245             try:
--> 246                 out = await method(**additional_kwargs)
    247                 return out

/opt/miniconda/lib/python3.7/site-packages/aiobotocore/client.py in _make_api_call(self, operation_name, api_params)
    153             error_class = self.exceptions.from_code(error_code)
--> 154             raise error_class(parsed_response, operation_name)
    155         else:

ClientError: An error occurred (AccessDenied) when calling the ListObjectsV2 operation: Access Denied

The above exception was the direct cause of the following exception:

PermissionError                           Traceback (most recent call last)
/opt/miniconda/lib/python3.7/site-packages/s3fs/core.py in _info(self, path, bucket, key, refresh, version_id)
   1063                     try:
-> 1064                         out = await self._simple_info(path)
   1065                     except PermissionError:

/opt/miniconda/lib/python3.7/site-packages/s3fs/core.py in _simple_info(self, path)
    983             MaxKeys=1,
--> 984             **self.req_kw,
    985         )

/opt/miniconda/lib/python3.7/site-packages/s3fs/core.py in _call_s3(self, method, *akwarglist, **kwargs)
    264                 err = e
--> 265         raise translate_boto_error(err)
    266

PermissionError: Access Denied

During handling of the above exception, another exception occurred:

ClientError                               Traceback (most recent call last)
/opt/miniconda/lib/python3.7/site-packages/s3fs/core.py in _call_s3(self, method, *akwarglist, **kwargs)
    245             try:
--> 246                 out = await method(**additional_kwargs)
    247                 return out

/opt/miniconda/lib/python3.7/site-packages/aiobotocore/client.py in _make_api_call(self, operation_name, api_params)
    153             error_class = self.exceptions.from_code(error_code)
--> 154             raise error_class(parsed_response, operation_name)
    155         else:

ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden

The above exception was the direct cause of the following exception:

PermissionError                           Traceback (most recent call last)
<ipython-input-4-68e4ad4ae34f> in <module>
----> 1 df = pd.read_parquet("s3://air-example-data/ocr_tiny_dataset")

/opt/miniconda/lib/python3.7/site-packages/pandas/io/parquet.py in read_parquet(path, engine, columns, storage_options, use_nullable_dtypes, **kwargs)
    498         storage_options=storage_options,
    499         use_nullable_dtypes=use_nullable_dtypes,
--> 500         **kwargs,
    501     )

/opt/miniconda/lib/python3.7/site-packages/pandas/io/parquet.py in read(self, path, columns, use_nullable_dtypes, storage_options, **kwargs)
    238         try:
    239             result = self.api.parquet.read_table(
--> 240                 path_or_handle, columns=columns, **kwargs
    241             ).to_pandas(**to_pandas_kwargs)
    242             if manager == "array":

/opt/miniconda/lib/python3.7/site-packages/pyarrow/parquet.py in read_table(source, columns, use_threads, metadata, use_pandas_metadata, memory_map, read_dictionary, filesystem, filters, buffer_size, partitioning, use_legacy_dataset, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit)
   1913                 ignore_prefixes=ignore_prefixes,
   1914                 pre_buffer=pre_buffer,
-> 1915                 coerce_int96_timestamp_unit=coerce_int96_timestamp_unit
   1916             )
   1917         except ImportError:

/opt/miniconda/lib/python3.7/site-packages/pyarrow/parquet.py in __init__(self, path_or_paths, filesystem, filters, partitioning, read_dictionary, buffer_size, memory_map, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, **kwargs)
   1696                     except ValueError:
   1697                         filesystem = LocalFileSystem(use_mmap=memory_map)
-> 1698                 if filesystem.get_file_info(path_or_paths).is_file:
   1699                     single_file = path_or_paths
   1700             else:

/opt/miniconda/lib/python3.7/site-packages/pyarrow/_fs.pyx in pyarrow._fs.FileSystem.get_file_info()

/opt/miniconda/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

/opt/miniconda/lib/python3.7/site-packages/pyarrow/_fs.pyx in pyarrow._fs._cb_get_file_info()

/opt/miniconda/lib/python3.7/site-packages/pyarrow/fs.py in get_file_info(self, paths)
    305         for path in paths:
    306             try:
--> 307                 info = self.fs.info(path)
    308             except FileNotFoundError:
    309                 infos.append(FileInfo(path, FileType.NotFound))

/opt/miniconda/lib/python3.7/site-packages/fsspec/asyn.py in wrapper(*args, **kwargs)
     86     def wrapper(*args, **kwargs):
     87         self = obj or args[0]
---> 88         return sync(self.loop, func, *args, **kwargs)
     89
     90     return wrapper

/opt/miniconda/lib/python3.7/site-packages/fsspec/asyn.py in sync(loop, func, timeout, *args, **kwargs)
     67         raise FSTimeoutError
     68     if isinstance(result[0], BaseException):
---> 69         raise result[0]
     70     return result[0]
     71

/opt/miniconda/lib/python3.7/site-packages/fsspec/asyn.py in _runner(event, coro, result, timeout)
     23         coro = asyncio.wait_for(coro, timeout=timeout)
     24     try:
---> 25         result[0] = await coro
     26     except Exception as ex:
     27         result[0] = ex

/opt/miniconda/lib/python3.7/site-packages/s3fs/core.py in _info(self, path, bucket, key, refresh, version_id)
   1066                         # If the permissions aren't enough for scanning a prefix
   1067                         # then fall back to using normal HEAD_OBJECT
-> 1068                         out = await self._version_aware_info(path, version_id)
   1069                 if out:
   1070                     return out

/opt/miniconda/lib/python3.7/site-packages/s3fs/core.py in _version_aware_info(self, path, version_id)
   1015                 Key=key,
   1016                 **version_id_kw(version_id),
-> 1017                 **self.req_kw,
   1018             )
   1019         except FileNotFoundError:

/opt/miniconda/lib/python3.7/site-packages/s3fs/core.py in _call_s3(self, method, *akwarglist, **kwargs)
    263             except Exception as e:
    264                 err = e
--> 265         raise translate_boto_error(err)
    266
    267     call_s3 = sync_wrapper(_call_s3)

PermissionError: Forbidden

pcmoritz commented 2 years ago

This is how an anonymous user can be used -- it works even if the AWS role prevents S3 access:

s = ray.data.read_binary_files("s3://anonymous@air-example-data/ocr_tiny_dataset", include_paths=True)
c21 commented 2 years ago

I think the follow-up items to remove this class of problem for users are: (a) improve the error message (it should say explicitly that there is a permission error instead of "No response body" -- ideally it would also say which method it tried to call on the bucket for which the permission error happened); (b) document that we pick up S3 credentials and how an anonymous user can be used; (c) document which privileges people need to give to their buckets so they can read their data with our library. Users will appreciate this since it is a little tricky to get right.

@pcmoritz - thanks for digging into this. The plan sounds good to me. Let me start working on improving the error message in the Dataset read APIs.

c21 commented 2 years ago

I have a PR ready for review that improves the error message and makes it more actionable: https://github.com/ray-project/ray/pull/26619.

dmatrix commented 2 years ago

Thanks to the team for getting to the bottom of this and sorting it out!

richardliaw commented 2 years ago

It was the unknown error. How about we try moving the examples onto the S3 bucket directly in a draft PR to see if it works?

jf87 commented 1 year ago

I am still getting a strange error when executing the examples and accessing the S3 buckets.

import ray
dataset = ray.data.read_csv("s3://anonymous@air-example-data/iris.csv")
dataset.show(limit=1)
RayTaskError(OSError): ray::_get_read_tasks() (pid=97998, ip=10.10.1.152)
  File "/home/bdp23/miniconda3/envs/bdp/lib/python3.10/site-packages/ray/data/read_api.py", line 1595, in _get_read_tasks
    reader = ds.create_reader(**kwargs)
  File "/home/bdp23/miniconda3/envs/bdp/lib/python3.10/site-packages/ray/data/datasource/file_based_datasource.py", line 216, in create_reader
    return _FileBasedDatasourceReader(self, **kwargs)
  File "/home/bdp23/miniconda3/envs/bdp/lib/python3.10/site-packages/ray/data/datasource/file_based_datasource.py", line 378, in __init__
    paths, self._filesystem = _resolve_paths_and_filesystem(paths, filesystem)
  File "/home/bdp23/miniconda3/envs/bdp/lib/python3.10/site-packages/ray/data/datasource/file_based_datasource.py", line 639, in _resolve_paths_and_filesystem
    resolved_filesystem, resolved_path = _resolve_filesystem_and_path(
  File "/home/bdp23/miniconda3/envs/bdp/lib/python3.10/site-packages/pyarrow/fs.py", line 187, in _resolve_filesystem_and_path
    filesystem, path = FileSystem.from_uri(path)
  File "pyarrow/_fs.pyx", line 470, in pyarrow._fs.FileSystem.from_uri
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: When resolving region for bucket 'air-example-data': AWS Error NETWORK_CONNECTION during HeadBucket operation: curlCode: 6, Couldn't resolve host name

Any idea? Do I need to set up AWS credentials?