Tried this with the latest master and it worked without any issues; closing for now.
Reopening since it looks like this can get triggered if no local AWS credentials are configured, even though this is a public bucket. cc @dmatrix
It seems to get triggered in two scenarios: 1) no local AWS creds exist, or 2) AWS creds are expired and need to be renewed.
Very strange given that this is a public bucket that should support anonymous access!
Yes, indeed. That's what baffles me too.
Wonder if this is also fixed with #25644 / https://github.com/ray-project/ray/pull/25673
Does it also happen with pandas?
Pandas works after installing the dependencies with `pip install boto s3fs`:

```python
import pandas as pd
import boto

if __name__ == "__main__":
    df = pd.read_parquet("s3://ursa-labs-taxi-data/2009/01/data.parquet")
    print(df.count())
```
```
vendor_id             14092413
pickup_at             14092413
dropoff_at            14092413
passenger_count       14092413
trip_distance         14092413
pickup_longitude      14092413
pickup_latitude       14092413
rate_code_id                 0
store_and_fwd_flag        1224
dropoff_longitude     14092413
dropoff_latitude      14092413
payment_type          14092413
fare_amount           14092413
extra                 14092413
mta_tax                      0
tip_amount            14092413
tolls_amount          14092413
total_amount          14092413
dtype: int64
```
The same code with Ray Data fails; it only works after you run `aws sso login`. That defeats the purpose of a public bucket, which should be readable anonymously -- one shouldn't have to authenticate.
```python
import ray

if __name__ == "__main__":
    ray.init()
    ny_taxi_ds = ray.data.read_parquet("s3://ursa-labs-taxi-data/2009/01/data.parquet")
    print(ny_taxi_ds.count())
```
```
/usr/local/anaconda3/envs/ray-summit-training/bin/python /Users/jules/git-repos/misc-code/py/ray/ray_datasets/nyc_taxi.py
2022-06-17 17:42:08,550 INFO services.py:1470 -- View the Ray dashboard at http://127.0.0.1:8265
Traceback (most recent call last):
File "/Users/jules/git-repos/misc-code/py/ray/ray_datasets/nyc_taxi.py", line 6, in <module>
ny_taxi_ds = ray.data.read_parquet("s3://ursa-labs-taxi-data/2009/01/data.parquet")
File "/usr/local/anaconda3/envs/ray-summit-training/lib/python3.8/site-packages/ray/data/read_api.py", line 324, in read_parquet
return read_datasource(
File "/usr/local/anaconda3/envs/ray-summit-training/lib/python3.8/site-packages/ray/data/read_api.py", line 240, in read_datasource
read_tasks = ray.get(
File "/usr/local/anaconda3/envs/ray-summit-training/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/usr/local/anaconda3/envs/ray-summit-training/lib/python3.8/site-packages/ray/worker.py", line 1831, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(OSError): ray::_prepare_read() (pid=97014, ip=127.0.0.1)
File "/usr/local/anaconda3/envs/ray-summit-training/lib/python3.8/site-packages/ray/data/read_api.py", line 1041, in _prepare_read
return ds.prepare_read(parallelism, **kwargs)
File "/usr/local/anaconda3/envs/ray-summit-training/lib/python3.8/site-packages/ray/data/datasource/parquet_datasource.py", line 102, in prepare_read
pq_ds = pq.ParquetDataset(
File "/usr/local/anaconda3/envs/ray-summit-training/lib/python3.8/site-packages/pyarrow/parquet.py", line 1309, in __new__
return _ParquetDatasetV2(
File "/usr/local/anaconda3/envs/ray-summit-training/lib/python3.8/site-packages/pyarrow/parquet.py", line 1698, in __init__
if filesystem.get_file_info(path_or_paths).is_file:
File "pyarrow/_fs.pyx", line 439, in pyarrow._fs.FileSystem.get_file_info
File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
OSError: When getting information for key '2009/01/data.parquet' in bucket 'ursa-labs-taxi-data': AWS Error [code 100]: No response body.
```
From my understanding, the S3 file s3://ursa-labs-taxi-data/2019/06/data.parquet has its permissions set to public, so the following code should now work:
```python
>>> import ray
>>> ds = ray.data.read_parquet("s3://ursa-labs-taxi-data/2019/06/data.parquet")
2022-07-11 13:21:18,352 WARNING read_api.py:260 -- The number of blocks in this dataset (1) limits its parallelism to 1 concurrent tasks. This is much less than the number of available CPU slots in the cluster. Use `.repartition(n)` to increase the number of dataset blocks.
>>> ds.count()
6941024
```
It looks like this is now working without credentials for `read_parquet` (possibly -- I'm not 100% sure; I tried to be as careful as possible to remove all credentials, but I can't be certain none are left). It does not, however, currently seem to work for `ray.data.read_binary_files`.

EDIT: This was wrong -- `ray.data.read_binary_files` works on the taxi dataset without credentials too. It doesn't work on one of our own datasets even though it is publicly accessible (e.g. via https). Some bucket policy might be configured incorrectly.
EDIT: We figured it out: the bucket also needs to allow the action `s3:ListBucket` for the principal `"*"` -- before, it only had `s3:GetObject` and `s3:GetObjectVersion`. After the change, access works without credentials.
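For reference, here is a minimal sketch of such a bucket policy, applied with boto3 (the bucket name is a placeholder, and the statements mirror the actions described above):

```python
import json
import boto3

# "my-public-bucket" is a hypothetical placeholder.
# s3:GetObject/s3:GetObjectVersion alone let anonymous users fetch known keys,
# but pyarrow's dataset discovery also lists the bucket, hence s3:ListBucket.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": "*",
            "Action": ["s3:GetObject", "s3:GetObjectVersion"],
            "Resource": "arn:aws:s3:::my-public-bucket/*",
        },
        {
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:ListBucket",
            # Note: ListBucket applies to the bucket ARN itself, not /*.
            "Resource": "arn:aws:s3:::my-public-bucket",
        },
    ],
}

boto3.client("s3").put_bucket_policy(
    Bucket="my-public-bucket", Policy=json.dumps(policy)
)
```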
Btw, do we understand the difference between what @c21 and I are observing and what @dmatrix observed before? Was there a code change on our side? Different pyarrow versions?
@pcmoritz Thank you for making that change to the bucket policy!
> Btw, do we understand the difference between what @c21 and I are observing and what @dmatrix observed before? Was there a code change on our side? Different pyarrow versions?
I don't think we understand that discrepancy yet, right @c21?
> I don't think we understand that discrepancy yet, right @c21?
@clarkzinzow - I don't know, given that @pcmoritz and I made the same observation.
~Ping again - @dmatrix could you help us try the example (e.g. https://github.com/ray-project/ray/issues/19799#issuecomment-1180830683) again in your environment? Thanks.~
I followed up with @dmatrix offline and currently everything is working in his environment, but he will let us know if the error happens again.
Right now, I can only repro the problem in one setting, which is the CI (https://github.com/ray-project/ray/pull/26482). Will keep digging and seeing if I can find out more.
I also found this, which seems to be the same issue: https://github.com/ray-project/ray/issues/18102
@richardliaw What's the current CI failure, do you have a traceback?
There's a code 15 error (access denied) and a code 100 error (unknown error), and we've at least fixed the non-CI cases of the latter.
I think I now fully understand what is going on here -- after logging into our CI machine and running:

```
(base) root@a050ee0b6e70:/ray# aws s3 ls s3://air-example-data/

An error occurred (AccessDenied) when calling the ListObjectsV2 operation: Access Denied
```
Even though the bucket is public, the CI machine's role does not grant any access to S3. Since pyarrow picks up the credentials from that role, it will not let us access even that public bucket.
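One way around this (a minimal sketch, assuming pyarrow's `S3FileSystem` and the `filesystem` argument of Ray's read APIs) is to pass an explicitly anonymous filesystem, so the role's credentials are never consulted:

```python
import ray
from pyarrow import fs

# anonymous=True makes pyarrow skip credential resolution entirely, so a
# role without S3 access cannot get in the way of a public bucket.
s3 = fs.S3FileSystem(anonymous=True)

# With an explicit filesystem, the path is given as bucket/key
# without the s3:// scheme (pyarrow convention).
ds = ray.data.read_parquet(
    "ursa-labs-taxi-data/2009/01/data.parquet", filesystem=s3
)
```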
I think the follow-up items to remove this class of problem for users are: (a) improve the error message (it should say explicitly that there is a permission error instead of "No response body" -- ideally it would also say which method it tried to call on the bucket for which the permission error happened); (b) document that we pick up S3 credentials and how an anonymous user can be used; (c) document which privileges people need to grant on their buckets so they can read their data with our library. Users will appreciate this since it is a little tricky to get right.
For comparison, pandas gives a much better error message here -- it even tells us which operation failed, so we can debug the permissions (`An error occurred (AccessDenied) when calling the ListObjectsV2 operation: Access Denied`):
```
In [3]: import pandas as pd
In [4]: df = pd.read_parquet("s3://air-example-data/ocr_tiny_dataset")
---------------------------------------------------------------------------
ClientError Traceback (most recent call last)
/opt/miniconda/lib/python3.7/site-packages/s3fs/core.py in _call_s3(self, method, *akwarglist, **kwargs)
245 try:
--> 246 out = await method(**additional_kwargs)
247 return out
/opt/miniconda/lib/python3.7/site-packages/aiobotocore/client.py in _make_api_call(self, operation_name, api_params)
153 error_class = self.exceptions.from_code(error_code)
--> 154 raise error_class(parsed_response, operation_name)
155 else:
ClientError: An error occurred (AccessDenied) when calling the ListObjectsV2 operation: Access Denied
The above exception was the direct cause of the following exception:
PermissionError Traceback (most recent call last)
/opt/miniconda/lib/python3.7/site-packages/s3fs/core.py in _info(self, path, bucket, key, refresh, version_id)
1063 try:
-> 1064 out = await self._simple_info(path)
1065 except PermissionError:
/opt/miniconda/lib/python3.7/site-packages/s3fs/core.py in _simple_info(self, path)
983 MaxKeys=1,
--> 984 **self.req_kw,
985 )
/opt/miniconda/lib/python3.7/site-packages/s3fs/core.py in _call_s3(self, method, *akwarglist, **kwargs)
264 err = e
--> 265 raise translate_boto_error(err)
266
PermissionError: Access Denied
During handling of the above exception, another exception occurred:
ClientError Traceback (most recent call last)
/opt/miniconda/lib/python3.7/site-packages/s3fs/core.py in _call_s3(self, method, *akwarglist, **kwargs)
245 try:
--> 246 out = await method(**additional_kwargs)
247 return out
/opt/miniconda/lib/python3.7/site-packages/aiobotocore/client.py in _make_api_call(self, operation_name, api_params)
153 error_class = self.exceptions.from_code(error_code)
--> 154 raise error_class(parsed_response, operation_name)
155 else:
ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden
The above exception was the direct cause of the following exception:
PermissionError Traceback (most recent call last)
<ipython-input-4-68e4ad4ae34f> in <module>
----> 1 df = pd.read_parquet("s3://air-example-data/ocr_tiny_dataset")
/opt/miniconda/lib/python3.7/site-packages/pandas/io/parquet.py in read_parquet(path, engine, columns, storage_options, use_nullable_dtypes, **kwargs)
498 storage_options=storage_options,
499 use_nullable_dtypes=use_nullable_dtypes,
--> 500 **kwargs,
501 )
/opt/miniconda/lib/python3.7/site-packages/pandas/io/parquet.py in read(self, path, columns, use_nullable_dtypes, storage_options, **kwargs)
238 try:
239 result = self.api.parquet.read_table(
--> 240 path_or_handle, columns=columns, **kwargs
241 ).to_pandas(**to_pandas_kwargs)
242 if manager == "array":
/opt/miniconda/lib/python3.7/site-packages/pyarrow/parquet.py in read_table(source, columns, use_threads, metadata, use_pandas_metadata, memory_map, read_dictionary, filesystem, filters, buffer_size, partitioning, use_legacy_dataset, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit)
1913 ignore_prefixes=ignore_prefixes,
1914 pre_buffer=pre_buffer,
-> 1915 coerce_int96_timestamp_unit=coerce_int96_timestamp_unit
1916 )
1917 except ImportError:
/opt/miniconda/lib/python3.7/site-packages/pyarrow/parquet.py in __init__(self, path_or_paths, filesystem, filters, partitioning, read_dictionary, buffer_size, memory_map, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, **kwargs)
1696 except ValueError:
1697 filesystem = LocalFileSystem(use_mmap=memory_map)
-> 1698 if filesystem.get_file_info(path_or_paths).is_file:
1699 single_file = path_or_paths
1700 else:
/opt/miniconda/lib/python3.7/site-packages/pyarrow/_fs.pyx in pyarrow._fs.FileSystem.get_file_info()
/opt/miniconda/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
/opt/miniconda/lib/python3.7/site-packages/pyarrow/_fs.pyx in pyarrow._fs._cb_get_file_info()
/opt/miniconda/lib/python3.7/site-packages/pyarrow/fs.py in get_file_info(self, paths)
305 for path in paths:
306 try:
--> 307 info = self.fs.info(path)
308 except FileNotFoundError:
309 infos.append(FileInfo(path, FileType.NotFound))
/opt/miniconda/lib/python3.7/site-packages/fsspec/asyn.py in wrapper(*args, **kwargs)
86 def wrapper(*args, **kwargs):
87 self = obj or args[0]
---> 88 return sync(self.loop, func, *args, **kwargs)
89
90 return wrapper
/opt/miniconda/lib/python3.7/site-packages/fsspec/asyn.py in sync(loop, func, timeout, *args, **kwargs)
67 raise FSTimeoutError
68 if isinstance(result[0], BaseException):
---> 69 raise result[0]
70 return result[0]
71
/opt/miniconda/lib/python3.7/site-packages/fsspec/asyn.py in _runner(event, coro, result, timeout)
23 coro = asyncio.wait_for(coro, timeout=timeout)
24 try:
---> 25 result[0] = await coro
26 except Exception as ex:
27 result[0] = ex
/opt/miniconda/lib/python3.7/site-packages/s3fs/core.py in _info(self, path, bucket, key, refresh, version_id)
1066 # If the permissions aren't enough for scanning a prefix
1067 # then fall back to using normal HEAD_OBJECT
-> 1068 out = await self._version_aware_info(path, version_id)
1069 if out:
1070 return out
/opt/miniconda/lib/python3.7/site-packages/s3fs/core.py in _version_aware_info(self, path, version_id)
1015 Key=key,
1016 **version_id_kw(version_id),
-> 1017 **self.req_kw,
1018 )
1019 except FileNotFoundError:
/opt/miniconda/lib/python3.7/site-packages/s3fs/core.py in _call_s3(self, method, *akwarglist, **kwargs)
263 except Exception as e:
264 err = e
--> 265 raise translate_boto_error(err)
266
267 call_s3 = sync_wrapper(_call_s3)
PermissionError: Forbidden
In [5]:
```
This is how an anonymous user can be used -- it works even if the AWS role prevents S3 access:

```python
s = ray.data.read_binary_files("s3://anonymous@air-example-data/ocr_tiny_dataset", include_paths=True)
```
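The same `anonymous@` scheme should also work for the other read APIs (a sketch, assuming the public taxi bucket from earlier in the thread):

```python
import ray

# The anonymous@ prefix resolves to an anonymous S3 filesystem, so no local
# AWS credentials (or instance role permissions) are needed.
ds = ray.data.read_parquet("s3://anonymous@ursa-labs-taxi-data/2009/01/data.parquet")
print(ds.count())
```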
> I think the follow-up items to remove this class of problem for users are: (a) improve the error message (it should say explicitly that there is a permission error instead of "No response body" -- ideally it would also say which method it tried to call on the bucket for which the permission error happened); (b) document that we pick up S3 credentials and how an anonymous user can be used; (c) document which privileges people need to grant on their buckets so they can read their data with our library. Users will appreciate this since it is a little tricky to get right.
@pcmoritz - thanks for digging into this. The plan sounds good to me. Let me start working on whether we can improve the error message in the Dataset read APIs.
A PR to improve the error message and make it more actionable is ready for review: https://github.com/ray-project/ray/pull/26619.
Thanks to the team for getting to the bottom of this and sorting it out!
It was the unknown error. How about we try moving the examples onto the S3 bucket directly in a draft PR to see if it works?

On Wed, Jul 13, 2022 at 6:03 PM Clark Zinzow wrote:

> @richardliaw What's the current CI failure, do you have a traceback?
>
> There's a code 15 error (access denied) and a code 100 error (unknown error), and we've at least fixed the non-CI cases of the latter.
I am still getting a strange error when executing the examples and accessing the S3 buckets:

```python
import ray

dataset = ray.data.read_csv("s3://anonymous@air-example-data/iris.csv")
dataset.show(limit=1)
```
```
RayTaskError(OSError): ray::_get_read_tasks() (pid=97998, ip=10.10.1.152)
File "/home/bdp23/miniconda3/envs/bdp/lib/python3.10/site-packages/ray/data/read_api.py", line 1595, in _get_read_tasks
reader = ds.create_reader(**kwargs)
File "/home/bdp23/miniconda3/envs/bdp/lib/python3.10/site-packages/ray/data/datasource/file_based_datasource.py", line 216, in create_reader
return _FileBasedDatasourceReader(self, **kwargs)
File "/home/bdp23/miniconda3/envs/bdp/lib/python3.10/site-packages/ray/data/datasource/file_based_datasource.py", line 378, in __init__
paths, self._filesystem = _resolve_paths_and_filesystem(paths, filesystem)
File "/home/bdp23/miniconda3/envs/bdp/lib/python3.10/site-packages/ray/data/datasource/file_based_datasource.py", line 639, in _resolve_paths_and_filesystem
resolved_filesystem, resolved_path = _resolve_filesystem_and_path(
File "/home/bdp23/miniconda3/envs/bdp/lib/python3.10/site-packages/pyarrow/fs.py", line 187, in _resolve_filesystem_and_path
filesystem, path = FileSystem.from_uri(path)
File "pyarrow/_fs.pyx", line 470, in pyarrow._fs.FileSystem.from_uri
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: When resolving region for bucket 'air-example-data': AWS Error NETWORK_CONNECTION during HeadBucket operation: curlCode: 6, Couldn't resolve host name
```
Any idea? Do I need to set up AWS credentials?
Search before asking
Ray Component
Others
What happened + What you expected to happen
I ran the example on this page https://www.ray.io/ray-datasets
In particular
It failed with
Versions / Dependencies

- Ray: 1.6.0
- Pyarrow: 4.0.1
- Python: 3.7.4
- OS: macOS 10.15.7
Reproduction script
Included above
Anything else
No response
Are you willing to submit a PR?