[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
# First, in one terminal, start a moto standalone server with `moto_server -p 5555`
import boto3
import os
import pandas
import pyarrow.fs


def test_pandas_read_orc():
    endpoint_port = "5555"
    endpoint_uri = f"http://localhost:{endpoint_port}/"
    region = "us-east-1"

    # Fake credentials so that nothing can authenticate against real AWS
    os.environ["AWS_ACCESS_KEY_ID"] = "fake"
    os.environ["AWS_SECRET_ACCESS_KEY"] = "fake"
    os.environ["AWS_SECURITY_TOKEN"] = "fake"
    os.environ["AWS_SESSION_TOKEN"] = "fake"

    # Create the bucket on the moto server and upload the sample file
    s3_resource = boto3.resource("s3", endpoint_url=endpoint_uri, region_name=region)
    bucket_name = "mybucket"
    s3_resource.Bucket(bucket_name).create()
    s3_resource.Bucket(bucket_name).upload_file(
        "userdata1.orc",
        "userdata1.orc",
    )

    filesystem = pyarrow.fs.S3FileSystem(endpoint_override=endpoint_uri, region=region)
    print(
        filesystem.get_file_info("mybucket/userdata1.orc")
    )  # outputs <FileInfo for 'mybucket/userdata1.orc': type=FileType.File, size=119367>,
    # proving the filesystem itself contacts the moto standalone server

    df = pandas.read_orc("s3://mybucket/userdata1.orc", filesystem=filesystem)
    # raises botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden


test_pandas_read_orc()
Issue Description
pandas does not respect the filesystem passed to .read_orc() when it opens a handle for the file. If you provide a mocked S3 filesystem backend, pandas bypasses it and tries to contact the real S3 backend, which makes unit tests against a mocked S3 impossible and potentially dangerous!
Here is the sample ORC file that I kept next to the test file and uploaded to the mock S3 server: userdata1.orc.zip. Remove the .zip extension before use: GitHub doesn't support uploading .orc files, but the attachment is in fact an ORC file as is.
Note that you can reproduce this with an invalid .orc file, as the error happens before any ORC data is read.
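Since the note above says the failure occurs before any ORC data is parsed, a placeholder file should be enough to reproduce it without downloading the attachment. A minimal sketch follows; the helper name `write_placeholder_orc` and the byte content are illustrative assumptions, not part of the original report:

```python
from pathlib import Path


def write_placeholder_orc(path="userdata1.orc"):
    """Write a few non-ORC bytes so there is a file to upload to the mock bucket."""
    target = Path(path)
    # The content is deliberately not valid ORC; per the report, the error
    # is raised while opening the handle, before the content is parsed.
    target.write_bytes(b"not a real ORC file")
    return target
```

Calling `write_placeholder_orc()` before running the reproducible example creates `userdata1.orc` in the current directory, which is the file the example uploads.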
Error produced:
Traceback (most recent call last):
  File ".venv/lib/python3.10/site-packages/s3fs/core.py", line 113, in _error_wrapper
    return await func(*args, **kwargs)
  File ".venv/lib/python3.10/site-packages/aiobotocore/client.py", line 409, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "test_s3_pandas_min.py", line 36, in <module>
    test_pandas_read_orc()
  File "test_s3_pandas_min.py", line 23, in test_pandas_read_orc
    df = pandas.read_orc("s3://mybucket/userdata1.orc", filesystem=filesystem)
  File ".venv/lib/python3.10/site-packages/pandas/io/orc.py", line 109, in read_orc
    with get_handle(path, "rb", is_text=False) as handles:
  File ".venv/lib/python3.10/site-packages/pandas/io/common.py", line 730, in get_handle
    ioargs = _get_filepath_or_buffer(
  File ".venv/lib/python3.10/site-packages/pandas/io/common.py", line 443, in _get_filepath_or_buffer
    ).open()
  File ".venv/lib/python3.10/site-packages/fsspec/core.py", line 135, in open
    return self.__enter__()
...
Expected Behavior
I would expect .read_orc() to use the provided filesystem for every access, instead of trying to talk to the real S3, and to succeed at reading the ORC file.
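Until that behavior is available, one possible workaround is to skip pandas' handle logic and read the file through the pyarrow filesystem directly. This is a sketch under assumptions rather than a confirmed fix: `read_orc_via_filesystem` is a hypothetical helper name, and the sketch assumes the `filesystem` object from the example above can open the file, as its `get_file_info` call suggests.

```python
def read_orc_via_filesystem(filesystem, path, columns=None):
    """Read an ORC file through an explicit pyarrow filesystem.

    ``path`` uses pyarrow's "bucket/key" form, without the ``s3://`` scheme.
    """
    # Imported lazily so this module can be imported without pyarrow installed.
    import pyarrow.orc

    # Open the file via the given filesystem only, so a mocked backend
    # is the sole thing contacted.
    with filesystem.open_input_file(path) as source:
        table = pyarrow.orc.ORCFile(source).read(columns=columns)
    return table.to_pandas()
```

With the setup from the reproducible example, the call would be `df = read_orc_via_filesystem(filesystem, "mybucket/userdata1.orc")`.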
My initial investigation
The error is raised before the .read_table call happens: the get_handle() call fails with PermissionError('Forbidden'). get_handle() does not use the custom filesystem I provided, and read_table doesn't allow passing through storage_options (even though _get_filepath_or_buffer does accept that).
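The dispatch I would have expected can be sketched with the standard library alone. Everything here is illustrative: `open_orc_source` and `FakeFileSystem` are hypothetical names, not pandas or pyarrow APIs. The only thing assumed of the filesystem is an `open_input_file(path)` method, which pyarrow filesystems provide.

```python
import io
from typing import Any, BinaryIO, Callable, Optional


def open_orc_source(
    path: str,
    filesystem: Optional[Any] = None,
    fallback_open: Optional[Callable[[str], BinaryIO]] = None,
) -> BinaryIO:
    """Hypothetical helper: return a binary handle for ``path``.

    When a filesystem is supplied it is the only backend consulted,
    so a mocked filesystem is never bypassed.
    """
    if filesystem is not None:
        # pyarrow filesystems take "bucket/key" paths without the URI scheme.
        _, sep, rest = path.partition("://")
        return filesystem.open_input_file(rest if sep else path)
    if fallback_open is None:
        raise ValueError("no filesystem given and no fallback opener configured")
    return fallback_open(path)


class FakeFileSystem:
    """In-memory stand-in exposing only ``open_input_file`` (illustrative)."""

    def __init__(self, files):
        self.files = files
        self.opened = []

    def open_input_file(self, path):
        # Record the access so a test can assert which backend was used.
        self.opened.append(path)
        return io.BytesIO(self.files[path])
```

The point of the design is that the fallback opener (standing in for the default fsspec-based path) is reached only when no filesystem was passed, which is the opposite of what the traceback above shows.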
Installed Versions