pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: read_orc does not use the provided filesystem for all operations #58746

Open mjperrone opened 5 months ago

mjperrone commented 5 months ago

Pandas version checks

Reproducible Example

# first in one terminal, start a moto standalone server with `moto_server -p 5555`
import boto3
import os
import pandas
import pyarrow.fs

def test_pandas_read_orc():
    endpoint_port = "5555"
    endpoint_uri = f"http://localhost:{endpoint_port}/"
    region = "us-east-1"
    os.environ["AWS_ACCESS_KEY_ID"] = "fake"
    os.environ["AWS_SECRET_ACCESS_KEY"] = "fake"
    os.environ["AWS_SECURITY_TOKEN"] = "fake"
    os.environ["AWS_SESSION_TOKEN"] = "fake"

    s3_resource = boto3.resource("s3", endpoint_url=endpoint_uri, region_name=region)
    bucket_name = "mybucket"
    s3_resource.Bucket(bucket_name).create()
    s3_resource.Bucket(bucket_name).upload_file(
        "userdata1.orc",
        "userdata1.orc",
    )
    filesystem = pyarrow.fs.S3FileSystem(endpoint_override=endpoint_uri, region=region)
    print(
        filesystem.get_file_info("mybucket/userdata1.orc")
    )  # outputs <FileInfo for 'mybucket/userdata1.orc': type=FileType.File, size=119367>,
    # proving the filesystem itself contacts the moto standalone server

    df = pandas.read_orc("s3://mybucket/userdata1.orc", filesystem=filesystem)
    # raises botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden

test_pandas_read_orc()

Issue Description

Pandas does not respect the filesystem passed to .read_orc() when obtaining a handle for the file. This means that if you provide a mocked S3 filesystem backend, pandas bypasses it and tries to contact the real S3 backend, which makes unit tests with a mocked S3 impossible and is potentially dangerous!

Here is a sample ORC file that I placed next to the test file for uploading to the mock S3 server: userdata1.orc.zip (remove the .zip extension, as GitHub doesn't support uploading .orc files, but it is in fact an ORC file as-is). Note that you can reproduce this with an invalid .orc file, since the error occurs before any ORC data is read.

Error produced:

Traceback (most recent call last):
  File ".venv/lib/python3.10/site-packages/s3fs/core.py", line 113, in _error_wrapper
    return await func(*args, **kwargs)
  File ".venv/lib/python3.10/site-packages/aiobotocore/client.py", line 409, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "test_s3_pandas_min.py", line 36, in <module>
    test_pandas_read_orc()
  File "test_s3_pandas_min.py", line 23, in test_pandas_read_orc
    df = pandas.read_orc("s3://mybucket/userdata1.orc", filesystem=filesystem)
  File ".venv/lib/python3.10/site-packages/pandas/io/orc.py", line 109, in read_orc
    with get_handle(path, "rb", is_text=False) as handles:
  File ".venv/lib/python3.10/site-packages/pandas/io/common.py", line 730, in get_handle
    ioargs = _get_filepath_or_buffer(
  File ".venv/lib/python3.10/site-packages/pandas/io/common.py", line 443, in _get_filepath_or_buffer
    ).open()
  File ".venv/lib/python3.10/site-packages/fsspec/core.py", line 135, in open
    return self.__enter__()
    ...

Expected Behavior

I would expect the .read_orc function to fully use the provided filesystem instead of trying to talk to the real S3, and to succeed at reading the ORC file.

My initial investigation

Before the .read_table call happens, it errors at the get_handle() call with PermissionError('Forbidden'). get_handle() is not using the custom filesystem I provided, and read_table doesn't allow passing through storage_options (even though _get_filepath_or_buffer does accept that).

Installed Versions

INSTALLED VERSIONS
------------------
commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python : 3.10.12.final.0
python-bits : 64
OS : Darwin
OS-release : 22.6.0
Version : Darwin Kernel Version 22.6.0: Mon Feb 19 19:43:13 PST 2024; root:xnu-8796.141.3.704.6~1/RELEASE_ARM64_T6020
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 68.1.2
pip : 24.0
Cython : None
pytest : 7.4.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.7
jinja2 : 3.1.3
IPython : None
pandas_datareader : None
adbc-driver-postgresql : None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : 2024.2.0
fsspec : 2024.3.1
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 16.1.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : 2024.3.1
scipy : None
sqlalchemy : 2.0.28
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None
mjperrone commented 5 months ago

Tagging @mroeschke as he implemented the filesystem argument for ORC and Parquet.

@jorisvandenbossche as author of the initial Parquet implementation.