pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Issue reading S3 files #18907

Open stevenmanton opened 1 month ago

stevenmanton commented 1 month ago

Checks

Reproducible example

import os

import boto3
import pandas as pd
import polars as pl
import pyarrow.dataset as ds
import s3fs
from pyarrow.fs import S3FileSystem

# Credentials come from a named profile rather than static environment variables.
os.environ["AWS_PROFILE"] = "develop"

uri = "s3://bucket/path/to/file.parquet"

# This line fails:
_ = pl.read_parquet(uri)

# However, these all pass:
s3 = s3fs.S3FileSystem()
s3.ls(uri)

_ = pd.read_parquet(uri)

dataset = ds.dataset(uri)

_ = pl.read_parquet(uri, use_pyarrow=True)

S3FileSystem().get_file_info(uri[5:])  # drop the "s3://" scheme prefix

boto3.client('s3').head_object(Bucket="bucket", Key="path/to/file.parquet")
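
One way to confirm that profile resolution (rather than bucket permissions) is the culprit is to export the profile's resolved credentials as static environment variables and retry; a minimal sketch, assuming the same `develop` profile and bucket/key as above:

```python
import os

import boto3
import polars as pl

# Resolve the named profile's credentials with boto3 and expose them as
# static environment variables, which polars' S3 backend reads directly.
creds = (
    boto3.Session(profile_name="develop")
    .get_credentials()
    .get_frozen_credentials()
)
os.environ["AWS_ACCESS_KEY_ID"] = creds.access_key
os.environ["AWS_SECRET_ACCESS_KEY"] = creds.secret_key
if creds.token:  # present for SSO / assumed-role profiles
    os.environ["AWS_SESSION_TOKEN"] = creds.token

# If profile resolution is the issue, this call should now succeed.
_ = pl.read_parquet("s3://bucket/path/to/file.parquet")
```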

Log output

Async thread count: 2

---------------------------------------------------------------------------
ComputeError                              Traceback (most recent call last)
Cell In[13], line 1
----> 1 _ = pl.read_parquet(uri)

File ~/.local/share/hatch/env/pip-compile/amzn-product-dna-science-devel/t-spEv9X/antonstv/lib/python3.12/site-packages/polars/_utils/deprecation.py:91, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
     86 @wraps(function)
     87 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
     88     _rename_keyword_argument(
     89         old_name, new_name, kwargs, function.__qualname__, version
     90     )
---> 91     return function(*args, **kwargs)

File ~/.local/share/hatch/env/pip-compile/amzn-product-dna-science-devel/t-spEv9X/antonstv/lib/python3.12/site-packages/polars/_utils/deprecation.py:91, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
     86 @wraps(function)
     87 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
     88     _rename_keyword_argument(
     89         old_name, new_name, kwargs, function.__qualname__, version
     90     )
---> 91     return function(*args, **kwargs)

File ~/.local/share/hatch/env/pip-compile/amzn-product-dna-science-devel/t-spEv9X/antonstv/lib/python3.12/site-packages/polars/io/parquet/functions.py:209, in read_parquet(source, columns, n_rows, row_index_name, row_index_offset, parallel, use_statistics, hive_partitioning, glob, hive_schema, try_parse_hive_dates, rechunk, low_memory, storage_options, retries, use_pyarrow, pyarrow_options, memory_map)
    206     else:
    207         lf = lf.select(columns)
--> 209 return lf.collect()

File ~/.local/share/hatch/env/pip-compile/amzn-product-dna-science-devel/t-spEv9X/antonstv/lib/python3.12/site-packages/polars/lazyframe/frame.py:2033, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, no_optimization, streaming, engine, background, _eager, **_kwargs)
   2031 # Only for testing purposes
   2032 callback = _kwargs.get("post_opt_callback", callback)
-> 2033 return wrap_df(ldf.collect(callback))

ComputeError: Generic S3 error: Client error with status 403 Forbidden: No Body

Issue description

I'm unable to load files from S3 in certain environments. The issue seems to be related to using named AWS profiles (`AWS_PROFILE`): other tools (e.g., boto3, pyarrow, s3fs) resolve the profile without any problem, but `pl.read_parquet` fails with a 403. Perhaps the internal Rust implementation that handles AWS access doesn't pick up the environment variable? (Though the documentation states: "Polars looks for these as environment variable")
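
In the meantime, one workaround is to resolve the credentials with boto3 and hand them to polars explicitly via `storage_options`; a minimal sketch, assuming the `develop` profile and object_store's `aws_*` configuration keys:

```python
import boto3
import polars as pl

# Resolve credentials from the named profile and pass them explicitly,
# bypassing the profile lookup inside the Rust object_store layer.
creds = (
    boto3.Session(profile_name="develop")
    .get_credentials()
    .get_frozen_credentials()
)

storage_options = {
    "aws_access_key_id": creds.access_key,
    "aws_secret_access_key": creds.secret_key,
    "aws_region": "us-east-1",  # assumption: replace with the bucket's actual region
}
if creds.token:
    storage_options["aws_session_token"] = creds.token

df = pl.read_parquet(
    "s3://bucket/path/to/file.parquet",
    storage_options=storage_options,
)
```

Credentials resolved this way are static, so for expiring SSO or assumed-role sessions the resolution step has to be repeated once they expire.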

Expected behavior

The parquet file should load seamlessly from S3.

Installed versions

```
--------Version info---------
Polars:              1.8.1
Index type:          UInt32
Platform:            Linux-5.10.225-191.878.amzn2int.x86_64-x86_64-with-glibc2.26
Python:              3.12.3 (main, Apr 15 2024, 18:01:35) [Clang 17.0.6 ]

----Optional dependencies----
adbc_driver_manager
altair
cloudpickle          2.2.1
connectorx
deltalake
fastexcel
fsspec               2024.6.1
gevent
great_tables
matplotlib           3.9.2
nest_asyncio         1.6.0
numpy                1.26.4
openpyxl
pandas               2.1.4
pyarrow              15.0.2
pydantic             2.9.2
pyiceberg
sqlalchemy           2.0.35
torch                2.4.1+cu121
xlsx2csv
xlsxwriter
```
avimallu commented 1 month ago

As mentioned in this issue, you'll need to raise this as a feature request against the object_store Rust crate, as Polars likely has limited control over its functionality.

Also, your code seems to be saying that Pandas fails to load the S3 URI. 🤔

# This line fails:
_ = pd.read_parquet(uri)
tustvold commented 1 month ago

I've filed https://github.com/pola-rs/polars/issues/18979 to expose the necessary functionality in Polars to allow you to resolve this.
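
Until that lands, passing explicit credentials via `storage_options` (as above) bridges the gap. For reference, a pluggable credential hook might look roughly like the following; the `credential_provider` parameter name and return shape here are hypothetical, based only on the stated goal of the linked issue:

```python
import boto3


def aws_profile_credentials() -> tuple[dict[str, str], int | None]:
    """Return object_store-style credential keys plus an optional expiry timestamp."""
    creds = (
        boto3.Session(profile_name="develop")
        .get_credentials()
        .get_frozen_credentials()
    )
    out = {
        "aws_access_key_id": creds.access_key,
        "aws_secret_access_key": creds.secret_key,
    }
    if creds.token:
        out["aws_session_token"] = creds.token
    return out, None  # no explicit expiry


# Hypothetical usage once such a hook is exposed (parameter name is illustrative only):
# df = pl.read_parquet(uri, credential_provider=aws_profile_credentials)
```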