pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

PanicException on reading Parquet file from S3 #17864

Open jonimatix opened 1 month ago

jonimatix commented 1 month ago

Reproducible example

parquet_file_name='s3://test-inputs-dataset/GAME_INFO/server_id=3345/date=2023-03-01/hour=06/GAME_INFO_3345_2023-03-01_06.parquet'

df = pl.read_parquet(parquet_file_name, columns=col_list, use_pyarrow=False)
# or
df = pl.scan_parquet(parquet_file_name).collect()

# This works fine: df = pl.scan_parquet(parquet_file_name)
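(The bare scan_parquet call presumably succeeds because it is lazy: it only builds a query plan, and nothing is fetched from S3 until the plan executes. A minimal sketch of the distinction, reusing the same path:)

import polars as pl

# Building the lazy frame does not touch S3, so this line succeeds.
lf = pl.scan_parquet(parquet_file_name)

# The S3 request (and hence the panic) only happens on execution.
df = lf.collect()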

Log output

PanicException                            Traceback (most recent call last)
File c:\Projects\Proj\scripts\main.py:1
----> 1 df = pl.read_parquet(parquet_file_name, columns=col_list, use_pyarrow=False) # scan_parquet # n_rows=10,

File c:\Users\jm\.conda\envs\env\Lib\site-packages\polars\_utils\deprecation.py:91, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
     86 @wraps(function)
     87 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
     88     _rename_keyword_argument(
     89         old_name, new_name, kwargs, function.__qualname__, version
     90     )
---> 91     return function(*args, **kwargs)

File c:\Users\jm\.conda\envs\env\Lib\site-packages\polars\_utils\deprecation.py:91, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
     86 @wraps(function)
     87 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
     88     _rename_keyword_argument(
     89         old_name, new_name, kwargs, function.__qualname__, version
     90     )
---> 91     return function(*args, **kwargs)

File c:\Users\jm\.conda\envs\env\Lib\site-packages\polars\io\parquet\functions.py:206, in read_parquet(source, columns, n_rows, row_index_name, row_index_offset, parallel, use_statistics, hive_partitioning, glob, hive_schema, try_parse_hive_dates, rechunk, low_memory, storage_options, retries, use_pyarrow, pyarrow_options, memory_map)
    203     else:
    204         lf = lf.select(columns)
--> 206 return lf.collect()
...
   1939 # Only for testing purposes atm.
   1940 callback = _kwargs.get("post_opt_callback")
-> 1942 return wrap_df(ldf.collect(callback))

PanicException: called `Result::unwrap()` on an `Err` value: InvalidHeaderValue

Issue description

Calling read_parquet with use_pyarrow=False raises a PanicException.

I noticed that the call below works OK when I add use_pyarrow=True, but it is very slow:

df = pl.read_parquet(parquet_file_name, columns=col_list, use_pyarrow=True)

The file that I am reading is stored in S3, and the S3 file path (parquet_file_name) has no spaces in it. If I download the Parquet file and open it from local disk, Polars does not raise any issues.

Also note: before I upgraded Polars to 1.2.1, I was using version 0.17 and read_parquet did not raise any issues!

Expected behavior

Dataframe read without errors

Installed versions

--------Version info---------
Polars:               1.2.1
Index type:           UInt32
Platform:             Windows-10-10.0.22621-SP0
Python:               3.11.9 | packaged by Anaconda, Inc. | (main, Apr 19 2024, 16:40:41) [MSC v.1916 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fastexcel:
fsspec:               2024.6.1
gevent:
great_tables:
hvplot:
matplotlib:           3.9.1
nest_asyncio:         1.6.0
numpy:                2.0.1
openpyxl:
pandas:               2.2.2
pyarrow:              17.0.0
pydantic:
pyiceberg:
sqlalchemy:
torch:
xlsx2csv:
xlsxwriter:

deanm0000 commented 1 month ago

You should edit the title to mention that the issue is about reading from cloud storage. Please run this and post the full output; maybe that'll help.

import os
# Enable full Rust backtraces; this must be set before polars is imported.
os.environ['RUST_BACKTRACE'] = 'full'
import polars as pl
parquet_file_name = ...
df = pl.read_parquet(parquet_file_name, columns=col_list, use_pyarrow=False)

The team treats all panic errors as bugs, but it's likely that you need to set storage_options.

See here https://docs.pola.rs/api/python/stable/reference/api/polars.scan_parquet.html#polars-scan-parquet

and here

https://docs.rs/object_store/latest/object_store/aws/enum.AmazonS3ConfigKey.html

The environment variables that pyarrow (via fsspec) looks for aren't 100% in sync with the ones Polars (via object_store) looks for, so that's probably where the fix lies.
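For instance, a minimal sketch of the environment-variable route, assuming object_store picks up the standard AWS_* variables from the AmazonS3ConfigKey docs above (all values are placeholders):

import os

# Placeholder credentials; object_store reads these when the query executes.
os.environ['AWS_ACCESS_KEY_ID'] = '<access-key-id>'
os.environ['AWS_SECRET_ACCESS_KEY'] = '<secret-access-key>'
os.environ['AWS_REGION'] = 'eu-west-1'  # placeholder; or AWS_DEFAULT_REGION, per the comment below

import polars as pl

df = pl.read_parquet(parquet_file_name, columns=col_list, use_pyarrow=False)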

jonimatix commented 1 month ago

Updated the title as suggested.

jonimatix commented 1 month ago

As per your suggestion to use storage_options:


# WORKS
df = pl.read_parquet(parquet_file_name, columns=col_list, use_pyarrow=False, 
                     storage_options = {
                            "aws_access_key_id": os.environ.get('AWS_ACCESS_KEY_ID'),
                            "aws_secret_access_key": os.environ.get('AWS_SECRET_ACCESS_KEY'),
                            "aws_region": AWS_REGION})

# DOES NOT WORK 
df = pl.read_parquet(parquet_file_name, columns=col_list, use_pyarrow=True, 
                     storage_options = {
                            "aws_access_key_id": os.environ.get('AWS_ACCESS_KEY_ID'),
                            "aws_secret_access_key": os.environ.get('AWS_SECRET_ACCESS_KEY'),
                            "aws_region": AWS_REGION})

The error message is

TypeError: AioSession.__init__() got an unexpected keyword argument 'aws_access_key_id'

deanm0000 commented 1 month ago

The new error is because there isn't parity between the key names fsspec expects and the key names object_store expects. If you set AWS_REGION as an env var, does pl.read_parquet(parquet_file_name, columns=col_list, use_pyarrow=False) work? I don't use S3, so I'm just guessing at that.

From here it looks like you should set the environment variable AWS_DEFAULT_REGION to the right region, and then I think pl.read_parquet(parquet_file_name, columns=col_list, use_pyarrow=False) would work.
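For the use_pyarrow=True path, a sketch of what the fsspec-style option names might look like (assuming s3fs is the underlying filesystem; 'key', 'secret' and 'client_kwargs' are s3fs/botocore names, not object_store ones):

import os
import polars as pl

# s3fs expects 'key'/'secret' rather than object_store's
# 'aws_access_key_id'/'aws_secret_access_key' naming.
df = pl.read_parquet(
    parquet_file_name,
    columns=col_list,
    use_pyarrow=True,
    storage_options={
        'key': os.environ.get('AWS_ACCESS_KEY_ID'),
        'secret': os.environ.get('AWS_SECRET_ACCESS_KEY'),
        'client_kwargs': {'region_name': AWS_REGION},  # botocore naming
    },
)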

Kuinox commented 1 week ago

If you want a minimal reproduction of the panic:

import polars as pl
pl.scan_parquet("s3://foobar/file.parquet").collect()