pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust

`AWS_PROFILE` should be supported in cloud storage I/O config #18757

Open hutch3232 opened 4 days ago

hutch3232 commented 4 days ago


I have a variety of AWS/S3 profiles in my ~/.aws/credentials and ~/.aws/config files. I'd like to be able to select one either explicitly, by passing profile in storage_options, or implicitly, by setting the AWS_PROFILE environment variable, so that I can be sure to use the appropriate bucket keys, endpoint, and other configs.

I saw here that profile is not listed as a supported option: https://docs.rs/object_store/latest/object_store/aws/enum.AmazonS3ConfigKey.html

polars seems to use the first profile listed in those ~/.aws files, even if the profile name is not 'default'. Once I made sure the relevant profile was listed first, pl.read_parquet("s3://my-bucket/my-parquet/*.parquet") worked, but depending on file order is confusing and does not scale.

import polars as pl

pl.read_parquet("s3://my-bucket/my-parquet/*.parquet",
                storage_options={"profile": "my-profile"})

ComputeError                              Traceback (most recent call last)
Cell In[9], line 1
----> 1 pl.read_parquet("s3://my-bucket/my-parquet/*.parquet",
      2                 storage_options={"profile": "my-profile"})

File /opt/conda/lib/python3.9/site-packages/polars/_utils/deprecation.py:91, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
     86 @wraps(function)
     87 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
     88     _rename_keyword_argument(
     89         old_name, new_name, kwargs, function.__qualname__, version
     90     )
---> 91     return function(*args, **kwargs)

File /opt/conda/lib/python3.9/site-packages/polars/_utils/deprecation.py:91, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
     86 @wraps(function)
     87 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
     88     _rename_keyword_argument(
     89         old_name, new_name, kwargs, function.__qualname__, version
     90     )
---> 91     return function(*args, **kwargs)

File /opt/conda/lib/python3.9/site-packages/polars/io/parquet/functions.py:184, in read_parquet(source, columns, n_rows, row_index_name, row_index_offset, parallel, use_statistics, hive_partitioning, glob, hive_schema, try_parse_hive_dates, rechunk, low_memory, storage_options, retries, use_pyarrow, pyarrow_options, memory_map)
    181     source = [io.BytesIO(s) for s in source]  # type: ignore[arg-type, assignment]
    183 # For other inputs, defer to `scan_parquet`
--> 184 lf = scan_parquet(
    185     source,  # type: ignore[arg-type]
    186     n_rows=n_rows,
    187     row_index_name=row_index_name,
    188     row_index_offset=row_index_offset,
    189     parallel=parallel,
    190     use_statistics=use_statistics,
    191     hive_partitioning=hive_partitioning,
    192     hive_schema=hive_schema,
    193     try_parse_hive_dates=try_parse_hive_dates,
    194     rechunk=rechunk,
    195     low_memory=low_memory,
    196     cache=False,
    197     storage_options=storage_options,
    198     retries=retries,
    199     glob=glob,
    200     include_file_paths=None,
    201 )
    203 if columns is not None:
    204     if is_int_sequence(columns):

File /opt/conda/lib/python3.9/site-packages/polars/_utils/deprecation.py:91, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
     86 @wraps(function)
     87 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
     88     _rename_keyword_argument(
     89         old_name, new_name, kwargs, function.__qualname__, version
     90     )
---> 91     return function(*args, **kwargs)

File /opt/conda/lib/python3.9/site-packages/polars/_utils/deprecation.py:91, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
     86 @wraps(function)
     87 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
     88     _rename_keyword_argument(
     89         old_name, new_name, kwargs, function.__qualname__, version
     90     )
---> 91     return function(*args, **kwargs)

File /opt/conda/lib/python3.9/site-packages/polars/io/parquet/functions.py:425, in scan_parquet(source, n_rows, row_index_name, row_index_offset, parallel, use_statistics, hive_partitioning, glob, hive_schema, try_parse_hive_dates, rechunk, low_memory, cache, storage_options, retries, include_file_paths)
    420 elif is_path_or_str_sequence(source):
    421     source = [
    422         normalize_filepath(source, check_not_directory=False) for source in source
    423     ]
--> 425 return _scan_parquet_impl(
    426     source,  # type: ignore[arg-type]
    427     n_rows=n_rows,
    428     cache=cache,
    429     parallel=parallel,
    430     rechunk=rechunk,
    431     row_index_name=row_index_name,
    432     row_index_offset=row_index_offset,
    433     storage_options=storage_options,
    434     low_memory=low_memory,
    435     use_statistics=use_statistics,
    436     hive_partitioning=hive_partitioning,
    437     hive_schema=hive_schema,
    438     try_parse_hive_dates=try_parse_hive_dates,
    439     retries=retries,
    440     glob=glob,
    441     include_file_paths=include_file_paths,
    442 )

File /opt/conda/lib/python3.9/site-packages/polars/io/parquet/functions.py:476, in _scan_parquet_impl(source, n_rows, cache, parallel, rechunk, row_index_name, row_index_offset, storage_options, low_memory, use_statistics, hive_partitioning, glob, hive_schema, try_parse_hive_dates, retries, include_file_paths)
    472 else:
    473     # Handle empty dict input
    474     storage_options = None
--> 476 pylf = PyLazyFrame.new_from_parquet(
    477     source,
    478     sources,
    479     n_rows,
    480     cache,
    481     parallel,
    482     rechunk,
    483     parse_row_index_args(row_index_name, row_index_offset),
    484     low_memory,
    485     cloud_options=storage_options,
    486     use_statistics=use_statistics,
    487     hive_partitioning=hive_partitioning,
    488     hive_schema=hive_schema,
    489     try_parse_hive_dates=try_parse_hive_dates,
    490     retries=retries,
    491     glob=glob,
    492     include_file_paths=include_file_paths,
    493 )
    494 return wrap_ldf(pylf)

ComputeError: unknown configuration key: profile

FWIW, this functionality exists in pandas. I'm hoping to migrate code to polars, and this is kind of essential for that.

avimallu commented 3 days ago

I doubt Polars has control over object_store feature additions. I suggest you raise this request in their repo.

hutch3232 commented 3 days ago

Oh, I somehow didn't realize they were separate libraries. Looks like it used to be experimentally supported but that support was dropped. Bummer.

https://github.com/apache/arrow-rs/pull/4238
https://github.com/apache/arrow-rs/issues/4556
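
Until profile selection is supported, one possible workaround is to resolve the profile yourself and pass the keys explicitly. The sketch below is only an illustration, assuming the profile stores static keys in an INI-style ~/.aws/credentials file. The helper name storage_options_from_profile and its parameters are hypothetical and not part of polars or object_store. The returned keys (aws_access_key_id, aws_secret_access_key, aws_session_token) are taken from the AmazonS3ConfigKey documentation linked above.

```python
import configparser
import os
from pathlib import Path


def storage_options_from_profile(profile=None, credentials_path=None):
    """Build a storage_options dict from a named profile in an AWS credentials file.

    Hypothetical helper for illustration; not part of polars or object_store.
    """
    # Explicit argument wins, then AWS_PROFILE, then the conventional "default".
    profile = profile or os.environ.get("AWS_PROFILE", "default")
    path = Path(credentials_path or Path.home() / ".aws" / "credentials")

    parser = configparser.ConfigParser()
    # ConfigParser.read returns the list of files it parsed; empty means unreadable.
    if not parser.read(path):
        raise FileNotFoundError(f"could not read credentials file: {path}")
    if not parser.has_section(profile):
        raise KeyError(f"profile {profile!r} not found in {path}")

    section = parser[profile]
    options = {
        "aws_access_key_id": section["aws_access_key_id"],
        "aws_secret_access_key": section["aws_secret_access_key"],
    }
    # A session token is only present for temporary credentials.
    if "aws_session_token" in section:
        options["aws_session_token"] = section["aws_session_token"]
    return options


# Usage (assumes polars is installed and the bucket exists):
# import polars as pl
# df = pl.read_parquet(
#     "s3://my-bucket/my-parquet/*.parquet",
#     storage_options=storage_options_from_profile("my-profile"),
# )
```

This only covers static keys. Settings kept in ~/.aws/config, such as a region or a custom endpoint, would have to be read separately, and profiles that rely on SSO or role assumption are out of scope for this sketch.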