pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

`AWS_PROFILE` should be supported in cloud storage I/O config #18757

Open · hutch3232 opened 2 months ago

hutch3232 commented 2 months ago

Description

I have a variety of AWS/S3 profiles in my ~/.aws/credentials and ~/.aws/config files. I'd like to be able to select one, either explicitly by passing profile into storage_options or implicitly by setting the AWS_PROFILE environment variable, so that I can be sure the appropriate bucket keys, endpoint, and other configuration are used.
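For illustration, a credentials file with multiple profiles might look like this (profile names and values are hypothetical):

[default]
aws_access_key_id = ...
aws_secret_access_key = ...

[my-profile]
aws_access_key_id = ...
aws_secret_access_key = ...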

I see that profile is not listed as a supported option here: https://docs.rs/object_store/latest/object_store/aws/enum.AmazonS3ConfigKey.html

polars seems to use the first profile listed in those ~/.aws files, even if that profile is not named 'default'. By ensuring the relevant profile was listed first, pl.read_parquet("s3://my-bucket/my-parquet/*.parquet") would work, but being order-dependent is confusing and does not scale.

import polars as pl

pl.read_parquet("s3://my-bucket/my-parquet/*.parquet",
                storage_options={"profile": "my-profile"})

---------------------------------------------------------------------------
ComputeError                              Traceback (most recent call last)
Cell In[9], line 1
----> 1 pl.read_parquet("s3://my-bucket/my-parquet/*.parquet",
      2                 storage_options={"profile": "my-profile"})

File /opt/conda/lib/python3.9/site-packages/polars/_utils/deprecation.py:91, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
     86 @wraps(function)
     87 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
     88     _rename_keyword_argument(
     89         old_name, new_name, kwargs, function.__qualname__, version
     90     )
---> 91     return function(*args, **kwargs)

File /opt/conda/lib/python3.9/site-packages/polars/_utils/deprecation.py:91, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
     86 @wraps(function)
     87 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
     88     _rename_keyword_argument(
     89         old_name, new_name, kwargs, function.__qualname__, version
     90     )
---> 91     return function(*args, **kwargs)

File /opt/conda/lib/python3.9/site-packages/polars/io/parquet/functions.py:184, in read_parquet(source, columns, n_rows, row_index_name, row_index_offset, parallel, use_statistics, hive_partitioning, glob, hive_schema, try_parse_hive_dates, rechunk, low_memory, storage_options, retries, use_pyarrow, pyarrow_options, memory_map)
    181     source = [io.BytesIO(s) for s in source]  # type: ignore[arg-type, assignment]
    183 # For other inputs, defer to `scan_parquet`
--> 184 lf = scan_parquet(
    185     source,  # type: ignore[arg-type]
    186     n_rows=n_rows,
    187     row_index_name=row_index_name,
    188     row_index_offset=row_index_offset,
    189     parallel=parallel,
    190     use_statistics=use_statistics,
    191     hive_partitioning=hive_partitioning,
    192     hive_schema=hive_schema,
    193     try_parse_hive_dates=try_parse_hive_dates,
    194     rechunk=rechunk,
    195     low_memory=low_memory,
    196     cache=False,
    197     storage_options=storage_options,
    198     retries=retries,
    199     glob=glob,
    200     include_file_paths=None,
    201 )
    203 if columns is not None:
    204     if is_int_sequence(columns):

File /opt/conda/lib/python3.9/site-packages/polars/_utils/deprecation.py:91, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
     86 @wraps(function)
     87 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
     88     _rename_keyword_argument(
     89         old_name, new_name, kwargs, function.__qualname__, version
     90     )
---> 91     return function(*args, **kwargs)

File /opt/conda/lib/python3.9/site-packages/polars/_utils/deprecation.py:91, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
     86 @wraps(function)
     87 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
     88     _rename_keyword_argument(
     89         old_name, new_name, kwargs, function.__qualname__, version
     90     )
---> 91     return function(*args, **kwargs)

File /opt/conda/lib/python3.9/site-packages/polars/io/parquet/functions.py:425, in scan_parquet(source, n_rows, row_index_name, row_index_offset, parallel, use_statistics, hive_partitioning, glob, hive_schema, try_parse_hive_dates, rechunk, low_memory, cache, storage_options, retries, include_file_paths)
    420 elif is_path_or_str_sequence(source):
    421     source = [
    422         normalize_filepath(source, check_not_directory=False) for source in source
    423     ]
--> 425 return _scan_parquet_impl(
    426     source,  # type: ignore[arg-type]
    427     n_rows=n_rows,
    428     cache=cache,
    429     parallel=parallel,
    430     rechunk=rechunk,
    431     row_index_name=row_index_name,
    432     row_index_offset=row_index_offset,
    433     storage_options=storage_options,
    434     low_memory=low_memory,
    435     use_statistics=use_statistics,
    436     hive_partitioning=hive_partitioning,
    437     hive_schema=hive_schema,
    438     try_parse_hive_dates=try_parse_hive_dates,
    439     retries=retries,
    440     glob=glob,
    441     include_file_paths=include_file_paths,
    442 )

File /opt/conda/lib/python3.9/site-packages/polars/io/parquet/functions.py:476, in _scan_parquet_impl(source, n_rows, cache, parallel, rechunk, row_index_name, row_index_offset, storage_options, low_memory, use_statistics, hive_partitioning, glob, hive_schema, try_parse_hive_dates, retries, include_file_paths)
    472 else:
    473     # Handle empty dict input
    474     storage_options = None
--> 476 pylf = PyLazyFrame.new_from_parquet(
    477     source,
    478     sources,
    479     n_rows,
    480     cache,
    481     parallel,
    482     rechunk,
    483     parse_row_index_args(row_index_name, row_index_offset),
    484     low_memory,
    485     cloud_options=storage_options,
    486     use_statistics=use_statistics,
    487     hive_partitioning=hive_partitioning,
    488     hive_schema=hive_schema,
    489     try_parse_hive_dates=try_parse_hive_dates,
    490     retries=retries,
    491     glob=glob,
    492     include_file_paths=include_file_paths,
    493 )
    494 return wrap_ldf(pylf)

ComputeError: unknown configuration key: profile

FWIW, this functionality exists in pandas. I'm hoping to migrate code to polars, but this is essentially a blocker.
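For comparison, a sketch of the pandas equivalent, which hands storage_options to fsspec/s3fs and therefore honors profile (bucket and profile names are hypothetical):

import pandas as pd

# pandas passes storage_options through to fsspec; s3fs accepts a
# "profile" key. Bucket and profile names are hypothetical.
df = pd.read_parquet(
    "s3://my-bucket/my-parquet/",
    storage_options={"profile": "my-profile"},
)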

avimallu commented 2 months ago

I doubt Polars has control over object_store feature additions. I suggest you raise this request in their repo.

hutch3232 commented 2 months ago

Oh, I somehow didn't realize they were separate libraries. Looks like AWS_PROFILE used to be experimentally supported in object_store, but that support was dropped. Bummer.

https://github.com/apache/arrow-rs/pull/4238 https://github.com/apache/arrow-rs/issues/4556

stevenmanton commented 2 months ago

Yikes. It looks like there's no easy way to get AWS profile support in polars, then. That's a significant gap in the object_store package. The only workaround I've found is pl.read_parquet(..., use_pyarrow=True).
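Another possible workaround is to bypass object_store entirely and read through s3fs and pyarrow yourself; a minimal sketch, assuming hypothetical bucket and profile names:

import polars as pl
import pyarrow.dataset as ds
import s3fs

# s3fs honors AWS profiles directly, and pyarrow's dataset API accepts
# fsspec filesystems. Bucket and profile names are hypothetical.
fs = s3fs.S3FileSystem(profile="my-profile")
dataset = ds.dataset("my-bucket/my-parquet/", format="parquet", filesystem=fs)
df = pl.from_arrow(dataset.to_table())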

tustvold commented 1 month ago

:wave: object_store maintainer here. The major challenge with supporting AWS_PROFILE is the sheer scope of such an initiative; even the official Rust AWS SDK continues to have issues in this space (https://github.com/awslabs/aws-sdk-rust/issues/1193). Whilst we did at one point support AWS_PROFILE in object_store, it was tacked on and led to surprising inconsistencies for users, as only some of the configuration would be respected. We do not use the vendor SDKs, as this allows for a more consistent experience across stores (AWS is the only store with an official Rust SDK), along with a significantly smaller dependency footprint. There is more information in https://github.com/apache/arrow-rs/issues/2176.

Support for AWS_PROFILE was therefore removed and replaced with a more flexible API that allows users and system integrators to configure how credentials are sourced from their environment. I have filed https://github.com/pola-rs/polars/issues/18979 to suggest exposing this in polars.
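In the meantime, a possible stopgap is to resolve the profile outside of polars and pass static credentials through storage_options; a minimal sketch using boto3, with hypothetical profile and bucket names:

import boto3
import polars as pl

# Resolve the profile with boto3, then hand the resulting static
# credentials to polars/object_store. Note these are snapshot
# credentials: temporary ones will not auto-refresh.
session = boto3.Session(profile_name="my-profile")
creds = session.get_credentials().get_frozen_credentials()

storage_options = {
    "aws_access_key_id": creds.access_key,
    "aws_secret_access_key": creds.secret_key,
}
if creds.token:  # only present for temporary/session credentials
    storage_options["aws_session_token"] = creds.token
if session.region_name:
    storage_options["aws_region"] = session.region_name

df = pl.read_parquet(
    "s3://my-bucket/my-parquet/*.parquet",
    storage_options=storage_options,
)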

Edit: As an aside, I would strongly encourage using aws-vault to generate session credentials: not only does it avoid this class of issue, it also avoids storing credentials in plain text on the filesystem and relying on individual apps/tools to use the correct profile.

hutch3232 commented 1 month ago

One interesting thing I just realized is that pl.read_csv actually accepts "profile" in storage_options. That's surprising, considering pl.read_parquet does not.

Edit: tested on polars 1.8.2.
Edit 2: in fact, pl.read_csv can pick up AWS_PROFILE and even AWS_ENDPOINT_URL (see: #18758).
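For reference, a sketch of the call that works, presumably because read_csv routes remote paths through fsspec rather than object_store (bucket, file, and profile names are hypothetical):

import polars as pl

# Accepted by pl.read_csv on polars 1.8.2; bucket and profile names
# are hypothetical. The same storage_options fail for read_parquet.
df = pl.read_csv(
    "s3://my-bucket/my-file.csv",
    storage_options={"profile": "my-profile"},
)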