pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Examples of using `scan_csv` with cloud URIs #18201

Open bn-c opened 2 months ago

bn-c commented 2 months ago

Description

My current code:

import polars as pl

df1 = pl.scan_csv("s3://bucket/test.csv.gz")
df = df1.fetch()

Run environment: AWS Glue Python Shell.
The code above fails with this error:

ComputeError: Generic S3 error: Client error with status 404 Not Found: <h1>404 Not Found</h1>No context found for request

Link

https://docs.pola.rs/api/python/stable/reference/api/polars.scan_csv.html

bn-c commented 2 months ago

Actually, the error is resolved by the workaround in https://github.com/pola-rs/polars/issues/11992#issuecomment-1779105501

import polars as pl
import boto3

# Resolve the credentials of the current AWS session (here, the Glue job's role).
session = boto3.session.Session()
credentials = session.get_credentials()

# Pass the credentials to scan_csv explicitly via storage_options.
df1 = pl.scan_csv(
    "s3://aws-glue-l1-211125705611-ap-northeast-1/test-out-pl.csv.gz",
    storage_options={
        "aws_access_key_id": credentials.access_key,
        "aws_secret_access_key": credentials.secret_key,
        "region": "ap-northeast-1",
        "session_token": credentials.token,
    },
)

df1.fetch()

It is a bit weird that read_csv picks up the current session by default and scan_csv doesn't. IMO this case should either be documented, or fixed so that read_csv and scan_csv behave more like each other.
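For anyone reusing the workaround above, here is a minimal sketch of wrapping it in a helper. The function name `storage_options_from_credentials` is hypothetical and not part of Polars or boto3. It assumes a credentials object exposing `access_key`, `secret_key`, and `token`, like the one returned by `get_credentials()` in the snippet above. A stand-in object is used here so the example runs without AWS access.

```python
from types import SimpleNamespace


def storage_options_from_credentials(credentials, region):
    """Build a storage_options dict from a credentials-like object.

    Hypothetical helper: `credentials` is assumed to expose `access_key`,
    `secret_key`, and `token`, as in the workaround above.
    """
    if credentials is None:
        # get_credentials() can return None when nothing is configured.
        raise ValueError("no AWS credentials were found for this session")
    options = {
        "aws_access_key_id": credentials.access_key,
        "aws_secret_access_key": credentials.secret_key,
        "region": region,
    }
    # Long-lived keys carry no session token; only temporary credentials do.
    if credentials.token:
        options["session_token"] = credentials.token
    return options


# Stand-in credentials for illustration only; in practice pass
# boto3.session.Session().get_credentials() instead.
fake = SimpleNamespace(access_key="AKIAEXAMPLE", secret_key="secret", token=None)
options = storage_options_from_credentials(fake, "ap-northeast-1")
print(options)
```

The resulting dict can then be passed as `storage_options=options` to `pl.scan_csv`. Leaving out the token when it is empty avoids sending a `None` value for sessions that use long-lived keys.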

ritchie46 commented 2 months ago

@nameexhaustion could we find the session by default? Is this something that's possible on the rust side?

nameexhaustion commented 2 months ago

> @nameexhaustion could we find the session by default? Is this something that's possible on the rust side?

It should be possible, at least for S3. We can look into how the Python libraries source it and do the same thing. I suspect it's just read from a configuration file somewhere in the user's home directory.
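To illustrate the configuration-file idea, here is a rough sketch that reads the AWS shared credentials file, an INI file conventionally located at `~/.aws/credentials`. The function name `read_shared_credentials` is hypothetical, and this is not how Polars or its Rust backend resolves credentials. The Python SDK also consults other sources such as environment variables and role-based providers, so in an environment like AWS Glue a file lookup alone might not cover the case reported here.

```python
import configparser
import tempfile
from pathlib import Path

# Keys commonly found in a profile of the shared credentials file.
CREDENTIAL_KEYS = (
    "aws_access_key_id",
    "aws_secret_access_key",
    "aws_session_token",
)


def read_shared_credentials(path, profile="default"):
    """Return the credential keys found for `profile`, or {} if none exist."""
    parser = configparser.ConfigParser()
    parser.read(path)  # a missing file is ignored and yields no sections
    if not parser.has_section(profile):
        return {}
    section = parser[profile]
    return {key: section[key] for key in CREDENTIAL_KEYS if key in section}


# Demonstrate on a temporary file instead of the real ~/.aws/credentials.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "credentials"
    sample.write_text(
        "[default]\n"
        "aws_access_key_id = AKIAEXAMPLE\n"
        "aws_secret_access_key = example-secret\n"
    )
    found = read_shared_credentials(sample)

print(sorted(found))
```

Against a real setup, the call would be `read_shared_credentials(Path.home() / ".aws" / "credentials")`, assuming the file exists at its conventional location.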