pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Reading from S3 compatible storage #18802

Closed robertdj closed 1 month ago

robertdj commented 1 month ago

Description

I'm trying to use Polars to read a Parquet file stored in DigitalOcean Spaces, which is an S3-compatible storage service. Access works with the boto3 package, but I can't make it work with Polars.

I have set aws_access_key_id and aws_secret_access_key in ~/.aws/credentials, and I can list the buckets with boto3:

import polars as pl

import boto3

session = boto3.Session()
client = session.client(
    "s3",
    region_name="fra1",
    endpoint_url="https://fra1.digitaloceanspaces.com",
)

client.list_buckets()

Note that the endpoint_url is specified.

In Spaces I have a bucket called mybucket containing a file called test.parquet. (Apparently aws_region should be set to us-east-1 for DigitalOcean.)

storage_options = {
    "aws_access_key_id": aws_access_key_id,
    "aws_secret_access_key": aws_secret_access_key,
    "aws_region": "us-east-1",
}
source = "s3://mybucket/test.parquet"
pl.read_parquet(source, storage_options=storage_options)

I get an error:

ComputeError: Generic S3 error: Client error with status 403 Forbidden: No Body

If I spell out the full host name in the source path, as in

source = "s3://cache.fra1.digitaloceanspaces.com/mybucket/test.parquet"

I get a different error, which suggests that the endpoint is hard-coded to s3.amazonaws.com:

ComputeError: error sending request for url (https://fra1.digitaloceanspaces.com.s3.amazonaws.com/)

robertdj commented 1 month ago

It turns out that I can make this work if I use PyArrow:

import pyarrow.dataset as ds
import pyarrow.fs as fs

pyfs = fs.S3FileSystem(endpoint_override="https://fra1.digitaloceanspaces.com")
pyds = ds.dataset(source="mybucket/test.parquet", filesystem=pyfs, format="parquet")
df = pl.scan_pyarrow_dataset(pyds).collect()

But it would be nice if it worked directly with Polars :-)

Object905 commented 1 month ago

There should be an aws_endpoint_url key in storage_options, set to your custom endpoint. Works for me.
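
For reference, a minimal sketch of what that could look like for the setup described in this issue. The helper names spaces_storage_options and read_spaces_parquet are hypothetical, made up for illustration; the endpoint, region, bucket, and file names are taken from the original post, and the credentials are placeholders.

```python
def spaces_storage_options(access_key_id, secret_access_key, region="fra1"):
    """Build a storage_options dict for a DigitalOcean Space (hypothetical helper)."""
    return {
        "aws_access_key_id": access_key_id,
        "aws_secret_access_key": secret_access_key,
        # The original post reports that DigitalOcean expects us-east-1 here.
        "aws_region": "us-east-1",
        # The custom endpoint replaces the default s3.amazonaws.com host.
        "aws_endpoint_url": f"https://{region}.digitaloceanspaces.com",
    }


def read_spaces_parquet(bucket, key, storage_options):
    """Read one Parquet object from the Space into a Polars DataFrame."""
    # Imported lazily so the options helper can be used without Polars installed.
    import polars as pl

    return pl.read_parquet(f"s3://{bucket}/{key}", storage_options=storage_options)


# Example with placeholder credentials (not real keys):
options = spaces_storage_options("MY_ACCESS_KEY_ID", "MY_SECRET_ACCESS_KEY")
# df = read_spaces_parquet("mybucket", "test.parquet", options)
```

The source path stays in the plain s3://bucket/key form; only the endpoint changes, so the host name does not need to be embedded in the path.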

robertdj commented 1 month ago

Works great, thanks!