vega / vegafusion

Serverside scaling for Vega and Altair visualizations
https://vegafusion.io
BSD 3-Clause "New" or "Revised" License

Add support for loading files from s3 #416

Closed: jonmmease closed this issue 7 months ago

jonmmease commented 8 months ago

VegaFusion could support loading files from S3-compatible object storage.

The DataFusion connection would use the object_store crate as in https://docs.rs/object_store/latest/object_store/aws/struct.AmazonS3Builder.html and https://github.com/apache/arrow-datafusion/blob/main/datafusion-examples/examples/query-http-csv.rs.

The DuckDB connection would use the httpfs extension as in https://duckdb.org/docs/guides/import/s3_import.html.
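As a rough sketch of the DuckDB side, the statements below follow the pattern in the linked httpfs guide. The helper names (`sql_quote`, `httpfs_setup_statements`, `scan_query`) and their parameters are hypothetical and not part of VegaFusion; the code only builds SQL strings that a DuckDB connection would then execute.

```python
def sql_quote(value: str) -> str:
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def httpfs_setup_statements(region: str, access_key_id: str, secret_access_key: str) -> list:
    """Build the statements that enable S3 access via DuckDB's httpfs extension."""
    return [
        "INSTALL httpfs;",
        "LOAD httpfs;",
        f"SET s3_region={sql_quote(region)};",
        f"SET s3_access_key_id={sql_quote(access_key_id)};",
        f"SET s3_secret_access_key={sql_quote(secret_access_key)};",
    ]


def scan_query(url: str) -> str:
    """Pick a DuckDB table function from the file extension (assumption: CSV or Parquet only)."""
    reader = "read_parquet" if url.lower().endswith(".parquet") else "read_csv_auto"
    return f"SELECT * FROM {reader}({sql_quote(url)});"
```

Building the statements separately from executing them keeps the sketch testable without a DuckDB installation or network access.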

In both cases, AWS credentials would be loaded from the standard environment variables, and the Vega spec could contain S3 URLs of the form s3://<bucket>/<path>.
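To make the URL shape and the credential lookup concrete, here is a minimal sketch. The names `S3Location`, `parse_s3_url`, and `credentials_from_env` are hypothetical; the environment variable names are the standard AWS ones, but which of them VegaFusion would actually read is an assumption.

```python
import os
from typing import Dict, Mapping, NamedTuple
from urllib.parse import urlparse


class S3Location(NamedTuple):
    bucket: str
    key: str


def parse_s3_url(url: str) -> S3Location:
    """Split s3://<bucket>/<path> into its bucket and object key."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3 URL: {url!r}")
    return S3Location(bucket=parsed.netloc, key=parsed.path.lstrip("/"))


def credentials_from_env(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect whichever standard AWS variables are set (assumed subset)."""
    names = (
        "AWS_ACCESS_KEY_ID",
        "AWS_SECRET_ACCESS_KEY",
        "AWS_SESSION_TOKEN",
        "AWS_DEFAULT_REGION",
    )
    return {name: env[name] for name in names if name in env}
```

For example, `parse_s3_url("s3://my-bucket/data/cars.csv")` yields the bucket `my-bucket` and the key `data/cars.csv`.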

This would work for VegaFusion server as well as VegaFusion Python.

In terms of implementation, the Connection.scan_* methods would need special handling for S3 URLs, somewhere around the following location for scan_csv:

https://github.com/hex-inc/vegafusion/blob/6d352b78df1a2fca4db0b9a29aae2e5283df9a43/vegafusion-sql/src/connection/datafusion_conn.rs#L144
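The actual change would live in the Rust code linked above. The Python below only illustrates the kind of branching that "special handling" implies, and `classify_source` is a hypothetical name, not an existing VegaFusion function.

```python
from urllib.parse import urlparse


def classify_source(url: str) -> str:
    """Decide which loading path a scan_* method would take for a given URL."""
    scheme = urlparse(url).scheme.lower()
    if scheme == "s3":
        # Would go through an object store registered for the bucket.
        return "s3"
    if scheme in ("http", "https"):
        # Would be fetched over HTTP.
        return "http"
    # Anything else is treated as a local file path in this sketch.
    return "local"
```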

Cross-reference https://github.com/hex-inc/vegafusion/issues/87 for adding scan_parquet support; Parquet is particularly well suited to storage on S3.

jonmmease commented 8 months ago

We should be able to test this locally and on CI with MinIO, which is available from conda-forge as minio-server.
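For the DuckDB connection, pointing httpfs at a local MinIO server would likely need a few extra settings. The sketch below assumes MinIO is listening on its default port 9000 without TLS; `minio_settings` is a hypothetical helper that only builds the SQL strings.

```python
def minio_settings(endpoint: str = "localhost:9000") -> list:
    """Build DuckDB settings for a local, non-TLS, path-style S3 endpoint."""
    return [
        f"SET s3_endpoint='{endpoint}';",
        # MinIO serves buckets as URL path segments rather than subdomains.
        "SET s3_url_style='path';",
        # Assumption: the test server runs over plain HTTP.
        "SET s3_use_ssl=false;",
    ]
```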

jonmmease commented 8 months ago

A partial implementation is in progress in https://github.com/hex-inc/vegafusion/pull/417. It does not include the DuckDB connection support discussed above.

kszlim commented 7 months ago

Would be cool to support Polars too, via scan_parquet.

jonmmease commented 7 months ago

Polars support is possible, but would be a pretty big project. I'm hoping we'll eventually get support for Ibis, which can wrap Polars along with a bunch of other backends.

jonmmease commented 7 months ago

DuckDB Parquet + S3 support was added in 1.5.0.