jonmmease closed this 7 months ago
We should be able to test this locally and on CI with minio, which is available from conda-forge as minio-server
Partial implementation in progress in https://github.com/hex-inc/vegafusion/pull/417. This does not include the duckdb connection support discussed above
Would be cool to support polars too via scan_parquet
Polars support is possible, but would be a pretty big project. I'm hoping we'll eventually get support for Ibis, which can wrap Polars along with a bunch of other backends.
DuckDB parquet + s3 support was added in 1.5.0.
VegaFusion could support loading files from s3-compatible object storage.
The DataFusion connection would use the `object_store` crate, as in https://docs.rs/object_store/latest/object_store/aws/struct.AmazonS3Builder.html and https://github.com/apache/arrow-datafusion/blob/main/datafusion-examples/examples/query-http-csv.rs. The DuckDB connection would use the httpfs extension, as in https://duckdb.org/docs/guides/import/s3_import.html.
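For the DuckDB side, here is a minimal sketch of how the standard AWS environment variables might be turned into httpfs setup statements. The helper name `httpfs_setup_statements` and the exact variable-to-setting mapping are assumptions for illustration, not VegaFusion's implementation. The function only builds SQL strings, so it runs without DuckDB installed.

```python
import os
from typing import List, Mapping

# Assumed mapping from standard AWS environment variables to DuckDB httpfs settings.
_ENV_TO_SETTING = {
    "AWS_DEFAULT_REGION": "s3_region",
    "AWS_ACCESS_KEY_ID": "s3_access_key_id",
    "AWS_SECRET_ACCESS_KEY": "s3_secret_access_key",
    "AWS_SESSION_TOKEN": "s3_session_token",
}


def httpfs_setup_statements(env: Mapping[str, str] = os.environ) -> List[str]:
    """Build the SQL statements that enable httpfs and configure S3 access.

    Hypothetical helper: variables that are unset or empty are skipped.
    """
    statements = ["INSTALL httpfs", "LOAD httpfs"]
    for env_name, setting in _ENV_TO_SETTING.items():
        value = env.get(env_name)
        if value:
            # Double single quotes so the value is a valid SQL string literal.
            escaped = value.replace("'", "''")
            statements.append(f"SET {setting}='{escaped}'")
    return statements
```

Each statement could then be passed to a DuckDB connection's `execute` method before querying an `s3://` URL. For a local MinIO test setup, additional settings such as `s3_endpoint` would likely be needed as well.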
In both cases, AWS credentials would be loaded from the standard environment variables, and the Vega spec could contain s3 URLs like `s3://<bucket>/<path>`. This would work for VegaFusion server as well as VegaFusion Python.
In terms of implementation, the `Connection.scan_*` methods would need special handling for s3 URLs, somewhere around here for `scan_csv`: https://github.com/hex-inc/vegafusion/blob/6d352b78df1a2fca4db0b9a29aae2e5283df9a43/vegafusion-sql/src/connection/datafusion_conn.rs#L144
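The kind of special handling described here could look roughly like the sketch below, written in Python for brevity even though the linked connection code is Rust. It detects an `s3://<bucket>/<path>` URL, splits it into bucket and key, and chooses a scan route. `S3Location`, `parse_s3_url`, and `scan_route` are hypothetical names, not part of VegaFusion.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlparse


class S3Location(NamedTuple):
    bucket: str
    key: str


def parse_s3_url(url: str) -> Optional[S3Location]:
    """Return the bucket and key for an s3:// URL, or None for any other URL."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        return None
    # The bucket is the URL authority; the key is the path without its leading slash.
    return S3Location(bucket=parsed.netloc, key=parsed.path.lstrip("/"))


def scan_route(url: str) -> str:
    """Pick which (hypothetical) scan path a URL should take."""
    if parse_s3_url(url) is not None:
        return "s3"
    if urlparse(url).scheme in ("http", "https"):
        return "http"
    # Anything else is treated as a local file path.
    return "local"
```

For example, `parse_s3_url("s3://my-bucket/data/file.csv")` yields the bucket `my-bucket` and the key `data/file.csv`, which is the pair an object store client would need. Keys containing `?` or `#` would need extra care, since `urlparse` splits those off as query and fragment.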
Cross reference https://github.com/hex-inc/vegafusion/issues/87 for adding `scan_parquet` support, which is particularly well suited for storage on s3.