uwdata / mosaic

An extensible framework for linking databases and interactive views.
https://idl.uw.edu/mosaic/
Other
862 stars 57 forks source link

Support remote/httpfs URLs in the `from` field #567

Open alexkreidler opened 3 weeks ago

alexkreidler commented 3 weeks ago

DuckDB's HTTPFS feature, which can read parquet, csv, json, and other files on HTTP servers or cloud object storage, is an incredibly powerful tool that allows the query engine to use range reads to push down queries on parquet (and use its builtin statistics) to limit the amount of data transferred over the network. This helps DuckDB run queries really quickly even over files that might be too large to load into DuckDB WASM's memory.

When I tried this spec in Mosaic Playground:

{
    "plot": [
        {
            "mark": "lineY",
            "data": {
                "from": "read_parquet('https://f005.backblazeb2.com/file/alk-data/courtlistener/2024-10-27/opinion-clusters-2024-09-30.parquet')"
            },
            "x": "file",
            "y": "Close"
        }
    ],
    "width": 680,
    "height": 200
}

Mosaic created this query:

DESCRIBE SELECT "Date" AS "col0", "Close" AS "col1" FROM "read_parquet('https://f005/"."backblazeb2"."com/file/alk-data/courtlistener/2024-10-27/opinion-clusters-2024-09-30"."parquet')" AS "source"

And when I changed it to remove the read_parquet function I got

DESCRIBE SELECT "Date" AS "col0", "Close" AS "col1" FROM "https://f005/"."backblazeb2"."com/file/alk-data/courtlistener/2024-10-27/opinion-clusters-2024-09-30"."parquet" AS "source"

It would be great to add some logic to detect https:// and http:// strings (and maybe s3:// and hf:// which are also supported by the httpfs extension) in the from field, and output them directly into the output SQL. https://github.com/uwdata/mosaic/blob/56756b02b2bbab0085e412120a96ded73cd54e46/packages/sql/src/Query.js#L158-L185

And to add docs/examples for mosaic-sql, vgplot, and mosaic-spec.

jheer commented 3 weeks ago

Right now Mosaic assumes you first create a named table or view (as an exec call), which helps with reuse and ensures it is only loaded once. HTTPFS is of course supported for this.

domoritz commented 3 weeks ago

It still seems like we are unnecessarily and incorrectly splitting the string at the . here because we assume the user meant to access a table in a schema. The workaround is to create a view but we can probably be a bit more robust here so it's not required. IIRC duckdb does cache the file reads.

jheer commented 3 weeks ago

It still seems like we are unnecessarily and incorrectly splitting the string at the . here because we assume the user meant to access a table in a schema. The workaround is to create a view but we can probably be a bit more robust here so it's not required. IIRC duckdb does cache the file reads.

It's not "incorrect", in the sense that the method's proper argument is a relation reference (possibly including database and schema names). It's not intended to take an arbitrary expression. However, this is certainly something we can revisit. When my schedule allows I'm hoping to make a major revision of the mosaic-sql helper library, at which point I should be able to reconsider table references more generally.

alexkreidler commented 3 weeks ago

@jheer @domoritz I'll test out the view workaround, thank you!

@jheer I'm excited for those mosaic-sql updates, but I'm wondering if there's a way to make a smaller change for HTTPFS.

Right now Mosaic assumes you first create a named table or view (as an exec call), which helps with reuse and ensures it is only loaded once.

Would it break any other parts of Mosaic if mosaic-sql started generating queries that referenced HTTPFS URLs?

If not, would you be open to a PR to update Query.from to detect https:// and http:// strings? It could put them into the list and then query.from arrays so those strings are outputted directly into SQL. This would introduce a slight dependence on DuckDB making #399 more difficult.

Another option could be to change mosaic-spec PlotData type so it supports SQL expressions similar to those in encoding channels using the sql field (Expression.ts), so my read_parquet example would get handled in the isQuery(e) || isSQLExpression(e) branch of Query.from. https://github.com/uwdata/mosaic/blob/56756b02b2bbab0085e412120a96ded73cd54e46/packages/spec/src/spec/PlotFrom.ts#L10-L20 But the additional complexity might not be worth adding.