Open alexkreidler opened 3 weeks ago
Right now Mosaic assumes you first create a named table or view (as an exec call), which helps with reuse and ensures it is only loaded once. HTTPFS is of course supported for this.
It still seems like we are unnecessarily and incorrectly splitting the string at the . here because we assume the user meant to access a table in a schema. The workaround is to create a view but we can probably be a bit more robust here so it's not required. IIRC duckdb does cache the file reads.
It still seems like we are unnecessarily and incorrectly splitting the string at the . here because we assume the user meant to access a table in a schema. The workaround is to create a view but we can probably be a bit more robust here so it's not required. IIRC duckdb does cache the file reads.
It's not "incorrect", in the sense that the method's proper argument is a relation reference (possibly including database and schema names). It's not intended to take an arbitrary expression. However, this is certainly something we can revisit. When my schedule allows I'm hoping to make a major revision of the mosaic-sql helper library, at which point I should be able to reconsider table references more generally.
@jheer @domoritz I'll test out the view workaround, thank you!
@jheer I'm excited for those mosaic-sql updates, but I'm wondering if there's a way to make a smaller change for HTTPFS.
Right now Mosaic assumes you first create a named table or view (as an exec call), which helps with reuse and ensures it is only loaded once.
Would it break any other parts of Mosaic if mosaic-sql started generating queries that referenced HTTPFS URLs?
If not, would you be open to a PR to update Query.from
to detect https:// and http:// strings? It could put them into the list
and then query.from
arrays so those strings are outputted directly into SQL. This would introduce a slight dependence on DuckDB making #399 more difficult.
Another option could be to change mosaic-spec PlotData type so it supports SQL expressions similar to those in encoding channels using the sql
field (Expression.ts), so my read_parquet
example would get handled in the isQuery(e) || isSQLExpression(e)
branch of Query.from
.
https://github.com/uwdata/mosaic/blob/56756b02b2bbab0085e412120a96ded73cd54e46/packages/spec/src/spec/PlotFrom.ts#L10-L20
But the additional complexity might not be worth adding.
DuckDB's HTTPFS feature, which can read parquet, csv, json, and other files on HTTP servers or cloud object storage, is an incredibly powerful tool that allows the query engine to use range reads to push down queries on parquet (and use its builtin statistics) to limit the amount of data transferred over the network. This helps DuckDB run queries really quickly even over files that might be too large to load into DuckDB WASM's memory.
When I tried this spec in Mosaic Playground:
Mosaic created this query:
And when I changed it to remove the read_parquet function I got
It would be great to add some logic to detect
https://
andhttp://
strings (and maybes3://
andhf://
which are also supported by the httpfs extension) in the from field, and output them directly into the output SQL. https://github.com/uwdata/mosaic/blob/56756b02b2bbab0085e412120a96ded73cd54e46/packages/sql/src/Query.js#L158-L185And to add docs/examples for mosaic-sql, vgplot, and mosaic-spec.