uwdata / mosaic

An extensible framework for linking databases and interactive views.
https://idl.uw.edu/mosaic
Other
755 stars 46 forks source link

Support of registered files (DuckDB-WASM) in vgplot #130

Closed spren9er closed 1 week ago

spren9er commented 1 year ago

Data loading helpers create DuckDB tables/views from parquet, JSON or CSV files. Unfortunately, this process can take a while when data is big. Sometimes — in DuckDB-WASM — it is nicer to register files and run queries directly on them (see registerFileURL). Query performance on DuckDB tables is optimal, but queries on parquet files are fast, too.

Here, we try to use a registered CSV file (via HTTP) in vgplot:

import * as vg from "@uwdata/vgplot";

import { DuckDBDataProtocol } from '@duckdb/duckdb-wasm';

const wasm = await vg.wasmConnector();

wasm.db.registerFileURL(
  'athletes.csv',
  'https://uwdata.github.io/mosaic-datasets/data/athletes.csv',
  DuckDBDataProtocol.HTTP,
  true,
);

vg.coordinator().databaseConnector(wasm);

const $query = vg.Selection.intersect();

export default vg.vconcat(
  vg.menu({
    label: 'Sport',
    as: $query,
    from: 'athletes.csv',
    column: 'sport',
    value: 'aquatics'
  }),
  vg.vspace(10),
  vg.plot(
    vg.barX(vg.from('athletes.csv', { filterBy: $query }), {
      x: vg.sum('gold'),
      y: 'nationality',
      fill: 'steelblue',
      sort: { y: '-x', limit: 10 },
    }),
    vg.xLabel('Gold Medals'),
    vg.yLabel('Nationality'),
    vg.yLabelAnchor('top'),
    vg.marginTop(15)
  )
);

Note, that references of mosaic clients point now to registered file athletes.csv, not a DuckDB table anymore.

This works! However, currently not all marks from vgplot are supported with registered files. Connected marks (e.g. area, line) don't work. When replacing above mark barX by

vg.lineY(vg.from('athletes.csv', { filterBy: $query }), {
  x: 'nationality',
  y: vg.sum('gold')
})

an error is raised:

Coordinator.js:145 Uncaught (in promise) Error: Catalog Error: Table with name csv does not exist!

I think the reason is probably due to some preprocessing of table name, see here. This logic conflicts with registered file name. I tried changing name of registered file athletes.csv, but if I omit .csv suffix, then DuckDB-WASM interprets reference as table.

It would be great if vgplot fully works with registered files (via HTTP(S), S3, local storage, etc.).

jheer commented 1 year ago

I won’t be able to take a closer look at this until next week (the issue is that Mosaic interprets the table name as schema.table), but in the meantime here are two ideas:

spren9er commented 1 year ago

Thanks, first workaround (creating a view for it) worked out.

domoritz commented 1 week ago

Since views work well for this. I think we also improved escaping of table names in the meantime. I'll close this for now but if you find other issues, we can reopen or log more specific issues.

domoritz commented 1 week ago

Quoting is in https://github.com/uwdata/mosaic/issues/378.