observablehq / stdlib

The Observable standard library.
https://observablehq.com/@observablehq/standard-library
ISC License

Update DuckDB #350

Closed domoritz closed 1 year ago

domoritz commented 1 year ago

Fixes #343

domoritz commented 1 year ago

@mbostock Can you merge this? There are some people who need the new features and bug fixes.

mbostock commented 1 year ago

I tried to upgrade previously and I ran into backwards compatibility issues loading CSV/TSV files. I haven’t had a chance to test the new version to see that everything works. Is there anything we should look out for?

domoritz commented 1 year ago

Ahh, that's not good. There shouldn't be any regressions/API changes to those parts. Maybe you could take a look and let us know here or in https://github.com/duckdb/duckdb-wasm/issues.

domoritz commented 1 year ago

Updated to https://github.com/duckdb/duckdb-wasm/releases/tag/v1.22.0. Can you take another look? There are some cool new features like pivoting we would like to use.

mbostock commented 1 year ago

@domoritz I’m not seeing that it’s published to npm yet. Are you still publishing to npm, or only to GitHub?

https://www.npmjs.com/package/@duckdb/duckdb-wasm?activeTab=versions https://cdn.jsdelivr.net/npm/@duckdb/duckdb-wasm/

The latest version on npm is still 1.21.0, which we can test.

tophtucker commented 1 year ago

On 1.21, CSV and JSON files still seem to work fine but TSV files don't show up.

Invalid Input Error: Error in file "walmarts.tsv": CSV options could not be auto-detected. Consider setting parser options manually. at Xo.insertCSVFromPath

Did something change such that we now have to set the delimiter? (I feel like we used to do that explicitly in the past.)

Another issue in DuckDBClient is that errors other than the "Could not convert" one fail silently:

https://github.com/observablehq/stdlib/blob/17220c91793fbe960fdcc01d420483dcd7e3d1d6/src/duckdb.js#L174-L187

tophtucker commented 1 year ago

Even if I explicitly set delimiter to "\t" in that insertCSVFromPath options object, it doesn't seem to work; I get the same error.
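
For reference, roughly what I tried — the same insertCSVFromPath call that DuckDBClient makes, just with delimiter added; the exact options object may have differed slightly:

await connection.insertCSVFromPath(file.name, {
  name: file.name,
  schema: "main",
  delimiter: "\t" // explicit tab delimiter; still fails with the auto-detect error
});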

domoritz commented 1 year ago

Thanks for looking into it. Do you see the same issues in the CLI client?

mbostock commented 1 year ago

Here is a minimal reproduction of the error: https://observablehq.com/d/9303cecdab5e669d

db = Error: Invalid Input Error: Error in file "us-state-capitals.tsv": CSV options could not be auto-detected. Consider setting parser options manually.

db = {
  const database = await createDuckDB();
  await database.open({query: {castTimestampToDate: true}});
  const url = await file.url();
  await database.registerFileURL(file.name, url);
  const connection = await database.connect();
  await connection.insertCSVFromPath(file.name, {name: file.name, schema: "main"});
  return database;
}

mbostock commented 1 year ago

I was able to get it to load like so:

await connection.insertCSVFromPath(file.name, {
  name: file.name,
  schema: "main",
  columns: {State: new Arrow.Utf8, Capital: new Arrow.Utf8},
  delimiter: "\t",
  detect: false,
  quote: '"',
  escape: '"',
  header: true
});

but clearly I don’t want to set detect: false; I still want to use automatic detection of column names and types when loading TSV.

Also, it took me some time debugging to infer that types should be specified as new Arrow.Utf8. The following alternatives I tried first did not work:

domoritz commented 1 year ago

This seems to be a regression. I filed https://github.com/duckdb/duckdb-wasm/issues/1166 and we can block on that. Thanks for the repro @mbostock.

Also, it took me some time debugging to infer that types should be specified as new Arrow.Utf8.

The type is defined in https://github.com/duckdb/duckdb-wasm/blob/47337ee580265fa52f5db7475b750ac5b2e51456/packages/duckdb-wasm/src/bindings/insert_options.ts#L33 and documented in https://shell.duckdb.org/docs/interfaces/index.CSVInsertOptions.html#columns.

Alternatively, you can write the following (although I think what you wrote with Arrow types is better):

await connection.insertCSVFromPath(file.name, {
  name: file.name,
  schema: "main",
  columnsFlat: [
    { sqlType: "utf8", name: "State" },
    { sqlType: "utf8", name: "Capital" }
  ],
  delimiter: "\t",
  detect: false,
  quote: '"',
  escape: '"',
  header: true
});

mbostock commented 1 year ago

Yeah, I saw that the type was declared as an arrow.DataType, I just didn’t know how to make one of those. I guess I was surprised that you have to new a thing that I expected to be a singleton/static. Why is Arrow.Utf8 a class instead of a static object?

domoritz commented 1 year ago

Good question. @trxcllnt will know why types in Arrow are not singletons.

kimmolinna commented 1 year ago

Nowadays you have to specify the DuckDBDataProtocol. A clip from the README:

await db.registerFileURL('remote.parquet', 'https://origin/remote.parquet', DuckDBDataProtocol.HTTP, false);

And you can skip the last boolean.

This one works (4 equals DuckDBDataProtocol.HTTP):

await database.registerFileURL(file.name, url, 4);

So the problem isn't insertCSVFromPath; what changed is how registerFileURL has to be called.

export enum DuckDBDataProtocol {
    BUFFER = 0,
    NODE_FS = 1,
    BROWSER_FILEREADER = 2,
    BROWSER_FSACCESS = 3,
    HTTP = 4,
    S3 = 5,
}
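
If you import the enum, the equivalent call without the magic number would presumably look like this (assuming duckdb here is the imported @duckdb/duckdb-wasm module):

// Same call as above, using the enum constant instead of the literal 4.
await database.registerFileURL(file.name, url, duckdb.DuckDBDataProtocol.HTTP);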

domoritz commented 1 year ago

Thank you @kimmolinna. I pushed a fix to this pull request.

mbostock commented 1 year ago

Thank you both. I have confirmed that this fixes the problem with TSV files.

The next change that I am seeing is that integer columns are now materialized as BigInt instead of number. Unfortunately that means that basic operations in JavaScript now throw errors. For example, if I run the following query in this notebook:

SELECT * FROM "us-state-capitals.tsv"

and store the results in capitals, and then I try the following JavaScript expression to compute the density:

capitals[0].Population / capitals[0].Area

I get

TypeError: Cannot mix BigInt and other types, use explicit conversions

Now, I can explicitly cast the column to ::INT, but the point is that the default BigInt representation is difficult to use in JavaScript—and also unnecessary here because none of the populations are greater than 2^53. Is there a way to disable BigInt as an automatically inferred type, or limit it to cases where it encounters a number greater than 2^53?
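
For reference, the workarounds I can think of, neither of which I want as the default (sketches only):

// 1. Cast in SQL so DuckDB returns INTEGER instead of BIGINT, e.g.:
//    SELECT Population::INT AS Population, Area FROM "us-state-capitals.tsv"
// 2. Convert explicitly in JavaScript before doing arithmetic:
Number(capitals[0].Population) / capitals[0].Area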

domoritz commented 1 year ago

It looks like the default parsing logic changed.

1.24

[
  {"column_name": "State", "column_type": "VARCHAR", "null": "YES", "key": null, "default": null, "extra": null},
  {"column_name": "Capital", "column_type": "VARCHAR", "null": "YES", "key": null, "default": null, "extra": null},
  {"column_name": "Since", "column_type": "DATE", "null": "YES", "key": null, "default": null, "extra": null},
  {"column_name": "Area", "column_type": "DOUBLE", "null": "YES", "key": null, "default": null, "extra": null},
  {"column_name": "Population", "column_type": "BIGINT", "null": "YES", "key": null, "default": null, "extra": null},
  {"column_name": "MSA/µSA", "column_type": "BIGINT", "null": "YES", "key": null, "default": null, "extra": null},
  {"column_name": "CSA", "column_type": "BIGINT", "null": "YES", "key": null, "default": null, "extra": null},
  {"column_name": "Rank", "column_type": "BIGINT", "null": "YES", "key": null, "default": null, "extra": null}
]

1.17

[
  {"column_name": "State", "column_type": "VARCHAR", "null": "YES", "key": null, "default": null, "extra": null},
  {"column_name": "Capital", "column_type": "VARCHAR", "null": "YES", "key": null, "default": null, "extra": null},
  {"column_name": "Since", "column_type": "DATE", "null": "YES", "key": null, "default": null, "extra": null},
  {"column_name": "Area", "column_type": "DOUBLE", "null": "YES", "key": null, "default": null, "extra": null},
  {"column_name": "Population", "column_type": "INTEGER", "null": "YES", "key": null, "default": null, "extra": null},
  {"column_name": "MSA/µSA", "column_type": "INTEGER", "null": "YES", "key": null, "default": null, "extra": null},
  {"column_name": "CSA", "column_type": "INTEGER", "null": "YES", "key": null, "default": null, "extra": null},
  {"column_name": "Rank", "column_type": "INTEGER", "null": "YES", "key": null, "default": null, "extra": null}
]

https://observablehq.com/d/9df8918a02ad22c6

I'll look into it.

domoritz commented 1 year ago

Looks like this issue started in https://github.com/duckdb/duckdb/pull/5757.

kimmolinna commented 1 year ago

How about using the cast_bigint_to_double query option?

kimmolinna commented 1 year ago

Setting castBigIntToDouble: true works quite nicely. If it's false, the DataTable shows bigint by default, which is nice; if it's true, the DataTable shows integers and there isn't any problem with capitals[0].Population / capitals[0].Area, which equals 1242.3341677096369.

domoritz commented 1 year ago

I updated the pull request to use castBigIntToDouble until DuckDB adds support for parsing with a list of types.
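
Concretely, the change is roughly this (a sketch — the existing open call in DuckDBClient already passes castTimestampToDate, and the exact placement may differ):

// Pass castBigIntToDouble alongside the existing query options when opening the database.
await database.open({
  query: {
    castTimestampToDate: true,
    castBigIntToDouble: true
  }
});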

domoritz commented 1 year ago

Thanks for the great repros. Really helped me find the core of the issues and confirm the fixes.

mbostock commented 1 year ago

Quick question if you have time: if we want to create multiple DuckDB databases, do you recommend creating a separate worker for each, or reusing the worker?

Currently we do something like this every time we need to create a database:

async function createDuckDB() {
  const duckdb = await import(…);
  const bundle = await duckdb.selectBundle({
    mvp: {
      mainModule: …,
      mainWorker: …
    },
    eh: {
      mainModule: …,
      mainWorker: …
    }
  });
  const logger = new duckdb.ConsoleLogger();
  const worker = await duckdb.createWorker(bundle.mainWorker);
  const db = new duckdb.AsyncDuckDB(logger, worker);
  await db.instantiate(bundle.mainModule);
  return db;
}

I’m wondering if it would be more performant to instead reuse the bundle, logger, and worker by having two steps. In pseudocode:

async function loadDuckDB() {
  const duckdb = await import(…);
  const bundle = await duckdb.selectBundle({
    mvp: {
      mainModule: …,
      mainWorker: …
    },
    eh: {
      mainModule: …,
      mainWorker: …
    }
  });
  const logger = new duckdb.ConsoleLogger();
  const worker = await duckdb.createWorker(bundle.mainWorker);
  return {duckdb, logger, worker, bundle};
}
async function createDuckDB() {
  const {duckdb, logger, worker, bundle} = await (duckdbPromise ??= loadDuckDB());
  const db = new duckdb.AsyncDuckDB(logger, worker);
  await db.instantiate(bundle.mainModule);
  return db;
}

Can you advise? Thanks!

Update: I tried this in the linked notebook (toggle the “shared worker” checkbox at the bottom) and it seems that the configuration options are no longer isolated when the worker is shared. Is that a bug? At any rate it looks like we should keep separate workers per database for the time being.

mbostock commented 1 year ago

Superseded by #356.

domoritz commented 1 year ago

I think sharing the bundle and logger is fine, but keeping a separate worker per database could be beneficial since it lets you query the databases in parallel. I don't know all the consequences but @ankoh will know.

ankoh commented 1 year ago

"Creating a worker" also means "instantiating the DuckDB-Wasm module" and "using DuckDB-Wasm with separate Memory" Depends on your isolation requirements, but sharing the worker will be more efficient (and also shares things like the buffer manager between the users).

trxcllnt commented 1 year ago

@mbostock Why is Arrow.Utf8 a class instead of a static object?

Because some of the data types do take params and need to be unique instances. So rather than having some types require construction and others not (i.e. {type: Utf8} vs. {type: new DenseUnion(typeCodes)}), we made them all constructors.

We can of course optimize the ones that don't store state under the hood to be static singleton instances, change them to top-level functions instead of constructors, etc., but that was a lower priority than getting the rest of the library implemented.