observablehq / feedback

Customer submitted bugs and feature requests

DuckDBClient.of does not create tables in database when Arrow tables are passed in parameter #623

Open keller-mark opened 3 months ago

keller-mark commented 3 months ago

Is your feature request related to a problem? Please describe.

Is DuckDBClient.of() intended to work with arrow Table objects?

The following code snippet returns an empty list of tables.

(await DuckDBClient.of({
    arrowTable: arrow.tableFromArrays({
      col1: [1, 2, 3],
      col2: ["one", "two", "three"],
    }),
  })).describeTables()

All examples I can find use FileAttachments. The stdlib source code seems to indicate that passing Arrow tables is possible, but I cannot find examples or tests to reference.


Additional context

Minimal reproducer: https://observablehq.com/d/e21c08e832074f40

mootari commented 3 months ago

The problem is that DuckDB loads its own instance of Arrow and then does instanceof checks against those symbols. If you load Arrow from a different URL it becomes a separate module and those checks will fail.
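The effect described above can be reproduced without Arrow or DuckDB. The sketch below is only an illustration: `loadArrowLike` is a hypothetical stand-in for importing the same library from two different URLs, and the `Table` class is a toy, not Arrow's real one.

```javascript
// Simulate loading the "same" library twice: every call returns a fresh,
// independent Table class, just as two module URLs yield two module instances.
function loadArrowLike() {
  class Table {
    constructor(columns) {
      this.columns = columns;
    }
  }
  return { Table };
}

const arrowA = loadArrowLike(); // stands in for the copy DuckDB loaded
const arrowB = loadArrowLike(); // stands in for the copy the notebook loaded

const table = new arrowB.Table({ col1: [1, 2, 3] });

// instanceof compares against one specific constructor's prototype,
// so it only succeeds for the copy that actually created the object.
const sameCopy = table instanceof arrowB.Table;
const otherCopy = table instanceof arrowA.Table;

console.log(sameCopy, otherCopy); // true false
```

Both classes are structurally identical, yet the object only passes the check against the copy that constructed it.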

The workaround (for now) is to load the exact same module:

arrow = import('https://cdn.observableusercontent.com/npm/apache-arrow@11.0.0/+esm')

mbostock commented 3 months ago

For what it’s worth, this isn’t an issue with Observable Framework because we expressly override dependency resolution to ensure a consistent version of Apache Arrow. It would be better if DuckDB used duck testing instead of instanceof, though. (And c’mon, you’d think DuckDB would know to use “duck” testing… 🦆)

https://github.com/observablehq/framework/blob/84d3e5c3a4809d0062dabc815c94402eaef9c838/src/npm.ts#L161-L166
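As a rough illustration of the duck testing suggested above: the check below looks at an object's shape rather than its constructor. `isTableLike` is a hypothetical helper, not something DuckDB or Arrow provides, and the choice of `schema.fields` and `numRows` as the members to probe is an assumption about what a consumer would need.

```javascript
// Hypothetical structural check: accept any object that exposes the members
// we rely on, regardless of which module copy created it.
function isTableLike(value) {
  return (
    value !== null &&
    typeof value === "object" &&
    typeof value.schema === "object" &&
    value.schema !== null &&
    Array.isArray(value.schema.fields) &&
    typeof value.numRows === "number"
  );
}

// A plain object with the right shape passes; unrelated values do not.
const looksLikeTable = isTableLike({ schema: { fields: [] }, numRows: 0 });
const plainArray = isTableLike([1, 2, 3]);

console.log(looksLikeTable, plainArray); // true false
```

The trade-off is that a structural check accepts anything with the right shape, so it is looser than instanceof by design.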

Though, there is a separate issue with Observable Framework which is that db.describeTables is currently broken because we’ve switched to returning Arrow tables from queries for performance. But I have a fix for that latter issue up at https://github.com/observablehq/framework/pull/1068.

keller-mark commented 3 months ago

Thanks for the info and the workaround! It seems the instanceof checks are happening within the Arrow source code (possibly one of these lines) and not the DuckDB source code (insertArrowTable source). So another workaround is to run arrow.tableToIPC (followed by conn.insertArrowFromIPCStream) using the same Arrow library instance that was used to run arrow.tableFromArrays (example in notebook). Since the resulting buffer is a Uint8Array, there are no instanceof issues (though, by the same token, it could potentially be any Uint8Array).
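A minimal sketch of that IPC workaround, written as a helper so the dependencies are explicit. `insertTableViaIPC` is a hypothetical name; `arrow` is assumed to be the same module instance that built the table, and `conn` is assumed to be an already-open duckdb-wasm connection (how it is obtained from DuckDBClient is not shown here).

```javascript
// Serialize with the caller's own Arrow copy, then hand DuckDB raw bytes.
// No Arrow objects cross the module boundary, so no instanceof check applies.
async function insertTableViaIPC(arrow, conn, table, name) {
  // tableToIPC returns a Uint8Array in the Arrow IPC stream format.
  const buffer = arrow.tableToIPC(table, "stream");
  // duckdb-wasm creates a table called `name` from the IPC bytes.
  await conn.insertArrowFromIPCStream(buffer, { name });
  return buffer.byteLength;
}
```

In a notebook this might be called as `await insertTableViaIPC(arrow, conn, arrow.tableFromArrays({col1: [1, 2, 3]}), "arrowTable")`, again assuming `conn` is available.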

mootari commented 1 month ago

Related: https://github.com/duckdb/duckdb-wasm/issues/1708#issuecomment-2065351615