observablehq / stdlib

The Observable standard library.
https://observablehq.com/@observablehq/standard-library
ISC License

Support Apache Arrow tables in database clients #320

Open domoritz opened 1 year ago

domoritz commented 1 year ago

https://github.com/observablehq/stdlib/blob/6058924f39cb437cf627e5621d493846ebcf6ec7/src/duckdb.js#L58 introduces a copy that may not be needed. As soon as Arrow is supported as an output format, it would be good to remove this call.
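For readers unfamiliar with the cost being discussed, here is a minimal sketch of the difference between materializing row objects and reading columns in place. It uses plain typed arrays as a stand-in for an Arrow table; `columns`, `toRowObjects`, and `sumColumn` are hypothetical names for illustration and are not part of the stdlib or of the Apache Arrow API.

```javascript
// Hypothetical stand-in for a columnar table (NOT the Apache Arrow API):
// each column is a typed array, and all columns share the same length.
const columns = {
  id: Int32Array.from([1, 2, 3]),
  value: Float64Array.from([0.5, 1.5, 2.5])
};
const numRows = columns.id.length;

// Materializing rows: allocates one object per row and copies every cell.
function toRowObjects(cols, n) {
  const names = Object.keys(cols);
  const rows = new Array(n);
  for (let i = 0; i < n; ++i) {
    const row = {};
    for (const name of names) row[name] = cols[name][i];
    rows[i] = row;
  }
  return rows;
}

const rows = toRowObjects(columns, numRows);

// Columnar access: reads directly from the existing buffer,
// with no per-row allocation and no copy.
function sumColumn(column) {
  let total = 0;
  for (let i = 0; i < column.length; ++i) total += column[i];
  return total;
}

const total = sumColumn(columns.value);
```

The row-object path does work proportional to rows times columns before any consumer touches the data, which is the kind of copy the issue proposes to avoid.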

mbostock commented 1 year ago

Retitled this issue to describe the more generic problem: we want to support Apache Arrow tables as a tabular data representation throughout database clients, SQL cells, and data table cells.

domoritz commented 1 year ago

https://github.com/apache/arrow/pull/34939 adds an indexed access proxy for Arrow but the performance isn't great compared to properly adopting Arrow. It would be great to have Arrow support throughout the different clients and cells.
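As a rough illustration of what an indexed access proxy looks like in general, the sketch below wraps columnar data so that `proxied[i]` yields a row-like view without copying. This is not the implementation from the Arrow pull request; `rowProxyArray` and the typed-array `table` are hypothetical, and the example only shows the general technique.

```javascript
// Hypothetical sketch of an indexed-access proxy over columnar data.
// This is NOT the code from the Arrow pull request.
function rowProxyArray(cols) {
  const names = Object.keys(cols);
  const length = names.length ? cols[names[0]].length : 0;

  // Build a lightweight row view whose getters read from the columns lazily.
  const makeRow = (i) => {
    const row = {};
    for (const name of names) {
      Object.defineProperty(row, name, {
        enumerable: true,
        get: () => cols[name][i]
      });
    }
    return row;
  };

  return new Proxy({}, {
    get(target, key) {
      if (key === "length") return length;
      // Property keys arrive as strings, so match non-negative integers.
      if (typeof key === "string" && /^\d+$/.test(key)) {
        const i = Number(key);
        return i < length ? makeRow(i) : undefined;
      }
      return undefined;
    }
  });
}

const table = {
  id: Int32Array.from([1, 2, 3]),
  value: Float64Array.from([0.5, 1.5, 2.5])
};
const proxied = rowProxyArray(table);
```

In this sketch every `proxied[i]` goes through the proxy trap and allocates a fresh row view, which hints at why a proxy can be slower than consuming the columns directly.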

domoritz commented 3 months ago

Now that Arrow is used in a lot more places, I think it may be a good time to revisit this issue. The extra copies add overhead in many places, and I think it would be super awesome if we could just pass Arrow columns directly into Plot (https://github.com/observablehq/plot/issues/191) without making extra copies.

mbostock commented 3 months ago

FWIW, Framework’s DuckDBClient (as of 1.3) returns Apache Arrow tables without materializing array-of-objects. So there’s that.

domoritz commented 3 months ago

Oh nice. I guess you can't just remove the toArray call here for backwards compatibility?

How good is Arrow/columnar data support in Plot these days?

mbostock commented 3 months ago

That’s correct, it wouldn’t be backwards-compatible so I don’t think we are likely to change the behavior in Observable notebooks any time soon. (But eventually we’ll have a way to version control the Observable standard library, and port improvements from Observable Framework back to notebooks.)

Plot uses columnar data internally, so I would rate support as excellent, but we don’t yet have the shorthand syntax so it’s cumbersome to avoid materializing the array-of-objects — you have to pass the column vectors in yourself for each channel. https://github.com/observablehq/plot/issues/191 covers making the syntax more convenient.
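A minimal sketch of the "pass the column vectors in yourself for each channel" approach, under some assumptions: the columns are plain typed arrays (for example, extracted from an Arrow table beforehand), and `columnarMarkArgs` is a hypothetical helper, not part of Plot. The Plot call is shown only as a comment and assumes that a mark accepts array-like data of the right length together with arrays as channel values.

```javascript
// Hypothetical columns; in practice these might come from an Arrow table.
const weight = Float64Array.from([3504, 3693, 3436]);
const economy = Float64Array.from([18, 15, 18]);

// Hypothetical helper (not part of Plot): checks that all channel columns
// have the same length and returns a placeholder data argument plus the
// channel options, without copying the columns.
function columnarMarkArgs(channels) {
  const lengths = Object.values(channels).map((c) => c.length);
  const n = lengths[0] ?? 0;
  if (lengths.some((l) => l !== n)) {
    throw new RangeError("channel columns must have equal lengths");
  }
  return {data: {length: n}, options: {...channels}};
}

const {data: plotData, options: plotOptions} = columnarMarkArgs({
  x: weight,
  y: economy
});

// Assumed usage with Observable Plot (not executed here):
// Plot.dot(plotData, plotOptions).plot()
```

The cumbersome part is visible even in this small case: each channel has to be wired to its column by hand, which is what the shorthand syntax discussed in the linked Plot issue would make more convenient.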