observablehq / plot

A concise API for exploratory data visualization implementing a layered grammar of graphics
https://observablehq.com/plot/
ISC License

columnar support for Arrow tables #2030

Closed Fil closed 2 months ago

Fil commented 7 months ago

Detect Arrow tables and access their columns directly wherever we can: first and foremost, by not materializing the data in mark.initialize, and by routing a string accessor to getChild.

We don't add apache-arrow as a dependency (which means detection is done with duck typing of the methods we use… we could reinforce this a bit if needed, but I think that's fine).
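A minimal sketch of what the duck typing and the string-accessor routing could look like. The names isArrowTable and columnOf are hypothetical (they are not taken from this PR), and the exact set of methods checked is an assumption; getChild, toArray and schema.fields are members that Arrow JS tables expose.

```javascript
// Hypothetical helper: recognize an Arrow table by the members we rely on,
// without importing apache-arrow.
function isArrowTable(data) {
  return (
    data != null &&
    typeof data.getChild === "function" &&
    typeof data.toArray === "function" &&
    data.schema != null &&
    Array.isArray(data.schema.fields)
  );
}

// Hypothetical helper: resolve a channel accessor against the data.
function columnOf(data, accessor) {
  if (typeof accessor === "string") {
    // Arrow table: hand back the column itself, no rows are materialized.
    if (isArrowTable(data)) return data.getChild(accessor);
    // Plain iterable of objects: read the named field from each datum.
    return Array.from(data, (d) => d[accessor]);
  }
  // Function accessor: evaluate it on each datum (or Arrow row).
  return Array.from(data, accessor);
}
```

With a plain array, columnOf([{x: 1}, {x: 2}], "x") yields [1, 2]; with an Arrow table the same call would return whatever getChild("x") returns.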

The story is a bit more complicated in the group transform (and maybe other places?) because we're actually creating new output data, which currently uses take and map to build a dataset that "looks like" the original data array.

In the Arrow table case, we might want take to return a "filtered" table, but I don't think such a thing exists (API reference). Instead, we return an array of Row objects (which are Proxy objects into the columns); that's probably the best option memory-wise, even though I don't like the look of it. In any case, they're easy to convert to regular objects by writing ({...d}).
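To illustrate the take-then-spread idea, here is a rough sketch. Both take and mockTable are hypothetical: mockTable is only a stand-in that mimics an Arrow table's get(i) by returning a Proxy over the columns, and it is not how Arrow implements its rows.

```javascript
// Hypothetical stand-in for an Arrow table: get(i) returns a Proxy row that
// reads from the columns on demand instead of copying values.
function mockTable(columns) {
  const names = Object.keys(columns);
  const handler = (i) => ({
    get: (_, key) => columns[key]?.[i],
    ownKeys: () => names,
    getOwnPropertyDescriptor: (_, key) =>
      names.includes(key)
        ? {value: columns[key][i], enumerable: true, configurable: true}
        : undefined
  });
  return {get: (i) => new Proxy({}, handler(i))};
}

// Hypothetical take: pick the rows at the given indexes from either an
// Arrow-like table (via get) or a plain array (via bracket access).
function take(data, index) {
  return typeof data.get === "function"
    ? Array.from(index, (i) => data.get(i))
    : Array.from(index, (i) => data[i]);
}

const rows = take(mockTable({x: [10, 20, 30], y: ["a", "b", "c"]}), [2, 0]);
const plain = rows.map((d) => ({...d})); // spread copies each row into a regular object
console.log(plain); // [ { x: 30, y: 'c' }, { x: 10, y: 'a' } ]
```

The rows hold only an index into the columns, which is why this is cheap memory-wise; the spread is the point at which values actually get copied.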

For a more thorough investigation of the places where we assume that values are arrays, I ran all the unit tests with the data replaced by an Arrow table (in Mark and facet data). This resulted in 25 "changed" snapshots (see diff); all of them, it seems, are only due to dates that change during the conversion to Arrow. None of them were crashes.

I still need to investigate why the dates are modified (I'm thinking that may be because Arrow coerces them to Date32<day> — nope, they are DateMillisecond).

closes #191

cc: @jheer

Fil commented 7 months ago

The issue with dates can be reduced to this (which is independent of Plot):

import * as Arrow from "apache-arrow";

// Round-trip a single date through an Arrow table, spreading each Row into a plain object.
const data = [{date: new Date(1950, 1, 2)}];
console.log(data, [...Arrow.tableFromJSON(data)].map((d) => ({...d})));

// [ { date: 1950-02-01T23:00:00.000Z } ] [ { date: 1950-03-23T16:02:47.296Z } ]

You can see that the date is off by about 50 days. Am I missing something, or should I open an issue on https://github.com/apache/arrow, @domoritz?

version information:

apache-arrow@^15.0.2:
  resolved "https://registry.yarnpkg.com/apache-arrow/-/apache-arrow-15.0.2.tgz#d87c6447d64d6fab34aa70119362680b6617ce63"

jheer commented 7 months ago

Hmm, we’ve had success reading Date values in DuckDB transferred in Arrow format. So I’d be sure to check first whether this is an encoding problem (Date to Arrow) or a decoding problem (Arrow to Date). Some encoders in Arrow JS (e.g., for Decimal) are known to be broken. I have sometimes had to use DuckDB or pyarrow to generate Arrow bytes for testing.

On a related note, in Mosaic we special case Timestamp types as Arrow JS returns those as numbers; we then instantiate Date objects ourselves.

Fil commented 7 months ago

The internal data is like this:

Data {
  type: DateMillisecond [Date] { typeId: 8, unit: 1 },
  children: [],
  dictionary: undefined,
  offset: 0,
  length: 1,
  _nullCount: 0,
  stride: 2,
  values: Int32Array(16) [
    -1314529784,
    -146,
    0,
    …

The date is encoded in the first two 32-bit numbers, which I decode manually to (-146*(2**32)) - 1498374784 = -628563600000, which is my initial date.

So it's apparently the decoding that fails.
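For reference, the manual decoding can be written out as a small sketch. The function names are hypothetical. The second function is only one possible explanation, and it is an assumption: if a decoder reinterprets the low word as unsigned, a negative low word shifts the result by exactly 2**32 ms, which is also the gap between the two timestamps printed in the earlier repro.

```javascript
const TWO_32 = 2 ** 32;

// The manual decoding above: both words are taken as signed 32-bit integers.
function decodeSignedLow(low, high) {
  return high * TWO_32 + low;
}

// Assumption, not confirmed in this thread: the low word is reinterpreted
// as unsigned before being combined with the high word.
function decodeUnsignedLow(low, high) {
  return high * TWO_32 + (low >>> 0);
}

const low = -1498374784;
const high = -146;

console.log(decodeSignedLow(low, high)); // -628563600000
console.log(decodeUnsignedLow(low, high) - decodeSignedLow(low, high)); // 4294967296

// The two dates from the earlier repro, written in UTC, differ by the same amount.
const expected = Date.UTC(1950, 1, 1, 23);
const observed = Date.UTC(1950, 2, 23, 16, 2, 47, 296);
console.log(observed - expected === TWO_32); // true
```

2**32 ms is a little under 50 days, which matches the size of the offset reported above.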

jheer commented 7 months ago

> The date is encoded in the first two 32-bit numbers, which I decode manually to (-146*(2**32)) - 1498374784 = -628563600000, which is my initial date.
>
> So it's apparently the decoding that fails.

I think DuckDB produces either Date32 or Timestamp values, so I haven't tripped over this yet! Thanks for documenting it.

Fil commented 7 months ago

I've reported the Date issue at https://github.com/apache/arrow/issues/40718. I think it is orthogonal to this PR, since I can get the same error with Plot#main and an Arrow table, though it does not explain all 25 differences :(

Fil commented 7 months ago

Tested with https://github.com/apache/arrow/pull/40725: everything works smoothly (except for the test "mark data parallel to facet data triggers a warning", which is not relevant). Thanks for the super quick fix @trxcllnt and @domoritz!

Fil commented 7 months ago

In the PR that fixes the Date bug (https://github.com/apache/arrow/pull/40725), @domoritz also changes the .get(i) value accessor into .at(i), which means that everywhere we use (array)[i] vs (arrow vector).get(i), we could now unify with (array or vector).at(i). This is probably the change that will help us the most.
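A minimal sketch of such a unified accessor, with a fallback for vectors that only offer get(i). The name valueAt is hypothetical, and the fallback order is an assumption about how one might support Arrow versions released before 40725.

```javascript
// Hypothetical helper: read element i from an array, a typed array, or an
// Arrow-like vector, preferring at(i) when it is available.
function valueAt(values, i) {
  if (typeof values.at === "function") return values.at(i); // arrays, typed arrays, vectors with at
  if (typeof values.get === "function") return values.get(i); // vectors that only have get
  return values[i]; // last resort: plain indexed access
}

console.log(valueAt([10, 20, 30], 1)); // 20
console.log(valueAt(Float64Array.of(1.5, 2.5), 0)); // 1.5
console.log(valueAt({get: (i) => i * 2}, 4)); // 8
```

One difference to keep in mind: at(-1) counts from the end on arrays, so the branches are only interchangeable for non-negative indexes.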

An example of a (custom) data transform that breaks with arrow tables is here. It uses data.flatMap, which does not exist on the "fake array". We could add it easily.
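As a sketch of how small the addition could be, here is a standalone flatMap that works on any iterable. This is a hypothetical free function rather than a method on the "fake array", and it only assumes that the data can be iterated.

```javascript
// Hypothetical helper: flatMap over any iterable, flattening one level of
// returned arrays in the same spirit as Array.prototype.flatMap.
function flatMap(data, f) {
  const out = [];
  let i = 0;
  for (const d of data) {
    const result = f(d, i++, data);
    if (Array.isArray(result)) {
      for (const value of result) out.push(value);
    } else {
      out.push(result);
    }
  }
  return out;
}

console.log(flatMap(new Set([1, 2]), (d) => [d, d * 10])); // [ 1, 10, 2, 20 ]
```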

(I don't think we need to rush this, we should probably wait for 40725 to land.)

domoritz commented 7 months ago

Yeah, compatibility with native arrays was my main goal with supporting at. I'd be supportive of adding map and flatMap.

The next Arrow release is in April, so we could try to get the change in there (releases happen roughly every three months). However, you probably don't want to rely on everyone using the latest Arrow library, so having a stopgap until the new version is common makes sense.

Fil commented 2 months ago

continued in (and superseded by) #2115