kylebarron / parquet-wasm

Rust-based WebAssembly bindings to read and write Apache Parquet data
https://kylebarron.dev/parquet-wasm/
Apache License 2.0
Can't write an Arrow table if it contains list #606

Open timspro opened 4 days ago

timspro commented 4 days ago

I expected the following code to work, but Table.fromIPCStream() throws "RuntimeError: unreachable" when run in Node.js v20.17.0.

import { tableFromArrays, tableToIPC } from "apache-arrow"
import { Table } from "parquet-wasm"

const table = tableFromArrays({
  column: [[1, 2], [3, 4]],
})
const ipc = tableToIPC(table, "stream")
Table.fromIPCStream(ipc)

Changing "stream" to "file" didn't work either; it fails with the error "Io error: failed to fill whole buffer".

I was able to get other examples working locally that didn't have a list (for example, column: [1, 2] and column: [{a: 1}, {a: 2}]).

It does work if using typed arrays: column: [new Int32Array([1, 2]), new Int32Array([3, 4])]. So, I do have a workaround. However, I originally wanted to write a list of structs with Int32 values and now will have to do a struct of typed arrays. Perhaps that is what is intended.

kylebarron commented 4 days ago

If you compile with the --debug flag turned on, you can see the actual Rust error instead of just RuntimeError: unreachable.

With the test in https://github.com/kylebarron/parquet-wasm/pull/607, the error is:

stderr | tests/js/index.test.ts > should read IPC stream correctly
panicked at /Users/kyle/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-ipc-53.0.0/src/convert.rs:98:30:
called `Option::unwrap()` on a `None` value

So the Rust code is panicking on this line: https://github.com/apache/arrow-rs/blob/5414f1d7c0683c64d69cf721a83c17d677c78a71/arrow-ipc/src/convert.rs#L98

If we load this data in pyarrow, we see:

In [1]: import pyarrow as pa

In [3]: pa.ipc.open_stream("data.arrows").read_all()
Out[3]:
pyarrow.Table
column: list<: double>
  child 0, : double
----
column: [[[1,2],[3,4]]]

So the list's inner field does not have a name set. I'm not sure if that's allowed by the spec (it's rare at least). Either the JS IPC writer or the Rust IPC reader is incorrect.

kylebarron commented 4 days ago

I checked with @jorisvandenbossche and saw that the IPC spec doesn't require a name to be set, so this is an issue on the Rust side. (Though there should be a default name set.)

kylebarron commented 4 days ago

Created https://github.com/apache/arrow-rs/issues/6415. Otherwise, you can work around this by manually setting a field name for any inner lists.

timspro commented 4 days ago

Thanks for the commentary. The type inference done by tableFromArrays() is passing the empty name: https://github.com/apache/arrow/blob/main/js/src/factories.ts#L153.

I was then able to get around the issue by passing in the List type directly:

import { Field, Int32, List, tableFromArrays, tableToIPC, vectorFromArray } from "apache-arrow"
import { Table } from "parquet-wasm"

const table = tableFromArrays({
  column: vectorFromArray(
    [[1, 2], [3, 4]],
    new List(new Field("_", new Int32())) // fails if "" passed instead
  ),
})
const ipc = tableToIPC(table, "stream")
Table.fromIPCStream(ipc)

This is a fine workaround for me.
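
For anyone else hitting the same panic, the diagnosis above (a nested field with an empty name) can be checked from the JS side before the IPC bytes reach parquet-wasm. The following is only a minimal sketch: findUnnamedChildren is a hypothetical helper, not part of apache-arrow or parquet-wasm, and it assumes nothing beyond fields shaped like { name, type: { children } }. Plain objects stand in for the schema here so the snippet runs on its own.

```javascript
// Hypothetical helper (not an apache-arrow or parquet-wasm API).
// Walks schema-like fields and returns the paths of nested fields
// whose name is the empty string. Top-level fields are not flagged.
function findUnnamedChildren(fields, parentPath = null) {
  const unnamed = []
  fields.forEach((field, index) => {
    // Fall back to a positional label when the field has no name.
    const label = field.name === "" ? `<child ${index}>` : field.name
    const path = parentPath === null ? label : `${parentPath}.${label}`
    if (parentPath !== null && field.name === "") {
      unnamed.push(path)
    }
    // Non-nested types have no children, so default to an empty array.
    unnamed.push(...findUnnamedChildren(field.type?.children ?? [], path))
  })
  return unnamed
}

// Stand-in for the schema from the original report:
// a list column whose single child field has an empty name.
const inferredSchema = [
  { name: "column", type: { children: [{ name: "", type: {} }] } },
]

// Stand-in for the schema after the workaround, with the child named "_".
const namedSchema = [
  { name: "column", type: { children: [{ name: "_", type: {} }] } },
]

console.log(findUnnamedChildren(inferredSchema)) // [ 'column.<child 0>' ]
console.log(findUnnamedChildren(namedSchema)) // []
```

With a real table, the call would presumably be findUnnamedChildren(table.schema.fields), assuming the apache-arrow Field and DataType objects expose name, type, and children as the stand-ins above do. A non-empty result suggests the table needs explicit field names, as in the List/Field workaround above.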