kylebarron / arrow-js-ffi

Zero-copy reading of Arrow data from WebAssembly
https://www.npmjs.com/package/arrow-js-ffi
MIT License

[FFI] - RangeError: byte length of BigInt64Array should be a multiple of 8 #129

Open Vectorrent opened 6 days ago

Vectorrent commented 6 days ago

I tried to load a new Parquet table, using the same method I always use, but that method failed with the following error:

(venv) [crow@crow-pc ode]$ node misc/parquetFailing.js 
file:///home/crow/repos/ode/node_modules/arrow-js-ffi/dist/arrow-js-ffi.es.mjs:300
            ? new dataType.ArrayType(copyBuffer(dataView.buffer, dataPtr, length * byteWidth))
              ^

RangeError: byte length of BigInt64Array should be a multiple of 8
    at new BigInt64Array (<anonymous>)
    at parseDataContent (file:///home/crow/repos/ode/node_modules/arrow-js-ffi/dist/arrow-js-ffi.es.mjs:300:15)
    at parseData (file:///home/crow/repos/ode/node_modules/arrow-js-ffi/dist/arrow-js-ffi.es.mjs:175:16)
    at parseData (file:///home/crow/repos/ode/node_modules/arrow-js-ffi/dist/arrow-js-ffi.es.mjs:139:23)
    at parseTable (file:///home/crow/repos/ode/node_modules/arrow-js-ffi/dist/arrow-js-ffi.es.mjs:935:28)
    at file:///home/crow/repos/ode/misc/parquetFailing.js:25:19
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)

Node.js v18.20.4

The error is thrown when loading the table via FFI (parseTable), but it does not happen when we use the original IPC-based approach.
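
For context, the RangeError comes straight from the BigInt64Array constructor (the top frame of the stack trace above). A minimal standalone sketch of that check, with no Arrow involved:

// The constructor that parseDataContent ends up calling requires the backing
// byte length to cover a whole number of 8-byte elements.
const twelveBytes = new ArrayBuffer(12) // 12 is not a multiple of 8
try {
    new BigInt64Array(twelveBytes)
} catch (err) {
    console.log(err.message) // byte length of BigInt64Array should be a multiple of 8
}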

Since I already found a workaround, this bug isn't a huge priority for me. But I thought you guys might want to know about it.

Here is code to reproduce the issue:

import * as arrow from 'apache-arrow'
import { parseTable } from 'arrow-js-ffi'
import { wasmMemory, readParquet } from 'parquet-wasm'

const url =
    'https://huggingface.co/api/datasets/tiiuae/falcon-refinedweb/parquet/default/train/320.parquet'

// This one will succeed
;(async () => {
    const resp = await fetch(url)
    const buffer = new Uint8Array(await resp.arrayBuffer())
    const arrowWasmTable = readParquet(buffer)
    const table = arrow.tableFromIPC(arrowWasmTable.intoIPCStream())
    // intoIPCStream() consumes the Wasm table, and the resulting
    // apache-arrow Table is plain JS, so there is nothing to free here

    console.log('successfully loaded table via parquet-wasm')
})()

// This one will fail
;(async () => {
    const resp = await fetch(url)
    const buffer = new Uint8Array(await resp.arrayBuffer())
    const ffiTable = readParquet(buffer).intoFFI()

    const table = parseTable(
        wasmMemory().buffer,
        ffiTable.arrayAddrs(),
        ffiTable.schemaAddr()
    )
    // free the FFI structs (the parsed apache-arrow Table has no free method)
    ffiTable.free()

    console.log('successfully loaded table via FFI')
})()

Versions:

kylebarron commented 6 days ago

@Vectorrent I'm unable to reproduce this with the following test case:

// issue129.test.ts
import { readFileSync } from "fs";
import { readParquet, wasmMemory } from "parquet-wasm";
import { describe, it, expect } from "vitest";
import * as arrow from "apache-arrow";
import * as wasm from "rust-arrow-ffi";
import { parseTable } from "../src";

wasm.setPanicHook();

describe("issue 129", (t) => {
  const buffer = readFileSync("0320.parquet");

  const ffiTable = readParquet(buffer).intoFFI();
  const memory = wasmMemory();

  const table = parseTable(
    memory.buffer,
    ffiTable.arrayAddrs(),
    ffiTable.schemaAddr()
  );
  ffiTable.free();

  console.log(table.schema);

  it("Should pass", () => {
    expect(true).toBeTruthy();
  });
});
The console.log prints the schema with no error:

Schema {
  fields: [
    Field {
      name: 'content',
      type: [Utf8],
      nullable: true,
      metadata: Map(0) {}
    },
    Field {
      name: 'url',
      type: [Utf8],
      nullable: true,
      metadata: Map(0) {}
    },
    Field {
      name: 'timestamp',
      type: [Timestamp_ [Timestamp]],
      nullable: true,
      metadata: Map(0) {}
    },
    Field {
      name: 'dump',
      type: [Utf8],
      nullable: true,
      metadata: Map(0) {}
    },
    Field {
      name: 'segment',
      type: [Utf8],
      nullable: true,
      metadata: Map(0) {}
    },
    Field {
      name: 'image_urls',
      type: [List],
      nullable: true,
      metadata: Map(0) {}
    }
  ],
  metadata: Map(1) {
    'huggingface' => '{"info": {"features": {"content": {"dtype": "string", "_type": "Value"}, "url": {"dtype": "string", "_type": "Value"}, "timestamp": {"dtype": "timestamp[s]", "_type": "Value"}, "dump": {"dtype": "string", "_type": "Value"}, "segment": {"dtype": "string", "_type": "Value"}, "image_urls": {"feature": {"feature": {"dtype": "string", "_type": "Value"}, "_type": "Sequence"}, "_type": "Sequence"}}}}'
  },
  dictionaries: Map(0) {},
  metadataVersion: 4
}
Vectorrent commented 6 days ago

Strange. I tried your code (i.e. loading from disk), and that fails too. I upgraded to Node v22 and apache-arrow v17.0.0, with no luck. Not sure what else to try; maybe it's an engine thing? I'm running on Linux.
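
If it helps narrow down the engine guess, here is the kind of environment dump worth comparing between a machine where parseTable works and one where it fails (built-in Node globals only):

console.log('node:', process.version)       // e.g. v18.20.4
console.log('v8  :', process.versions.v8)   // underlying engine version
console.log('os  :', process.platform, process.arch)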

Anyway, not a huge priority, since I do have a workaround. Just thought it was worth reporting.

kylebarron commented 6 days ago

Are you able to slice that data (i.e. take the first 5 rows) and save it as a Parquet file that also fails for you? Then we could check that data into Git and add it as a test case to this repo.

It's good that reading from IPC works, but I do want to make sure that arrow-js-ffi is stable!
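
In case it's useful, here is a rough sketch of doing that 5-row slice without leaving JS. It assumes parquet-wasm's Table.fromIPCStream and writeParquet behave as documented and uses apache-arrow's Table.slice; the output file name is just a placeholder, and a rewrite like this may normalize away whatever layout detail triggers the bug, so any tool that can slice and re-save Parquet works just as well:

import { writeFileSync } from 'node:fs'
import * as arrow from 'apache-arrow'
import { readParquet, writeParquet, Table } from 'parquet-wasm'

const url =
    'https://huggingface.co/api/datasets/tiiuae/falcon-refinedweb/parquet/default/train/320.parquet'
const resp = await fetch(url)
const buffer = new Uint8Array(await resp.arrayBuffer())

// Round-trip through IPC (the path that works), keep only the first 5 rows,
// then write them back out as a small Parquet file for the test suite.
const full = arrow.tableFromIPC(readParquet(buffer).intoIPCStream())
const first5 = full.slice(0, 5)
const wasmTable = Table.fromIPCStream(arrow.tableToIPC(first5, 'stream'))
writeFileSync('0320.first5.parquet', writeParquet(wasmTable))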

Vectorrent commented 6 days ago

I sliced 5 rows with PyArrow, saved them to disk, then tried FFI again with the new file. No dice, it still fails.

Here's the sliced file: https://mega.nz/file/CRsFDJrC#3lRSoohQ1kohnqzX0O0TmVtjrsfgKRgj0KMLzxf2nU8

kylebarron commented 6 days ago

Ok, cool, thanks for making that file.

For reference, I find it much easier to zip a Parquet file and attach it to the issue itself here on GitHub.

Vectorrent commented 6 days ago

0320.output.parquet.zip

Oops, didn't realize zip files were supported here. See attached.