Vectorrent opened this issue 6 days ago

I tried to load a new Parquet table using the same method I always use, but this time it failed with the following error:

The error is thrown when loading the table with FFI, but it does not happen when we use the original implementation.

Since I already found a workaround, this bug isn't a huge priority for me. But I thought you guys might want to know about it.

Here is some reproducible code:

Versions:
@Vectorrent I'm unable to reproduce this. With this test case:
```ts
// issue129.test.ts
import { readFileSync } from "fs";
import { readParquet, wasmMemory } from "parquet-wasm";
import { describe, it, expect } from "vitest";
import * as arrow from "apache-arrow";
import * as wasm from "rust-arrow-ffi";
import { parseTable } from "../src";

wasm.setPanicHook();

describe("issue 129", () => {
  // Read the Parquet file and export it over the Arrow C data interface
  const buffer = readFileSync("0320.parquet");
  const ffiTable = readParquet(buffer).intoFFI();

  // Parse the table straight out of Wasm memory via the FFI pointers
  const memory = wasmMemory();
  const table = parseTable(
    memory.buffer,
    ffiTable.arrayAddrs(),
    ffiTable.schemaAddr()
  );
  ffiTable.free();
  console.log(table.schema);

  it("Should pass", () => {
    expect(true).toBeTruthy();
  });
});
```
The test passes for me, and it logs this schema:

```
Schema {
  fields: [
    Field {
      name: 'content',
      type: [Utf8],
      nullable: true,
      metadata: Map(0) {}
    },
    Field {
      name: 'url',
      type: [Utf8],
      nullable: true,
      metadata: Map(0) {}
    },
    Field {
      name: 'timestamp',
      type: [Timestamp_ [Timestamp]],
      nullable: true,
      metadata: Map(0) {}
    },
    Field {
      name: 'dump',
      type: [Utf8],
      nullable: true,
      metadata: Map(0) {}
    },
    Field {
      name: 'segment',
      type: [Utf8],
      nullable: true,
      metadata: Map(0) {}
    },
    Field {
      name: 'image_urls',
      type: [List],
      nullable: true,
      metadata: Map(0) {}
    }
  ],
  metadata: Map(1) {
    'huggingface' => '{"info": {"features": {"content": {"dtype": "string", "_type": "Value"}, "url": {"dtype": "string", "_type": "Value"}, "timestamp": {"dtype": "timestamp[s]", "_type": "Value"}, "dump": {"dtype": "string", "_type": "Value"}, "segment": {"dtype": "string", "_type": "Value"}, "image_urls": {"feature": {"feature": {"dtype": "string", "_type": "Value"}, "_type": "Sequence"}, "_type": "Sequence"}}}}'
  },
  dictionaries: Map(0) {},
  metadataVersion: 4
}
```
Strange. I tried your code (i.e. loading from disk), and that fails too. I upgraded to Node v22 and to apache-arrow v17.0.0, with no luck. Not sure what else to try; maybe it's an engine thing? I'm running on Linux.
Anyway, not a huge priority, since I do have a workaround. Just thought it was worth reporting.
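For reference, the workaround is just going through Arrow IPC instead of FFI. A minimal sketch of it, using parquet-wasm's `intoIPCStream()` and apache-arrow's `tableFromIPC` (not my exact code):

```ts
// Workaround sketch: copy the table out of Wasm memory as Arrow IPC
// bytes and parse those with apache-arrow, instead of passing raw FFI
// pointers through arrow-js-ffi.
import { readFileSync } from "fs";
import { readParquet } from "parquet-wasm";
import { tableFromIPC } from "apache-arrow";

const buffer = readFileSync("0320.parquet");
const ipcBytes = readParquet(buffer).intoIPCStream();
const table = tableFromIPC(ipcBytes); // no arrow-js-ffi involved
console.log(table.schema); // parses fine this way
```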
Are you able to slice that data (i.e. take the first 5 rows) and save it as a Parquet file that also fails for you? Then we could check that data into Git and add it as a test case to this repo.
It's good that reading from IPC works, but I do want to make sure that arrow-js-ffi is stable!
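Something along these lines should work for the slicing, whatever tooling you use; a sketch with parquet-wasm and apache-arrow, where the round trip through IPC and the output filename are my assumptions:

```ts
// Hypothetical slicing sketch: read the full Parquet file, take the
// first 5 rows, and write a small Parquet file we can commit to Git.
import { readFileSync, writeFileSync } from "fs";
import { readParquet, writeParquet, Table } from "parquet-wasm";
import { tableFromIPC, tableToIPC } from "apache-arrow";

// Round-trip through Arrow IPC so we can use apache-arrow's slice().
const full = tableFromIPC(readParquet(readFileSync("0320.parquet")).intoIPCStream());
const first5 = full.slice(0, 5); // first 5 rows

// Convert back to a parquet-wasm Table and serialize to Parquet bytes.
const wasmTable = Table.fromIPCStream(tableToIPC(first5, "stream"));
writeFileSync("0320-small.parquet", writeParquet(wasmTable));
```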
I sliced 5 rows with PyArrow, saved them to disk, then tried FFI again with the new file. No dice, it still fails.
Here's the sliced file: https://mega.nz/file/CRsFDJrC#3lRSoohQ1kohnqzX0O0TmVtjrsfgKRgj0KMLzxf2nU8
Ok, cool, thanks for making that file.
For reference, I find it much easier to zip a Parquet file and share it via GitHub in the issue itself.
Oops, didn't realize zip files were supported here. See attached.