MaTiAtSIE closed this issue 3 months ago
The API changed after 0.5.0. In 0.5 the Table object doesn't exist. You can look at the 0.5.0 README for an example: https://github.com/kylebarron/parquet-wasm/tree/v0.5.0?tab=readme-ov-file#example
Thanks for the hint (it is really good that you answer issues so quickly 👍). Now I have another problem, this time with the writeParquet function. The following lines cause trouble:
const uintArr = tableToIPC(rainfall, 'stream');
const parquetBuffer = writeParquet(
  uintArr, // this should be a table
  writerProperties
);
as writeParquet is expecting a Table:
Argument of type 'Uint8Array' is not assignable to parameter of type 'Table'.
Type 'Uint8Array' is missing the following properties from type 'Table': free, recordBatch, toFFI, intoFFI, and 3 more.
So I tried
let writerProperties = new WriterPropertiesBuilder();
writerProperties = writerProperties.setCompression(Compression.ZSTD);
const props = writerProperties.build();
const uintArr = tableToIPC(rainfall, 'stream');
const arrTable = Table.fromIPCStream(uintArr);
const parquetBuffer = writeParquet(
  arrTable,
  props
);
and imported from 'parquet-wasm/node/arrow1', which compiles. However, this produces an empty schema. So the question is: how do I call writeParquet with the return value of tableToIPC(rainfall, 'stream')?
BTW: I changed the apache-arrow version to 13.0.0, as this version is also used in parquet-wasm 0.5.0.
The best suggestion is to use the TypeScript types to guide you.
This is working for me with 0.5.0:
import { tableFromArrays, tableToIPC } from "apache-arrow";
import * as parquet from "parquet-wasm/node/arrow1";
import { writeFileSync } from "fs";
// Create Arrow Table in JS
const LENGTH = 2000;
const rainAmounts = Float32Array.from({ length: LENGTH }, () =>
  Number((Math.random() * 20).toFixed(1))
);
const rainDates = Array.from(
  { length: LENGTH },
  (_, i) => new Date(Date.now() - 1000 * 60 * 60 * 24 * i)
);
const rainfall = tableFromArrays({
  precipitation: rainAmounts,
  date: rainDates,
});
// Write Arrow Table to Parquet
const writerProperties = new parquet.WriterPropertiesBuilder()
  .setCompression(parquet.Compression.ZSTD)
  .build();
const arrowWasmTable = parquet.Table.fromIPCStream(
  tableToIPC(rainfall, "stream")
);
const parquetBuffer = parquet.writeParquet(arrowWasmTable, writerProperties);
writeFileSync("out.parquet", parquetBuffer);
I can verify that the file loads correctly in Python.
Hello Kyle, thanks for your support and time :). Indeed, your code works, and it turned out that the code I posted earlier works as well. However, the schema is empty when I inspect the table by setting a breakpoint after calling 'tableFromIPC(readParquet(parquetBuffer));'.
The entire table is empty. readParquet does not return a Uint8Array, it returns a Table object, so you need to call a method to convert that table to IPC bytes first. It should be something like tableFromIPC(readParquet().intoIPCStream()). The types will guide you.
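The mismatch described above (raw IPC bytes versus a Table object) can also be handled defensively in user code. The following is a minimal sketch, assuming only what this thread states: the object returned by readParquet exposes an intoIPCStream() method that yields a Uint8Array. The IPCConvertible interface, the toIPCBytes helper, and the fakeTable stand-in are hypothetical names for illustration and are not part of parquet-wasm.

```typescript
// Hypothetical minimal shape of the object readParquet returns in this thread.
// Only the intoIPCStream() method mentioned above is assumed to exist.
interface IPCConvertible {
  intoIPCStream(): Uint8Array;
}

// Hypothetical helper: accept either raw IPC bytes or a table-like object
// and always return bytes that can be passed to apache-arrow's tableFromIPC.
function toIPCBytes(result: Uint8Array | IPCConvertible): Uint8Array {
  if (result instanceof Uint8Array) {
    // Already IPC bytes, nothing to convert.
    return result;
  }
  // Table-like object: convert it to IPC stream bytes first.
  return result.intoIPCStream();
}

// Stand-in for a wasm Table, used only to exercise the helper.
const fakeTable: IPCConvertible = {
  intoIPCStream: () => new Uint8Array([1, 2, 3]),
};
```

With the real library, the call would presumably look like tableFromIPC(toIPCBytes(readParquet(parquetBuffer))), which matches the intoIPCStream() suggestion above.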
Perfect, 'tableFromIPC(readParquet(parquetBuffer).intoIPCStream())' worked.
BTW: Do you have any example to use 'readParquetStream'? Background: I have a huge parquet file (~500MB) and I only want to read, e.g., the first line or the schema (I don't know if the 'readParquetStream' is the right function for that).
If I do this with the stream:
readParquetStream('file:///C:/Users/marcel.tiator/Projekte/IDE/IDETest4/example.parquet').then((value) => {
  console.log('test');
});
I get the following runtime error:
2024-03-12T09:04:47.226Z root ERROR RuntimeError: unreachable
at wasm://wasm/0132be12:wasm-function[2356]:0x3158b9
at wasm://wasm/0132be12:wasm-function[4456]:0x393d5b
at wasm://wasm/0132be12:wasm-function[3105]:0x35a45b
at wasm://wasm/0132be12:wasm-function[90]:0xbde8c
at wasm://wasm/0132be12:wasm-function[2045]:0x2e9c24
at wasm://wasm/0132be12:wasm-function[4892]:0x39bbde
at __wbg_adapter_28 (...)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
I think the right abstraction is a class like ParquetFile from the pyarrow world in Python that only reads the metadata. We don't have something like that today, but it might come in the next release.
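Until a metadata-only reader exists, the file layout itself allows a cheap check without loading the whole file. This is a minimal sketch, assuming the standard Parquet layout in which a file ends with a 4-byte little-endian footer length followed by the magic bytes PAR1. The footerLength helper is a hypothetical name and not part of parquet-wasm; it only reports how large the footer metadata is and does not decode the schema, which is Thrift-encoded inside that footer.

```typescript
// ASCII codes for the trailing magic bytes "PAR1".
const PARQUET_MAGIC = [0x50, 0x41, 0x52, 0x31];

// Hypothetical helper: given the LAST 8 bytes of a Parquet file,
// return the length in bytes of the footer metadata that precedes them.
function footerLength(tail: Uint8Array): number {
  if (tail.length !== 8) {
    throw new Error("expected exactly the 8 trailing bytes of the file");
  }
  // Bytes 4..7 must spell the magic, otherwise this is not a plain Parquet file.
  for (let i = 0; i < 4; i++) {
    if (tail[4 + i] !== PARQUET_MAGIC[i]) {
      throw new Error("missing PAR1 magic at end of file");
    }
  }
  // Bytes 0..3 hold the footer length as an unsigned little-endian 32-bit integer.
  return new DataView(tail.buffer, tail.byteOffset, 8).getUint32(0, true);
}
```

The eight bytes can be obtained with a positioned read (for example Node's fs.readSync with a position of file size minus 8), so a file of ~500MB does not have to be loaded into memory just to locate its metadata.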
Setup: TypeScript == 4.9.5, node == 20.0.0, theia == 1.45.0, Webpack == 5.90.3, parquet-wasm == 0.5.0, apache-arrow == 15.0.0
I tried the example code:
and get the error message "Cannot read properties of undefined (reading 'fromIPCStream')". Inspecting ipcStream at runtime reveals that it is a Uint8Array with 24400 entries.
My imports:
Do you have any idea of what is going wrong?