kylebarron / parquet-wasm

Rust-based WebAssembly bindings to read and write Apache Parquet data
https://kylebarron.dev/parquet-wasm/
Apache License 2.0

Cannot read properties of undefined (reading 'fromIPCStream') #471

Closed: MaTiAtSIE closed this issue 3 months ago

MaTiAtSIE commented 3 months ago

Setup: TypeScript == 4.9.5, Node == 20.0.0, Theia == 1.45.0, webpack == 5.90.3, parquet-wasm == 0.5.0, apache-arrow == 15.0.0

I tried the example code:

const ipcStream = arrow.tableToIPC(rainfall, 'stream');
// the following line crashes
const wasmTable = parquetWASM.Table.fromIPCStream(ipcStream);

and get the error message "Cannot read properties of undefined (reading 'fromIPCStream')". Inspecting ipcStream at runtime shows that it is a Uint8Array with 24400 entries.

My imports:

import * as arrow from 'apache-arrow';
import * as parquetWASM from 'parquet-wasm';

Do you have any idea of what is going wrong?

kylebarron commented 3 months ago

The API changed after 0.5.0. In 0.5 the Table object doesn't exist. You can look at the 0.5.0 README for an example https://github.com/kylebarron/parquet-wasm/tree/v0.5.0?tab=readme-ov-file#example
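A "Cannot read properties of undefined" error on a namespace import usually means the named export is missing from the entry point that was imported. The sketch below shows one way to fail with a clearer message. The helper `getExport` and the stand-in object `fakeModule` are hypothetical names for illustration only; neither is part of parquet-wasm or apache-arrow.

```typescript
// Hypothetical helper (not part of parquet-wasm): look up a named export on a
// module namespace object and fail with a descriptive error if it is missing.
function getExport<T>(mod: Record<string, unknown>, name: string): T {
  const value = mod[name];
  if (value === undefined) {
    const available = Object.keys(mod).join(", ") || "(none)";
    throw new Error(
      `No export named "${name}". Available exports: ${available}`
    );
  }
  return value as T;
}

// Stand-in for an imported module namespace that lacks a `Table` export.
const fakeModule: Record<string, unknown> = {
  readParquet: () => new Uint8Array(0),
  writeParquet: () => new Uint8Array(0),
};

try {
  getExport<unknown>(fakeModule, "Table");
} catch (err) {
  // The message names the missing export and lists what is available.
  console.log((err as Error).message);
}
```

With a real namespace import, the same check would list whatever the chosen entry point actually exports, which makes a version or entry-point mismatch easier to spot.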

MaTiAtSIE commented 3 months ago

Thanks for the hint (it is really good that you respond to issues so quickly 👍). Now I have another problem, this time with the writeParquet function. The following lines cause trouble

const uintArr = tableToIPC(rainfall, 'stream');
const parquetBuffer = writeParquet(
  uintArr, // this should be a table
  writerProperties
);

as writeParquet is expecting a Table:

Argument of type 'Uint8Array' is not assignable to parameter of type 'Table'.
  Type 'Uint8Array' is missing the following properties from type 'Table': free, recordBatch, toFFI, intoFFI, and 3 more.

So I tried

let writerProperties = new WriterPropertiesBuilder();
writerProperties = writerProperties.setCompression(Compression.ZSTD);
const props = writerProperties.build();
const uintArr = tableToIPC(rainfall, 'stream');
const arrTable = Table.fromIPCStream(uintArr);
const parquetBuffer = writeParquet(
    arrTable,
    props
);

and imported from 'parquet-wasm/node/arrow1', which compiles. However, this produces an empty schema. So my question is: how do I call writeParquet with the return value of tableToIPC(rainfall, 'stream')?

BTW: I changed the apache-arrow version to 13.0.0 as this version is also used in parquet-wasm 0.5.0

kylebarron commented 3 months ago

My best suggestion is to let the TypeScript types guide you.

This is working for me with 0.5.0

import { tableFromArrays, tableToIPC } from "apache-arrow";
import * as parquet from "parquet-wasm/node/arrow1";
import { writeFileSync } from "fs";

// Create Arrow Table in JS
const LENGTH = 2000;
const rainAmounts = Float32Array.from({ length: LENGTH }, () =>
  Number((Math.random() * 20).toFixed(1))
);

const rainDates = Array.from(
  { length: LENGTH },
  (_, i) => new Date(Date.now() - 1000 * 60 * 60 * 24 * i)
);

const rainfall = tableFromArrays({
  precipitation: rainAmounts,
  date: rainDates,
});

// Write Arrow Table to Parquet
const writerProperties = new parquet.WriterPropertiesBuilder()
  .setCompression(parquet.Compression.ZSTD)
  .build();
const arrowWasmTable = parquet.Table.fromIPCStream(
  tableToIPC(rainfall, "stream")
);
const parquetBuffer = parquet.writeParquet(arrowWasmTable, writerProperties);
writeFileSync("out.parquet", parquetBuffer);

I can verify that the file loads correctly in Python

[image]

MaTiAtSIE commented 3 months ago

Hello Kyle, thanks for your support and time :). Indeed, your code works, and it turned out that the code I posted earlier works as well. However, the schema is empty when I inspect the table by setting a breakpoint after calling 'tableFromIPC(readParquet(parquetBuffer));'. [image: empty_schema]

kylebarron commented 3 months ago

The entire table is empty. readParquet does not return a Uint8Array; it returns a Table object, so you need to call a method to convert that table to IPC bytes first. It should be something like tableFromIPC(readParquet().intoIPCStream()). The types will guide you.

MaTiAtSIE commented 3 months ago

Perfect, 'tableFromIPC(readParquet(parquetBuffer).intoIPCStream())' worked.
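The mistake discussed above (passing a Table-like object where IPC bytes are expected) can be guarded against with a small normalizing function. This is a minimal sketch, not part of parquet-wasm: the interface `IPCStreamSource`, the helper `toIPCBytes`, and the stand-in `fakeTable` are hypothetical names, and the only library behavior assumed is the `intoIPCStream()` method confirmed in this thread.

```typescript
// Hypothetical structural type: anything that can hand over Arrow IPC stream
// bytes via intoIPCStream(), as the Table object in this thread does.
interface IPCStreamSource {
  intoIPCStream(): Uint8Array;
}

// Hypothetical helper: accept either raw IPC bytes or a Table-like object and
// always return bytes suitable for apache-arrow's tableFromIPC.
function toIPCBytes(result: Uint8Array | IPCStreamSource): Uint8Array {
  if (result instanceof Uint8Array) {
    return result; // already IPC bytes, pass through unchanged
  }
  return result.intoIPCStream(); // Table-like object: convert to bytes first
}

// Stand-in for a Table so the sketch runs without the WASM module.
const fakeTable: IPCStreamSource = {
  intoIPCStream: () => new Uint8Array([1, 2, 3]),
};

console.log(toIPCBytes(fakeTable).length);
console.log(toIPCBytes(new Uint8Array([9, 9])).length);
```

In real code the call would look like tableFromIPC(toIPCBytes(readParquet(parquetBuffer))), assuming readParquet returns either form.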

MaTiAtSIE commented 3 months ago

BTW: Do you have an example of how to use 'readParquetStream'? Background: I have a huge Parquet file (~500 MB) and I only want to read, e.g., the first row or the schema (I don't know whether 'readParquetStream' is the right function for that).

If I do this with the stream:

readParquetStream('file:///C:/Users/marcel.tiator/Projekte/IDE/IDETest4/example.parquet').then((value) =>
{
    console.log('test');
});

I get the following runtime error:

2024-03-12T09:04:47.226Z root ERROR RuntimeError: unreachable
    at wasm://wasm/0132be12:wasm-function[2356]:0x3158b9
    at wasm://wasm/0132be12:wasm-function[4456]:0x393d5b
    at wasm://wasm/0132be12:wasm-function[3105]:0x35a45b
    at wasm://wasm/0132be12:wasm-function[90]:0xbde8c
    at wasm://wasm/0132be12:wasm-function[2045]:0x2e9c24
    at wasm://wasm/0132be12:wasm-function[4892]:0x39bbde
    at __wbg_adapter_28 (...)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)

kylebarron commented 3 months ago

I think the right abstraction is a class like ParquetFile from the pyarrow world in Python that only reads the metadata. We don't have something like that today, but it might come in the next release.
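For context on why a metadata-only reader is feasible: in the Parquet file format, the file metadata sits in a footer at the end of the file, followed by a 4-byte little-endian footer length and the 4-byte magic "PAR1". The sketch below only locates the footer from the last 8 bytes; it does not decode the Thrift-encoded metadata and is not a parquet-wasm API. The function name `parseFooterLength` is hypothetical, and encrypted-footer files (which end in a different magic) are not handled.

```typescript
// Hypothetical helper: given the final bytes of a Parquet file (at least the
// last 8), return the length in bytes of the footer metadata before them.
function parseFooterLength(tail: Uint8Array): number {
  if (tail.length < 8) {
    throw new Error("Need at least the last 8 bytes of the file");
  }
  const last8 = tail.subarray(tail.length - 8);

  // The last 4 bytes must be the ASCII magic "PAR1".
  const magic = [0x50, 0x41, 0x52, 0x31];
  for (let i = 0; i < 4; i++) {
    if (last8[4 + i] !== magic[i]) {
      throw new Error("Not a Parquet file: trailing magic is not PAR1");
    }
  }

  // The 4 bytes before the magic hold the footer length, little-endian.
  const view = new DataView(last8.buffer, last8.byteOffset, 8);
  return view.getUint32(0, true);
}

// Synthetic tail: footer length 16, then "PAR1".
const exampleTail = new Uint8Array([0x10, 0, 0, 0, 0x50, 0x41, 0x52, 0x31]);
console.log(parseFooterLength(exampleTail)); // 16
```

With the footer length known, a reader needs only the last (length + 8) bytes of a 500 MB file to reach the schema, rather than the whole file.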