kylebarron / parquet-wasm

Rust-based WebAssembly bindings to read and write Apache Parquet data
https://kylebarron.dev/parquet-wasm/
Apache License 2.0
Example doesn't work #465

Closed vaguue closed 3 months ago

vaguue commented 4 months ago

I'm following the steps from README.md and getting this error

file:///Users/seva/seva/node_modules/parquet-wasm/esm/parquet_wasm.js:3695
            wasm.__wbindgen_add_to_stack_pointer(16);
                 ^

TypeError: Cannot read properties of undefined (reading '__wbindgen_add_to_stack_pointer')
    at Table.fromIPCStream (file:///Users/seva/seva/node_modules/parquet-wasm/esm/parquet_wasm.js:3695:18)
    at file:///Users/seva/seva/boosters/check.js:23:33
    at ModuleJob.run (node:internal/modules/esm/module_job:218:25)
    at async ModuleLoader.import (node:internal/modules/esm/loader:329:24)
    at async loadESM (node:internal/process/esm_loader:28:7)
    at async handleMainPromise (node:internal/modules/run_main:120:12)

fspoettel commented 4 months ago

seems related to #412

kylebarron commented 4 months ago

Can you say what you tried? Usually an error like

Cannot read properties of undefined (reading '__wbindgen_add_to_stack_pointer')

means that you didn't initialize the Wasm bundle. If you're using the esm endpoint, you need to await the default export, otherwise the Wasm bundle will never get initialized.

vaguue commented 4 months ago

Tried to run it with node v21.5.0 in esm mode (with "type": "module")

kylebarron commented 4 months ago

In esm mode, you always have to await the default export, or you'll get errors like above where the wasm wasn't instantiated
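To illustrate why the error mentions an undefined object, here is a minimal sketch of how wasm-bindgen-style glue behaves. This is not parquet-wasm's actual code: `init`, `add`, and the tiny hand-assembled module are hypothetical stand-ins for the library's default export and a named export such as `readParquet`.

```javascript
// A tiny hand-written wasm module exporting add(i32, i32) -> i32.
// It stands in for the real (much larger) .wasm bundle.
const WASM_BYTES = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00, // magic + version
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f, // type section
  0x03, 0x02, 0x01, 0x00, // function section
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00, // export "add"
  0x0a, 0x09, 0x01, 0x07, 0x00, 0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b, // code
]);

// The wasm exports live in a module-level variable that stays undefined
// until init() has finished. Importing the JS alone does not set it.
let wasm;

// Hypothetical stand-in for the default export that must be awaited.
async function init() {
  if (wasm === undefined) {
    const { instance } = await WebAssembly.instantiate(WASM_BYTES);
    wasm = instance.exports;
  }
  return wasm;
}

// Hypothetical stand-in for a named export such as readParquet.
function add(a, b) {
  return wasm.add(a, b); // TypeError while wasm is still undefined
}

async function demo() {
  try {
    add(2, 3); // called before init(), like the failing example
  } catch (err) {
    console.log(err instanceof TypeError); // true
  }
  await init();
  console.log(add(2, 3)); // 5
}

demo();
```

The first call fails for the same reason as the trace above: the glue code dereferences a variable that only gets assigned during initialization.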

fspoettel commented 4 months ago

@kylebarron Would you accept a PR that updates the documentation? I also ran into this exact issue when integrating parquet-wasm into an ESM web worker (Next.js). I think this would be very helpful given that, for example, Vite also defaults to ESM as of v5.

vaguue commented 4 months ago

Isn't esm async out-of-box? I always thought the whole point of esm was being able to export things asynchronously, yet I still have to do await somethingImported? Kinda counterintuitive.

kylebarron commented 4 months ago

Would you accept a PR that updates the documentation?

Yes of course! PRs always welcome

Isn't esm async out-of-box?

It is but wasm initialization is a separate async step from just loading the code itself.

Ideally we can fix https://github.com/kylebarron/parquet-wasm/pull/414 and then publish a 0.6 release sometime soon, but I haven't had time to test that.

vaguue commented 3 months ago

Well, can't wait for this to happen. As of now I've had to use Apache Arrow + node-addon-api, so it would be nice to have a stable API for working with Parquet files. What are we going to do with this issue?

kylebarron commented 3 months ago

In the documentation, it says

Note that when using the esm bundles, the default export must be awaited. See here for an example.

It's not clear to me what your issue is. You need to await the default export and then it'll work.

vaguue commented 3 months ago

well, can we just create an export in which we await this default export and reexport the actual module?

vaguue commented 3 months ago

So we can just import the module and be ready to go. Because generally this await init(); thing is kinda dubious for me.

kylebarron commented 3 months ago

well, can we just create an export in which we await this default export and reexport the actual module?

No, as far as I can tell that's not possible. And even if it were, I'd have to somehow modify the default JS binding that wasm-bindgen emits, which sounds horrible.

thing is kinda dubious for me

How is this dubious?

import initWasm, {readParquet} from 'parquet-wasm/esm/arrow1.js';
await initWasm();
readParquet(...);

FWIW sql.js has the same behavior, which they call initSqlJs, so I'm not alone.

A PR is welcome to improve the docs! But otherwise I'm going to close this because it's expected behavior.

vaguue commented 3 months ago

What about doing something like this in myexport.js:

import initWasm from 'parquet-wasm/esm/arrow1.js';
await initWasm();
export * from 'parquet-wasm/esm/arrow1.js';

vaguue commented 3 months ago

What I mean is: why not just create a wrapper around the default wasm-bindgen intricacies to make usage simpler :) I don't know how the wasm-bindgen folks see things, but in my opinion that's kinda against the nature of ESM. I can't see a case where someone imports the module but doesn't await this init thing.

kylebarron commented 3 months ago

The wasm bundle is not fetched until the initWasm call. Therefore, separating it gives a lot more power to users. For example, you might only rarely fetch Parquet files from your app, and therefore wish to defer loading the wasm until the end user needs the functionality.

Additionally, you can pass a URL into initWasm to fetch the wasm from your own server, which can be necessary in some situations.
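As a rough sketch of the deferred-loading idea, one could memoize the init promise so the wasm is fetched at most once, and only when a feature first needs it. Everything here is illustrative: `initWasm` is a stub standing in for the library's default export, and `ensureWasm`, `loadParquetOnDemand`, and the URL are hypothetical names.

```javascript
// Stub standing in for the library's default export. It only counts calls
// so the sketch runs without the real package installed.
let initCalls = 0;
async function initWasm(url) {
  initCalls += 1;
  return { url };
}

// Cache the promise itself, so concurrent callers share one initialization.
let ready;
function ensureWasm(url) {
  if (ready === undefined) {
    ready = initWasm(url);
  }
  return ready;
}

// Hypothetical feature that only rarely needs Parquet support.
async function loadParquetOnDemand() {
  // Assumed URL: wherever your own server hosts the .wasm file.
  await ensureWasm("/static/parquet_wasm_bg.wasm");
  // ...call readParquet(...) here once the bundle is ready
  return initCalls;
}
```

With this shape, pages that never touch Parquet never pay for the wasm download, and repeated calls reuse the same initialization.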

vaguue commented 3 months ago

Correct me if I'm wrong, but in this case one can just import the whole module asynchronously, i.e. await import(...). So you have this "power" even without the init step. But this step overcomplicates Node.js usage.

vaguue commented 3 months ago

import * as arrow from "apache-arrow";
import init, * as parquet from "parquet-wasm";

await init();

// Create Arrow Table in JS
const LENGTH = 2000;
const rainAmounts = Float32Array.from({ length: LENGTH }, () =>
  Number((Math.random() * 20).toFixed(1))
);

const rainDates = Array.from(
  { length: LENGTH },
  (_, i) => new Date(Date.now() - 1000 * 60 * 60 * 24 * i)
);

const rainfall = arrow.tableFromArrays({
  precipitation: rainAmounts,
  date: rainDates,
});

// Write Arrow Table to Parquet

// wasmTable is an Arrow table in WebAssembly memory
const wasmTable = parquet.Table.fromIPCStream(arrow.tableToIPC(rainfall, "stream"));
const writerProperties = new parquet.WriterPropertiesBuilder()
  .setCompression(parquet.Compression.ZSTD)
  .build();
const parquetUint8Array = parquet.writeParquet(wasmTable, writerProperties);

// Read Parquet buffer back to Arrow Table
// arrowWasmTable is an Arrow table in WebAssembly memory
const arrowWasmTable = parquet.readParquet(parquetUint8Array);

// table is now an Arrow table in JS memory
const table = arrow.tableFromIPC(arrowWasmTable.intoIPCStream());
console.log(table.schema.toString());
// Schema<{ 0: precipitation: Float32, 1: date: Date64<MILLISECOND> }>

node:internal/deps/undici/undici:12442
    Error.captureStackTrace(err, this);
          ^

TypeError: fetch failed
    at node:internal/deps/undici/undici:12442:11
    at async __wbg_init (file:///Users/seva/seva/node_modules/parquet-wasm/esm/parquet_wasm.js:5238:51)
    at async file:///Users/seva/seva/boosters/check.js:4:1 {
  cause: Error: not implemented... yet...
      at makeNetworkError (node:internal/deps/undici/undici:5675:35)
      at schemeFetch (node:internal/deps/undici/undici:10563:34)
      at node:internal/deps/undici/undici:10440:26
      at mainFetch (node:internal/deps/undici/undici:10459:11)
      at fetching (node:internal/deps/undici/undici:10407:7)
      at fetch (node:internal/deps/undici/undici:10271:20)
      at Object.fetch (node:internal/deps/undici/undici:12441:10)
      at fetch (node:internal/process/pre_execution:336:27)
      at __wbg_init (file:///Users/seva/seva/node_modules/parquet-wasm/esm/parquet_wasm.js:5233:17)
      at file:///Users/seva/seva/boosters/check.js:4:7
}

Node.js v21.5.0

This is just terrible

kylebarron commented 3 months ago

If you're in node, use the node export
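For context on the trace above: the ESM glue appears to load the wasm with fetch, and Node's built-in fetch (v21.5.0 in the trace) rejects file: URLs. Apart from switching to the node export, a possible workaround is to read the bytes from disk yourself. The sketch below shows only that technique, using a tiny hand-written module instead of the real bundle. Whether parquet-wasm's init accepts raw bytes this way is an assumption to check against the package's typings, and `initFromDisk` and `demo` are hypothetical names.

```javascript
// A tiny hand-written wasm module exporting add(i32, i32) -> i32,
// standing in for the package's .wasm file.
const WASM_BYTES = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00, // magic + version
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f, // type section
  0x03, 0x02, 0x01, 0x00, // function section
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00, // export "add"
  0x0a, 0x09, 0x01, 0x07, 0x00, 0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b, // code
]);

// Read the wasm from the filesystem instead of fetching a file: URL,
// then instantiate it from the raw bytes.
async function initFromDisk(wasmPath) {
  const { readFile } = await import("node:fs/promises");
  const bytes = await readFile(wasmPath);
  const { instance } = await WebAssembly.instantiate(bytes);
  return instance.exports;
}

async function demo() {
  const { mkdtemp, writeFile } = await import("node:fs/promises");
  const { tmpdir } = await import("node:os");
  const { join } = await import("node:path");

  // Write the stand-in module to a temp file so there is something to read.
  const dir = await mkdtemp(join(tmpdir(), "wasm-demo-"));
  const wasmPath = join(dir, "demo.wasm");
  await writeFile(wasmPath, WASM_BYTES);

  const exports = await initFromDisk(wasmPath);
  return exports.add(2, 3);
}

demo().then((sum) => console.log(sum)); // 5
```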