kylebarron / parquet-wasm

Rust-based WebAssembly bindings to read and write Apache Parquet data
https://kylebarron.dev/parquet-wasm/
Apache License 2.0

issue using esm or bundled versions of parquet-wasm using esbuild #488

Closed cornhundred closed 2 months ago

cornhundred commented 2 months ago

Hi, I am seeing errors when I try to import parquet-wasm using the bundler esbuild.

Similar to this issue, https://github.com/kylebarron/parquet-wasm/issues/486, I am seeing this error

94924d21-558e-468e-97d6-52e78f9ca56d:1552 Error loading data: TypeError: Cannot read properties of undefined (reading '__wbindgen_add_to_stack_pointer')
    at TP (94924d21-558e-468e-97d6-52e78f9ca56d:1552:48578)
    at c (94924d21-558e-468e-97d6-52e78f9ca56d:1552:50503)
    at b (94924d21-558e-468e-97d6-52e78f9ca56d:1552:52078)
    at async Object.DV [as render] (94924d21-558e-468e-97d6-52e78f9ca56d:1559:1493)
    at async widget.js:363:17

when I import parquet-wasm like this

import * as pq from "parquet-wasm/esm/arrow2";

without awaiting the default function. However, when I run await pq.default(), I see this error

widget.js:237 TypeError: Failed to construct 'URL': Invalid URL
    at UP (542aeba4-c71d-47a1-8e67-2dcff8e6ca23:1553:14046)
    at Object.oz [as render] (542aeba4-c71d-47a1-8e67-2dcff8e6ca23:1560:1458)
    at async widget.js:363:17

If I try to switch to using the bundler build like this

import * as pq from "parquet-wasm/bundler/arrow2"

(which required using the wasmLoader for esbuild and setting the target to esnext to enable top-level await), I get this error

widget.js:237 TypeError: Failed to construct 'URL': Invalid URL
    at LC (8a312991-81c6-426e-b14c-067bcbe5f62b:1553:10554)
    at 8a312991-81c6-426e-b14c-067bcbe5f62b:1553:10908

and clicking LC (8a312991-81c6-426e-b14c-067bcbe5f62b:1553:10554) shows

async function LC(j, A) {
    if (typeof j == "string") {
        j.startsWith("./") && (j = new URL(j,import.meta.url).href);
        let t = await fetch(j);
        if (typeof WebAssembly.instantiateStreaming == "function")
            try {
                return await WebAssembly.instantiateStreaming(t, A)
            } catch (e) {
                if (t.headers.get("Content-Type") != "application/wasm")
                    console.warn(e);
                else
                    throw e
            }
        j = await t.arrayBuffer()
    }
    return await WebAssembly.instantiate(j, A)
}

For some background, I'm using parquet-wasm in an anywidget that is bundled with esbuild, following the suggestion from this discussion. Also, await pq.default() works properly if I use a CDN to obtain parquet-wasm like this

import * as pq from "https://unpkg.com/parquet-wasm@0.4.0-beta.5/esm/arrow2.js";

kylebarron commented 2 months ago

> when I run await pq.default(), I see this error
>
> widget.js:237 TypeError: Failed to construct 'URL': Invalid URL
>     at UP (542aeba4-c71d-47a1-8e67-2dcff8e6ca23:1553:14046)
>     at Object.oz [as render] (542aeba4-c71d-47a1-8e67-2dcff8e6ca23:1560:1458)
>     at async widget.js:363:17

If you look at the generated bindings, you can see

    if (typeof input === 'undefined') {
        input = new URL('parquet_wasm_bg.wasm', import.meta.url);
    }

in the __wbg_init function exported at the very end of the file. Presumably, your import.meta.url is not set correctly, so the new URL(...) constructor fails.
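
To illustrate how that constructor can throw: `new URL(relative, base)` needs a base it can resolve a relative path against. The sketch below is an assumption about what may be happening, not something confirmed in this thread. `canResolve` is a hypothetical helper and the base URLs are made-up examples. It shows resolution failing both when the base is `undefined` and when it is a `blob:` URL.

```javascript
// Hypothetical helper: report whether `relative` can be resolved against `base`.
function canResolve(relative, base) {
  try {
    new URL(relative, base);
    return true;
  } catch (err) {
    // The URL constructor throws a TypeError when resolution fails.
    return false;
  }
}

const wasmName = "parquet_wasm_bg.wasm";

// A normal http(s) module URL works as a base.
console.log(canResolve(wasmName, "http://localhost:8888/static/widget.js")); // true

// No base at all (for example, if import.meta.url is undefined).
console.log(canResolve(wasmName, undefined)); // false

// A blob: URL has an opaque path, so relative paths cannot be resolved against it.
console.log(canResolve(wasmName, "blob:http://localhost:8888/0b7e3c1a-1111-2222-3333-444455556666")); // false
```

Logging `import.meta.url` from inside the bundled module is a quick way to check which of these cases applies.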

> and clicking LC (8a312991-81c6-426e-b14c-067bcbe5f62b:1553:10554) shows
>
> async function LC(j, A) {
>     if (typeof j == "string") {
>         j.startsWith("./") && (j = new URL(j,import.meta.url).href);
>         let t = await fetch(j);
>         if (typeof WebAssembly.instantiateStreaming == "function")
>             try {
>                 return await WebAssembly.instantiateStreaming(t, A)
>             } catch (e) {
>                 if (t.headers.get("Content-Type") != "application/wasm")
>                     console.warn(e);
>                 else
>                     throw e
>             }
>         j = await t.arrayBuffer()
>     }
>     return await WebAssembly.instantiate(j, A)
> }

I can't find this function in the generated bindings in the latest bundler build. You should try one of the latest betas.

> Also, the await pq.default() function works properly if I use a CDN to obtain parquet-wasm like this
>
> import * as pq from "https://unpkg.com/parquet-wasm@0.4.0-beta.5/esm/arrow2.js";

So presumably import.meta.url isn't defined in Jupyter or something like that.
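
Because the generated init function only falls back to `import.meta.url` when `input` is `undefined`, one possible workaround is to pass the location of the `.wasm` file explicitly. This is a minimal sketch under that assumption: `initWithExplicitUrl` is a hypothetical helper, and the URL in the usage comment is a placeholder, not a real location.

```javascript
// Hypothetical helper: initialize a wasm-bindgen style ESM module from an
// explicit URL instead of relying on import.meta.url inside the bundled code.
async function initWithExplicitUrl(pq, wasmUrl) {
  // Parsing here fails early if `wasmUrl` is not an absolute URL.
  const absolute = new URL(wasmUrl);
  // `pq.default` is the init function exported by the ESM build.
  await pq.default(absolute);
  return pq;
}

// Usage, after importing the bundled module as `pq` (placeholder URL):
// await initWithExplicitUrl(pq, "https://example.com/path/to/module_bg.wasm");
```

This keeps the bundled JavaScript unmodified, but it still requires the `.wasm` file to be reachable over the network from wherever the notebook runs.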

cornhundred commented 2 months ago

Thanks @kylebarron, I was able to get it to work in the following way and would appreciate any advice:

I am using the 0.4.0-beta.5 version of parquet-wasm because I haven't migrated to the new API yet, so my dependencies in my package.json look like this:

"dependencies": {
    "deck.gl": "^9.0.5",
    "parquet-wasm": "0.4.0-beta.5",
    "apache-arrow": "15.0.2",
    "math.gl": "2.3.3",
    "@loaders.gl/core": "4.1.1"
},

Since parquet-wasm worked correctly with the file obtained from unpkg, I figured I would download that file (https://unpkg.com/parquet-wasm@0.4.0-beta.5/esm/arrow2.js), save it locally to /vendor/parquet-wasm/parquet-wasm_unpkg.js (along with the project licenses), and import it like this:

import * as pq from "./vendor/parquet-wasm/parquet-wasm_unpkg.js";
...

I was still getting the URL error, so I added a console.log to parquet-wasm_unpkg.js to print import.meta.url, which turned out to be the localhost address that hosts Jupyter. On my MacBook I was able to change the URL to 'files/js/vendor/parquet-wasm/arrow2_bg.wasm', and it loaded the file and ran without error (see below):

async function init(input) {
    // console.log('here in the parquet-wasm source code');

    // Use a fixed path for development. You may need to adjust this path based on your project's structure and where it's served from.
    // For example, if your server serves the `vendor` directory at the root, and `arrow2_bg.wasm` is within `vendor/parquet-wasm/`,
    // the path should reflect that.
    const fixedPath = 'files/js/vendor/parquet-wasm/arrow2_bg.wasm'; // Adjust this path as necessary.

    // js/vendor/parquet-wasm

    if (typeof input === 'undefined') {
        // Assume we're in a browser environment and construct the URL relative to the server's root.
        input = new URL(fixedPath, window.location.origin);
    }
    // console.log('WASM module will be loaded from:', input);

    const imports = getImports();

    if (typeof input === 'string' || (typeof Request === 'function' && input instanceof Request) || (typeof URL === 'function' && input instanceof URL)) {
        input = fetch(input);
    }

    initMemory(imports);

    const { instance, module } = await load(await input, imports);

    return finalizeInit(instance, module);
}
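
To show how the fixed path above resolves, this sketch applies the same `new URL(fixedPath, origin)` logic to two example page URLs. The host names and notebook paths are made up, and `resolveWasmUrl` is a hypothetical helper. The result depends only on the origin, so any path prefix under which the server is mounted is dropped.

```javascript
// Hypothetical helper mirroring `new URL(fixedPath, window.location.origin)`.
function resolveWasmUrl(fixedPath, pageUrl) {
  const origin = new URL(pageUrl).origin; // same value as window.location.origin
  return new URL(fixedPath, origin).href;
}

const fixedPath = "files/js/vendor/parquet-wasm/arrow2_bg.wasm";

// Server mounted at the root of the origin.
console.log(resolveWasmUrl(fixedPath, "http://localhost:8888/notebooks/demo.ipynb"));
// http://localhost:8888/files/js/vendor/parquet-wasm/arrow2_bg.wasm

// Server mounted under a prefix: the prefix is lost because only the origin is used.
console.log(resolveWasmUrl(fixedPath, "https://hub.example.org/user/alice/notebooks/demo.ipynb"));
// https://hub.example.org/files/js/vendor/parquet-wasm/arrow2_bg.wasm
```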

However, this did not work on Google Colab or Terra.bio, probably because we can't rely on Jupyter hosting files there. So I figured I would try to hardwire the WASM file into the JavaScript by converting it to a Base64 string. I saved this string to a file called wasmModuleBase64.js that looks like this:

export const wasmBase64 = `AGFzbQEAAAAB5 ...

and imported it into the init function on my local copy of parquet-wasm_unpkg.js

import { wasmBase64 } from './wasmModuleBase64.js';

async function init(input) {
    // No need to adjust the path, as we'll be loading the WASM from a Base64 string
    const imports = getImports();

    // Decode the Base64 string to get the binary representation
    const binaryString = window.atob(wasmBase64);
    const bytes = new Uint8Array(binaryString.length);
    for (let i = 0; i < binaryString.length; i++) {
        bytes[i] = binaryString.charCodeAt(i);
    }

    initMemory(imports);

    // Use the binary bytes to instantiate the WebAssembly module
    const { instance, module } = await WebAssembly.instantiate(bytes, imports);

    return finalizeInit(instance, module);
}
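
The encode and decode steps above can be sketched in isolation. This is an illustration, not the code used in the thread. The helper names are hypothetical, `Buffer` is Node.js only (for the build-time step), and the 8-byte empty module stands in for the real `.wasm` file.

```javascript
// Build the source text of a module like wasmModuleBase64.js from raw WASM bytes (Node.js).
function toBase64ModuleSource(wasmBytes) {
  const b64 = Buffer.from(wasmBytes).toString("base64");
  return `export const wasmBase64 = "${b64}";\n`;
}

// Decode a Base64 string back to bytes (browsers and Node.js 16+ provide atob).
function base64ToBytes(b64) {
  return Uint8Array.from(atob(b64), (ch) => ch.charCodeAt(0));
}

// Check for the "\0asm" magic number that starts every WASM binary.
function looksLikeWasm(bytes) {
  const magic = [0x00, 0x61, 0x73, 0x6d];
  return bytes.length >= 8 && magic.every((b, i) => bytes[i] === b);
}

// Smallest valid module: magic number plus version 1, no sections.
const emptyModule = new Uint8Array([0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00]);
const source = toBase64ModuleSource(emptyModule);
const decoded = base64ToBytes("AGFzbQEAAAA=");

console.log(source.trim());                 // export const wasmBase64 = "AGFzbQEAAAA=";
console.log(looksLikeWasm(decoded));        // true
console.log(WebAssembly.validate(decoded)); // true
```

In a real build step the bytes would come from reading the `.wasm` file from disk. Base64 uses four characters for every three bytes, so the embedded string is roughly a third larger than the binary.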

This approach seems to be working locally and on Google Colab and Terra.bio. Do you think this is a reasonable approach? If so, would it make sense to include the WASM code as a base64 string in the esm version?