kylebarron / parquet-wasm

Rust-based WebAssembly bindings to read and write Apache Parquet data
https://kylebarron.dev/parquet-wasm/
Apache License 2.0

No functioning example #489

Closed v4lue4dded closed 2 months ago

v4lue4dded commented 2 months ago

I just spent an entire day trying to get parquet-wasm to read a parquet file and console.log() the result, and couldn't get it done. Admittedly I'm a Python programmer and new to JavaScript.

However, as far as I could tell, none of the examples currently in the README.md work out of the box.

This is very unfortunate: since this is a JavaScript library, it should be possible to run a functioning example right in the GitHub Pages of a repo. (Not necessarily this repo, just some example repo with code that already runs.)

Something similar to https://hyparam.github.io/hyparquet/ would go a long way toward making this library more user friendly to people like me.

For now I will be giving up on this library, since I cannot get it to work in a reasonable amount of time.

kylebarron commented 2 months ago

These two Observable examples are online, reproducible examples: https://github.com/kylebarron/parquet-wasm#published-examples

v4lue4dded commented 2 months ago

Thank you for the reply :)

I did see the Observable examples, but I admittedly found the platform clunky and unintuitive to use, and it felt like it was hiding a lot of code from me.

By now I have figured out how to download the code from the example. However, from what I can tell it is not plain JavaScript at work, but some proprietary wrapper around the JavaScript that I can't really replicate.

//...

function _d3(require){return(
require("https://d3js.org/d3.v5.min.js")
)}

function _mapboxgl(require){return(
require("mapbox-gl@1.6.0/dist/mapbox-gl.js")
)}

function _arrow(require){return(
require("apache-arrow")
)}

function _deck(require){return(
require.alias({
  h3: {}
})("deck.gl@8.9/dist.min.js")
)}

function _deckgl(mapContainer,deck,mapboxgl)
{
  // This is an Observable hack: clear previously generated content
  mapContainer.innerHTML = "";

  return new deck.DeckGL({
    // The HTML container to render into
    container: mapContainer,
    map: mapboxgl,
    mapStyle:
      "https://basemaps.cartocdn.com/gl/positron-nolabels-gl-style/style.json",

    // Viewport settings
    initialViewState: {
      longitude: 0,
      latitude: 15,
      zoom: 1,
      pitch: 0,
      bearing: 0
    },
    controller: true
  });
}

export default function define(runtime, observer) {
  const main = runtime.module();
  function toString() { return this.url; }
  const fileAttachments = new Map([
    ["2019-01-01_performance_mobile_tiles_centroids_brotli@2.parquet", {url: new URL("./files/ad0a1f0e7e5cc8290068443d99bbd1307877e1ba631e30622bbd5fd8adca660d2644fe8181db5dbd8d41be0c2eae868304deeb0efc8690d373553dcb859bc767.bin", import.meta.url), mimeType: "application/octet-stream", toString}]
  ]);
  main.builtin("FileAttachment", runtime.fileAttachments(name => fileAttachments.get(name)));
  main.variable(observer()).define(["md"], _1);
  main.variable(observer()).define(["md"], _2);
  main.variable(observer()).define(["md"], _3);
  main.variable(observer()).define(["md"], _4);
  main.variable(observer()).define(["md"], _5);
  main.variable(observer()).define(["md"], _6);
  main.variable(observer()).define(["md"], _7);
  main.variable(observer()).define(["md"], _8);
  main.variable(observer("viewof form")).define("viewof form", ["Inputs"], _form);
  main.variable(observer("form")).define("form", ["Generators", "viewof form"], (G, _) => G.input(_));
  main.variable(observer("mapContainer")).define("mapContainer", ["html"], _mapContainer);
  main.variable(observer("metricMapping")).define("metricMapping", _metricMapping);
  main.variable(observer("readParquet")).define("readParquet", _readParquet);
  main.variable(observer("arrowTable")).define("arrowTable", ["parquetFile","readParquet","arrow"], _arrowTable);
  main.variable(observer("parquetFile")).define("parquetFile", ["FileAttachment"], _parquetFile);
  main.variable(observer("geometryColumn")).define("geometryColumn", ["arrowTable"], _geometryColumn);
  main.variable(observer("flatCoordinateArray")).define("flatCoordinateArray", ["geometryColumn"], _flatCoordinateArray);
  main.variable(observer("layer")).define("layer", ["arrowTable","flatCoordinateArray","colorAttribute","deck","deckgl"], _layer);
  main.variable(observer("colorAttribute")).define("colorAttribute", ["metricMapping","form","arrowTable","colorScale"], _colorAttribute);
  main.variable(observer("colorScale")).define("colorScale", ["d3","form"], _colorScale);
  main.variable(observer("d3")).define("d3", ["require"], _d3);
  main.variable(observer("mapboxgl")).define("mapboxgl", ["require"], _mapboxgl);
  main.variable(observer("arrow")).define("arrow", ["require"], _arrow);
  main.variable(observer("deck")).define("deck", ["require"], _deck);
  main.variable(observer("deckgl")).define("deckgl", ["mapContainer","deck","mapboxgl"], _deckgl);
  return main;
}

I'll probably try again next weekend to unwrap that code to see if I can get it working for my project.

Both examples do seem to use outdated versions of the library though: https://observablehq.com/@bmschmidt/hello-parquet-wasm uses https://unpkg.com/parquet-wasm@0.1.1/web.js, which seems like a very early version, and https://observablehq.com/@kylebarron/geoparquet-on-the-web uses https://unpkg.com/parquet-wasm@0.4.0-beta.5/esm/arrow2.js, which is no longer recommended since it is the arrow2 build, if I understand things correctly.

It would just have been very useful to a javascript beginner like me to have a very simple example on github pages that uses the currently recommended version of the library to simply read a complete parquet file (either a small example from the github repo or a drop in file) and displays the result on screen. That would be a lot easier for me to iterate from.
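For the drop-in file case, the first step is turning the selected file into the `Uint8Array` that a reader function expects. Below is a small sketch of that step only. The helper names `blobToBytes` and `looksLikeParquet` are hypothetical and not part of parquet-wasm. The check relies on the Parquet format's 4-byte `PAR1` magic at the start and end of a file (files with an encrypted footer use a different magic), so treat it as a quick sanity check rather than validation.

```javascript
// Hypothetical helper: read a File/Blob (e.g. from <input type="file">) into bytes.
async function blobToBytes(blob) {
  return new Uint8Array(await blob.arrayBuffer());
}

// Quick sanity check: Parquet files begin and end with the ASCII magic "PAR1".
// Smallest plausible file: magic (4) + footer length (4) + magic (4) = 12 bytes.
function looksLikeParquet(bytes) {
  const MAGIC = [0x50, 0x41, 0x52, 0x31]; // "PAR1"
  if (bytes.length < 12) return false;
  const tail = bytes.length - 4;
  return MAGIC.every((b, i) => bytes[i] === b && bytes[tail + i] === b);
}

// Possible browser usage (not executed here):
// input.addEventListener("change", async (event) => {
//   const bytes = await blobToBytes(event.target.files[0]);
//   console.log(looksLikeParquet(bytes) ? "looks like parquet" : "not a parquet file?");
// });
```

A failed check often means the bytes are something else entirely, for example an HTML error page returned instead of the file.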

kylebarron commented 2 months ago

> which is no longer recommended since it is the arrow2 build, if I understand things correctly

The arrow2 API is deprecated and won't receive updates, but it should still work. The API of the latest beta is very similar to the previous API though.

> It would just have been very useful to a javascript beginner like me to have a very simple example on github pages that uses the currently recommended version of the library to simply read a complete parquet file (either a small example from the github repo or a drop in file) and displays the result on screen. That would be a lot easier for me to iterate from.

I agree that would be nice, but I don't have time to create a standalone example at this point. Contributions (from you or someone else) would be welcome.

I generally find that the easiest way to get started is to follow the type hints on each function; they show what each function expects and returns, which guides how to fetch and pass in the data.

kylebarron commented 2 months ago

In case it's useful to you, I'm using this in production here: https://github.com/developmentseed/lonboard/blob/dca942da9b5bd40769068a76c45e76c9b1c9c49c/src/parquet.ts

kylebarron commented 2 months ago

I published 0.6.0, added new content to the README, and updated https://observablehq.com/@kylebarron/geoparquet-on-the-web to use parquet-wasm 0.6. Hopefully this is easier to follow.

mbostock commented 2 months ago

This should work in vanilla JavaScript:

import initParquetWasm, {readParquet} from "https://cdn.jsdelivr.net/npm/parquet-wasm@0.6.0/+esm";

await initParquetWasm("https://cdn.jsdelivr.net/npm/parquet-wasm@0.6.0/esm/parquet_wasm_bg.wasm");

(Unfortunately the default path to parquet_wasm_bg.wasm doesn’t work when using /+esm because it resolves to the wrong directory. I think it’s possible that it would work if you used import.meta.resolve instead of new URL(…, import.meta.url), but I’m not sure whether jsDelivr will rewrite import.meta.resolve calls to fix the relative path when using /+esm.)

kylebarron commented 2 months ago

It does work for me (at least in Deno) with

import initParquetWasm, {readParquet} from "https://cdn.jsdelivr.net/npm/parquet-wasm@0.6.0/esm/parquet_wasm.js";
await initParquetWasm();

I don't know if it's possible to rewrite the import with +esm. I specifically enabled that path as a known entry point so that import "parquet-wasm/esm/parquet_wasm.js" would work both in an application and from a browser. https://github.com/kylebarron/parquet-wasm/blob/09bc32e9b0cc2a44fd55dc7990f594fbaa08988b/templates/package.json#L37-L40

> I think it’s possible that it would work if you used import.meta.resolve instead of new URL(…, import.meta.url)

That part is auto-generated by wasm-bindgen, so it's not something easy for me to change.

mbostock commented 2 months ago

Yes, that would work too. The /+esm is nice because it bundles and minifies local imports, so the module publisher (you) typically doesn’t have to build and publish the bundle; the CDN does it.

It also works if you do this:

import initParquetWasm, {readParquet} from "https://cdn.jsdelivr.net/npm/parquet-wasm@0.6.0/esm/+esm";

await initParquetWasm();

This uses your ./esm entry point, and because it’s in the same folder as the source file, the relative path to the .wasm file works.
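The difference between the two CDN URLs comes down to how a relative path is resolved against the module's own URL. Here is a minimal sketch, assuming the generated loader resolves the file name `parquet_wasm_bg.wasm` (taken from the explicit URL earlier in the thread) with `new URL(…, import.meta.url)`. The helper name `resolveWasmUrl` is hypothetical.

```javascript
// Hypothetical helper: mimics `new URL(fileName, import.meta.url)` for a given module URL.
function resolveWasmUrl(moduleUrl, fileName = "parquet_wasm_bg.wasm") {
  return new URL(fileName, moduleUrl).href;
}

const base = "https://cdn.jsdelivr.net/npm/parquet-wasm@0.6.0";

// Package-root bundle: relative resolution drops the last path segment ("+esm"),
// so the result points at the package root instead of the esm/ folder.
const fromRoot = resolveWasmUrl(`${base}/+esm`);

// Bundle of the ./esm entry point: the result stays inside esm/, next to the .wasm file.
const fromEsm = resolveWasmUrl(`${base}/esm/+esm`);

console.log(fromRoot); // https://cdn.jsdelivr.net/npm/parquet-wasm@0.6.0/parquet_wasm_bg.wasm
console.log(fromEsm);  // https://cdn.jsdelivr.net/npm/parquet-wasm@0.6.0/esm/parquet_wasm_bg.wasm
```

Only the second result matches the explicit .wasm URL passed to initParquetWasm earlier in the thread, which is consistent with the default path working for /esm/+esm but not for /+esm.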

I would consider using import.meta.resolve instead of import.meta.url though, as it’s the more semantic way of resolving a relative resource.

Also, I think you’ll want to add the .wasm to your exports map in the package.json because these files are part of your module’s public API and you expect people to load them.

kylebarron commented 2 months ago

Thanks for the tips!

> Yes, that would work too. The /+esm is nice because it bundles and minifies local imports, so the module publisher (you) typically doesn’t have to build and publish the bundle; the CDN does it.

Oh very cool. I probably should suggest that from the README.

> I would consider using import.meta.resolve instead of import.meta.url though, as it’s the more semantic way of resolving a relative resource.

I see. That makes sense. MDN does say

> you should use import.meta.resolve(moduleName) instead of new URL(moduleName, import.meta.url) for these use cases wherever possible

I'll make an issue in wasm-bindgen tomorrow.

> Also, I think you’ll want to add the .wasm to your exports map in the package.json because these files are part of your module’s public API and you expect people to load them.

Thanks for pointing this out. I see duckdb-wasm does this too. https://github.com/duckdb/duckdb-wasm/blob/58fcb9a46b73eac1abb9b0dee9d7c46d1a84f628/packages/duckdb-wasm/package.json#L99-L101

v4lue4dded commented 2 months ago

> In case it's useful to you, I'm using this in production here: https://github.com/developmentseed/lonboard/blob/dca942da9b5bd40769068a76c45e76c9b1c9c49c/src/parquet.ts

@kylebarron FYI: I got it working a week ago with that code snippet, sorry that I hadn't answered yet!! Thanks for that!! I had to use the webpack bundler though, which was a bit of a step for me. ^^

Do I understand it correctly that (https://github.com/kylebarron/parquet-wasm/issues/489#issuecomment-2068228673) means it would work without working with a bundler, just with a cdn.jsdelivr.net import? :)

That would be really cool!!

kylebarron commented 2 months ago

> it would work without working with a bundler, just with a cdn.jsdelivr.net import? :)

Yes. But you need to ensure you manually initialize the wasm code, whereas with the bundler entry point the wasm should be initialized behind the scenes I think.
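A minimal sketch of the bundler-free flow described here: initialize the wasm once, then fetch and read a file. Several parts are assumptions rather than confirmed API: the helper names `createParquetReader` and `loadFromCdn` are hypothetical, the apache-arrow CDN URL is a guess, and the `intoIPCStream()` / `tableFromIPC` conversion reflects my understanding of the 0.6 README. The parquet-wasm import URL is the one from the comments above.

```javascript
// Hypothetical helper: wraps the library functions so wasm init runs only once.
// `deps` must provide: initWasm(), readParquet(Uint8Array), tableFromIPC(stream).
function createParquetReader(deps) {
  let ready = null; // cached init promise, shared by all calls

  return async function readRows(url, fetchImpl = fetch) {
    // Manual initialization is needed with the ESM/CDN entry point.
    if (!ready) ready = Promise.resolve(deps.initWasm());
    await ready;

    const resp = await fetchImpl(url);
    if (!resp.ok) throw new Error(`Failed to fetch ${url}: ${resp.status}`);
    const bytes = new Uint8Array(await resp.arrayBuffer());

    // Assumption: readParquet returns a wasm-side table that is converted
    // to an Arrow JS table through an IPC stream.
    const wasmTable = deps.readParquet(bytes);
    const table = deps.tableFromIPC(wasmTable.intoIPCStream());
    return table.toArray();
  };
}

// Wiring with CDN imports (browser or Deno). Defined only, not called here.
async function loadFromCdn() {
  const parquet = await import("https://cdn.jsdelivr.net/npm/parquet-wasm@0.6.0/esm/+esm");
  const arrow = await import("https://cdn.jsdelivr.net/npm/apache-arrow/+esm"); // assumed URL
  return createParquetReader({
    initWasm: parquet.default,
    readParquet: parquet.readParquet,
    tableFromIPC: arrow.tableFromIPC,
  });
}
```

Inside a `<script type="module">`, something like `const readRows = await loadFromCdn(); console.log(await readRows("data.parquet"));` would then log the rows, where `data.parquet` is a placeholder path. Passing the library functions in as `deps` keeps the reading logic independent of how the modules are loaded, so the same function should also work with a bundler import.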

I made a PR to update the jsdelivr link in the readme, and made new issues for the other comments above. So I think this issue can be closed.