kylebarron / parquet-wasm

Rust-based WebAssembly bindings to read and write Apache Parquet data
https://kylebarron.dev/parquet-wasm/
Apache License 2.0

Write a valid geoparquet? #502

Closed bjyberg closed 2 months ago

bjyberg commented 2 months ago

Awesome package and thank you for all of the examples of using it with Geoparquets on Observable - super useful! This is less of a bug/issue and more of a question. I am trying to figure out how to write a valid geoparquet with this package, and I am curious if you have any suggestions. Here is a quick example of what I am trying to do:

I've tested this in Observable with parquet-wasm versions: 0.4.0/arrow1, 0.5.0/arrow1, and 0.6.0-beta.3 (below example)

const wkb = geos.geosGeomToWKB(geomPtr) // returns a WKB buffer

const rainDates = [new Date(Date.now())]

const rainAmounts = new Float32Array([24.5])

const name = arrow.vectorFromArray(["test_region"], new arrow.Utf8())

const arrow_table = arrow.tableFromArrays({
    precipitation: rainAmounts,
    date: rainDates,
    name: name,
    geometry: arrow.vectorFromArray([wkb], new arrow.Binary())
  });

arrow_table.schema.metadata.set( // Issue here maybe??? 
    "geo",
   `{"version": "1.0.0", "primary_column": "geometry", ....... }`) // I won't include the full thing for brevity

const pqTable = await parquet.Table.fromIPCStream(
    arrow.tableToIPC(arrow_table, "stream"));

pqTable.schema.metadata().get("geo") // This returns the metadata

const parquetUint8Array = parquet.writeParquet(pqTable);

I am able to open and view the geo metadata key in pyarrow, but the 'geo' key is not recognized by GDAL, your geoparquet-wasm package (super excited for this btw!), GPQ from planet, etc. The error is always along the lines of - Error: General error: expected a 'geo' key in GeoParquet metadata. I'm wondering if the issue lies with how I am assigning the schema to the arrow table? If so, maybe it isn't a question for here... I definitely lack in-depth knowledge of the arrow, parquet and geoparquet formats, so I was hoping to get some expertise on your end, given your involvement with all of them! Is what I am attempting possible/worth doing? Or am I better off waiting for further development on the geoparquet-wasm side of things?
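
For readers following along, the `geo` value elided in the example above could be assembled with `JSON.stringify` rather than a hand-written template string. This is a minimal sketch, assuming the GeoParquet 1.0.0 layout (`version`, `primary_column`, and a `columns` entry with `encoding` and `geometry_types`). `buildGeoMetadata` is a hypothetical helper, not part of parquet-wasm or apache-arrow, and real files usually carry more fields (for example `crs` or `bbox`) than shown here:

```typescript
// Hypothetical types describing a minimal subset of the GeoParquet "geo" metadata.
interface GeoColumnMetadata {
  encoding: "WKB";
  geometry_types: string[]; // an empty list means the geometry types are unknown
}

interface GeoMetadata {
  version: string;
  primary_column: string;
  columns: Record<string, GeoColumnMetadata>;
}

// Hypothetical helper: serialises minimal metadata for one WKB geometry column.
function buildGeoMetadata(
  primaryColumn: string,
  geometryTypes: string[] = [],
): string {
  const meta: GeoMetadata = {
    version: "1.0.0",
    primary_column: primaryColumn,
    columns: {
      [primaryColumn]: { encoding: "WKB", geometry_types: geometryTypes },
    },
  };
  return JSON.stringify(meta);
}

const geoJson = buildGeoMetadata("geometry", ["Polygon"]);
```

Building the string from an object avoids quoting mistakes that are easy to make in a long template literal.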

kylebarron commented 2 months ago

I think the issue here is that Arrow schema metadata and Parquet key-value file metadata are technically different concepts. And so I assume that the Rust parquet crate does not automatically write the table metadata onto the Parquet file.

Other libraries copy Arrow table schema metadata into the Parquet key-value metadata, so maybe we should do the same here when writing.

> I am able to open and view the geo metadata key in pyarrow

Is this on the table schema or the Parquet metadata? They're two different things.

The Parquet key-value file metadata is accessible with pyarrow.parquet.read_metadata(...).metadata.get(b'geo'), while the Arrow schema is stored separately in the Parquet file and is accessible with pyarrow.parquet.read_schema(...).metadata.get(b'geo'). I'm guessing only the latter exists in the Parquet file you're writing.
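
Since the two kinds of metadata are separate, one workaround is to copy the wanted entries out of the Arrow schema metadata into a map of their own, destined for the Parquet key-value metadata. This is a minimal sketch, assuming the Arrow schema metadata behaves like a `Map<string, string>` (as `arrow_table.schema.metadata` is used in the question). `schemaMetadataToKeyValue` is a hypothetical helper name, not an existing API:

```typescript
// Hypothetical helper: copies entries from Arrow schema metadata into a new map
// that can later be handed to a Parquet writer as key-value file metadata.
// If `keys` is omitted, every entry is copied.
function schemaMetadataToKeyValue(
  schemaMetadata: ReadonlyMap<string, string>,
  keys?: readonly string[],
): Map<string, string> {
  const wanted = keys ? new Set(keys) : undefined;
  const out = new Map<string, string>();
  schemaMetadata.forEach((value, key) => {
    if (wanted === undefined || wanted.has(key)) {
      out.set(key, value);
    }
  });
  return out;
}

// Stand-in for arrow_table.schema.metadata from the question.
const exampleSchemaMetadata = new Map<string, string>([
  ["geo", '{"version": "1.0.0", "primary_column": "geometry"}'],
  ["other", "not needed in the Parquet file metadata"],
]);

const keyValueMetadata = schemaMetadataToKeyValue(exampleSchemaMetadata, ["geo"]);
```

Restricting the copy to named keys keeps unrelated schema entries out of the file metadata; dropping the second argument copies everything.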

bjyberg commented 2 months ago

Thanks for such a quick response! And nice guess, I just checked and you're absolutely correct: only pyarrow.parquet.read_schema(...).metadata.get(b'geo') exists in the file I've written. So would the solution be to write the geo metadata to the Parquet key-value metadata rather than the Arrow schema? Is that possible at the moment? Thanks again!

kylebarron commented 2 months ago

I think we just need to implement this method: https://github.com/kylebarron/parquet-wasm/blob/ef8ca3b296f21793330907c43498326242bd9443/src/writer_properties.rs#L145-L154

kylebarron commented 2 months ago

Can you test from this branch https://github.com/kylebarron/parquet-wasm/pull/503? There are developer docs here for building.

Usage should be something like:

import {
  WriterProperties,
  WriterPropertiesBuilder,
} from "./pkg/esm/parquet_wasm.js";

let props = new Map<string, string>();
props.set("geo", "...");
let writerProps = new WriterPropertiesBuilder()
  .setKeyValueMetadata(props)
  .build();
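
The builder output then has to reach the writer. This is a minimal sketch, assuming `writeParquet` accepts the writer properties as an optional second argument (the thread itself only shows the one-argument call). `ParquetWasmLike` and `WriterPropertiesBuilderLike` are hypothetical stand-in interfaces that describe only the calls used here, so the snippet type-checks without the Wasm bundle, and `writeGeoParquet` is a hypothetical helper name:

```typescript
// Hypothetical structural types standing in for the parquet-wasm exports used below.
interface WriterPropertiesBuilderLike<P> {
  setKeyValueMetadata(metadata: Map<string, string>): WriterPropertiesBuilderLike<P>;
  build(): P;
}

interface ParquetWasmLike<T, P> {
  WriterPropertiesBuilder: new () => WriterPropertiesBuilderLike<P>;
  // Assumption: writer properties are passed as an optional second argument.
  writeParquet(table: T, writerProperties?: P): Uint8Array;
}

// Hypothetical helper: writes `table` with the given JSON string stored under
// the "geo" key of the Parquet key-value file metadata.
function writeGeoParquet<T, P>(
  parquet: ParquetWasmLike<T, P>,
  table: T,
  geoJson: string,
): Uint8Array {
  const metadata = new Map<string, string>([["geo", geoJson]]);
  const writerProps = new parquet.WriterPropertiesBuilder()
    .setKeyValueMetadata(metadata)
    .build();
  return parquet.writeParquet(table, writerProps);
}
```

With the real module this would be called along the lines of `writeGeoParquet(parquet, pqTable, geoJson)`, where `parquet` is the imported parquet-wasm module and `pqTable` is the table from the question.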

bjyberg commented 2 months ago

Sorry for the delay - took a while to figure out how to build/run, etc. Good learning experience haha! It works perfectly, thanks Kyle! If I can help by contributing any documentation, etc. when it is merged into the main branch, let me know! Happy to close this now if you are :)

kylebarron commented 2 months ago

Awesome, good to hear! Ideally most people will be writing GeoParquet via the geoarrow-wasm set of tools, like @geoarrow/geoparquet-wasm.

> const wkb = geos.geosGeomToWKB(geomPtr) // returns a WKB buffer

Having two sets of Wasm bundles is a lot of code for the user to download, and it means that data has to be copied out of one Wasm memory space and into the other.

But alas, for now, if it works that's good!

A doc example would be welcome! You can include markdown in the /// doc comments in the Rust code here: https://github.com/kylebarron/parquet-wasm/blob/ba3c1618b1a1ece8704b9682eb2e4ca81286a8f4/src/writer_properties.rs#L160

That gets copied into the generated .d.ts doc comments.