Closed bjyberg closed 2 months ago
I think the issue here is that Arrow schema metadata and Parquet key-value file metadata are technically different concepts. And so I assume that the Rust parquet crate does not automatically write the table metadata onto the Parquet file.
Other libraries include Arrow table schema metadata onto the Parquet key-value metadata, so maybe we should do the same here when writing.
I am able to open and view the geo metadata key in pyarrow
Is this on the table schema or in the Parquet metadata? They're two different things.
The Parquet key-value metadata is accessible with pyarrow.parquet.read_metadata(...).metadata.get(b'geo'), while the Arrow schema is stored separately in the Parquet file and is accessible with pyarrow.parquet.read_schema(...).metadata.get(b'geo'). I'm guessing only the latter one exists in the Parquet file you're writing.
Thanks for such a quick response! And nice guess, I just checked and you're absolutely correct: only pyarrow.parquet.read_schema(...).metadata.get(b'geo') exists in the file I've written. So would the solution be to write the geo metadata to the Parquet metadata rather than the Arrow schema? Is that possible at the moment? Thanks again!
I think we just need to implement this method: https://github.com/kylebarron/parquet-wasm/blob/ef8ca3b296f21793330907c43498326242bd9443/src/writer_properties.rs#L145-L154
Can you test from this branch https://github.com/kylebarron/parquet-wasm/pull/503? There are developer docs here for building.
Usage should be something like:
import {
  WriterProperties,
  WriterPropertiesBuilder,
} from "./pkg/esm/parquet_wasm.js";

let props = new Map<string, string>();
props.set("geo", "...");
let writerProps = new WriterPropertiesBuilder()
  .setKeyValueMetadata(props)
  .build();
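For anyone unsure what the string stored under the "geo" key looks like, here is a rough sketch of building it. It is not part of parquet-wasm: the helper name `buildGeoKeyValueMetadata` is hypothetical, and the field names (`version`, `primary_column`, `columns`, `encoding`, `geometry_types`) are my reading of the required fields in the GeoParquet 1.0.0 spec, so please check them against the spec before relying on this.

```typescript
// Assumed minimal shape of the GeoParquet "geo" metadata value.
// Only the fields believed to be required by GeoParquet 1.0.0 are modeled.
interface GeoColumnMetadata {
  encoding: string; // e.g. "WKB"
  geometry_types: string[]; // an empty array is taken to mean "any geometry type"
}

interface GeoMetadata {
  version: string;
  primary_column: string;
  columns: Record<string, GeoColumnMetadata>;
}

// Hypothetical helper: returns a key-value map with a single "geo" entry.
function buildGeoKeyValueMetadata(
  geometryColumn: string,
  geometryTypes: string[] = [],
): Map<string, string> {
  const geo: GeoMetadata = {
    version: "1.0.0",
    primary_column: geometryColumn,
    columns: {
      [geometryColumn]: { encoding: "WKB", geometry_types: geometryTypes },
    },
  };
  // Key-value metadata values are strings, so the object is serialized as JSON.
  return new Map<string, string>([["geo", JSON.stringify(geo)]]);
}

// Example: a table whose WKB geometry column is named "geometry".
const geoProps = buildGeoKeyValueMetadata("geometry", ["Point"]);
```

The resulting map would then be passed to setKeyValueMetadata in place of the hand-built `props` above.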
Sorry for the delay - took a while to figure out how to build/run, etc. Good learning experience haha! It works perfectly, thanks Kyle! If I can help by contributing any documentation, etc. when it is merged into the main branch, let me know! Happy to close this now if you are :)
Awesome, good to hear! Ideally most people will be writing GeoParquet via the geoarrow-wasm set of tools, like @geoarrow/geoparquet-wasm.
const wkb = geos.geosGeomToWKB(geomPtr) // returns a WKB buffer
Having two sets of Wasm bundles is a lot of code for the user to download and means that you need to have memory copies out of one Wasm memory space and then into the other's.
But alas, for now, if it works that's good!
A doc example would be welcome! You can include markdown in the /// doc comments in the Rust code here: https://github.com/kylebarron/parquet-wasm/blob/ba3c1618b1a1ece8704b9682eb2e4ca81286a8f4/src/writer_properties.rs#L160
That gets copied into the generated .d.ts doc comments.
Awesome package, and thank you for all of the examples of using it with GeoParquet files on Observable - super useful! This is less of a bug/issue and more of a question. I am trying to figure out how to write a valid GeoParquet file with this package, and I am curious if you have any suggestions. Here is a quick example of what I am trying to do:
I've tested this in Observable with parquet-wasm versions: 0.4.0/arrow1, 0.5.0/arrow1, and 0.6.0-beta.3 (below example)
I am able to open and view the geo metadata key in pyarrow, but the 'geo' key is not recognized by GDAL, your geoparquet-wasm package (super excited for this btw!), GPQ from Planet, etc. The error is always along the lines of:
Error: General error: expected a 'geo' key in GeoParquet metadata
I'm wondering if the issue lies with how I am assigning the schema to the Arrow table? If so, maybe it isn't a question for here... I definitely lack in-depth knowledge of the Arrow, Parquet and GeoParquet formats, so I was hoping to get some expertise on your end, given your involvement with all of them! Is what I am attempting possible/worth doing? Or am I better off waiting for further development on the geoparquet-wasm side of things?