Closed vincentsarago closed 1 year ago
Can you expand a bit on the reason for JSON encoding the fields like links, assets, properties, etc.?
Working with these nested fields can be a pain, but it seems to me like the tooling around these nested dtypes is improving (https://github.com/pandas-dev/pandas/issues/54938, etc.). Being able to filter on, e.g., `.properties.eo:cloud_cover` at the parquet level seems pretty useful.
BTW: I really like the idea of defining a data model (or multiple?), maybe as a pyarrow or parquet schema, for this type of data, and properly documenting it.
https://arrow.apache.org/blog/2023/04/11/our-journey-at-f5-with-apache-arrow-part-1/ and https://arrow.apache.org/blog/2023/06/26/our-journey-at-f5-with-apache-arrow-part-2/ are some pretty detailed blog posts on making a data model for OpenTelemetry data.
> BTW: I really like the idea of defining a data model and properly documenting it.
💯 there's a STAC Sprint next week that I think this would fit into well. I'm a fan of having arrow-native/parquet-native types for stac-geoparquet. Maybe I'll try to write up a "mini spec" for this with a read and write implementation using pyarrow in Python? Maybe as a PR here? I think using pyarrow directly is likely to give a lot better control over the exact representation, and especially should make it easier to dictionary-encode specific string columns, which should save a ton of memory.
> I think using pyarrow directly is likely to give a lot better control over the exact representation, and especially should make it easier to dictionary-encode specific string columns, which should save a ton of memory
+1 to using arrow directly (and then somehow adding the geoarrow metadata). There are a few places where we have to fix up issues with object-dtype ndarrays we're getting from pandas.
If we do need geopandas for anything, then we can explore pandas' new-ish support for arrow-backed arrays.
This should not be considered an Issue but a Discussion (Discussions are not enabled in this repo yet).
👋 @TomAugspurger, thanks for starting this tool. I'm personally interested in stac-geoparquet to create easily shareable files for large STAC Collections. My usual approach is to create newline-delimited GeoJSON (https://github.com/vincentsarago/MAXAR_opendata_to_pgstac), but GeoParquet seems to be a nice alternative and will also provide some simple query capability.
I've looked at the code and implemented a simplified version of a STAC to GeoParquet function. I say simplified because I really tried to minimize the data model, mostly by not creating columns for the `properties` properties.

In ☝️ I'm creating columns for each STAC object property (not the item properties) and creating columns for the datetime properties (to ease temporal filtering). But then I'm creating `string` values for all the List and Dict objects. I do the same for the collection.
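The simplified approach described here can be sketched with the standard library alone. This is not the actual function from the comment; `item_to_row` and the list of datetime keys are hypothetical names chosen for illustration, and geometry handling is left out.

```python
import json
from datetime import datetime

# Assumed set of datetime-like properties promoted to their own columns.
DATETIME_KEYS = ("datetime", "start_datetime", "end_datetime")


def item_to_row(item: dict) -> dict:
    """Flatten one STAC item dict into a single flat row (hypothetical helper)."""
    row = {}
    for key, value in item.items():
        if isinstance(value, (list, dict)):
            # Lists and dicts (links, assets, properties, ...) become JSON strings.
            row[key] = json.dumps(value)
        else:
            # Scalars such as id, type, collection stay as plain columns.
            row[key] = value
    # Promote the datetime properties to real datetimes to ease temporal filtering.
    properties = item.get("properties", {})
    for key in DATETIME_KEYS:
        raw = properties.get(key)
        row[key] = datetime.fromisoformat(raw.replace("Z", "+00:00")) if raw else None
    return row


example_item = {
    "id": "item-a",
    "collection": "example-collection",
    "properties": {"datetime": "2023-06-01T10:00:00Z", "eo:cloud_cover": 5},
    "links": [{"rel": "self", "href": "https://example.com/item-a"}],
}
row = item_to_row(example_item)
```

The trade-off is that anything left inside the JSON strings, such as `eo:cloud_cover`, has to be decoded before it can be filtered on.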
cc @kylebarron @gadomski