stac-utils / stac-geoparquet

Convert STAC items between JSON, GeoParquet, pgstac, and Delta Lake.
https://stac-utils.github.io/stac-geoparquet/
MIT License

Item and Collection parquet Models #24

Closed vincentsarago closed 1 year ago

vincentsarago commented 1 year ago

This should not be considered as an Issue but rather as a Discussion (Discussions are not enabled in this repo yet).

👋 @TomAugspurger, thanks for starting this tool. I'm personally interested in stac-geoparquet to create easily shareable files for large STAC Collections. My usual way of doing this is to create newline-delimited GeoJSON (https://github.com/vincentsarago/MAXAR_opendata_to_pgstac), but GeoParquet seems like a nice alternative and will also provide some simple query capability.

I've looked at the code and implemented a simplified version of a STAC-to-GeoParquet function. I say simplified because I really tried to minimize the data model, mostly by not creating a column for each of the item's properties.

from typing import Any, Dict

import json
import geopandas as gpd
import pandas as pd
import shapely

# From https://github.com/stac-utils/stac-geoparquet/blob/main/stac_geoparquet/utils.py#L60-L72
def fix_empty_multipolygon(
    item_geometry: Dict[str, Any]
) -> shapely.geometry.base.BaseGeometry:
    # Filter out missing geoms in MultiPolygons
    # https://github.com/shapely/shapely/issues/1407
    # geometry = [shapely.geometry.shape(x["geometry"]) for x in items2]
    if item_geometry["type"] == "MultiPolygon":
        item_geometry = dict(item_geometry)
        item_geometry["coordinates"] = [
            x for x in item_geometry["coordinates"] if any(x)
        ]

    return shapely.geometry.shape(item_geometry)

# Read items from a newline-delimited JSON file
with open("items_s3.json", "r") as f:
    items = [json.loads(line.rstrip()) for line in f.readlines()]

# Create A GeoParquet
geometry = []
for item in items:
    # Create geometries
    geometry.append(fix_empty_multipolygon(item.pop("geometry")))

    # Create `datetime`/`start_datetime`/`end_datetime` columns
    item["datetime"] = item["properties"]["datetime"]
    item["start_datetime"] = item["properties"].get("start_datetime")
    item["end_datetime"] = item["properties"].get("end_datetime")

    # TODO: Maybe split the bbox in minx/miny/maxx/maxY columns
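    # One possible approach (not implemented here; assumes a 2D bbox of [minx, miny, maxx, maxy]):
    # item["minx"], item["miny"], item["maxx"], item["maxy"] = item["bbox"]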

    # Convert Dict and List to String
    for k, v in item.items():
        if isinstance(v, (dict, list)):
            item[k] = json.dumps(v)

gdf = gpd.GeoDataFrame(items, geometry=geometry, crs="WGS84")
gdf.to_parquet("items.parquet")

In the snippet above ☝️ I'm creating columns for each STAC object property (not the item's properties) and adding columns for the datetime properties (to ease temporal filtering), but I'm then serializing all the List and Dict objects to strings.

Model:

| Column | Type |
| --- | --- |
| type | str |
| stac_version | str |
| id | str |
| properties | JSON str |
| datetime | date |
| start_datetime | date |
| end_datetime | date |
| links | JSON str |
| assets | JSON str |
| bbox | JSON str |
| stac_extensions | JSON str |
| collection | str |
| geometry | geom |
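
For reference, a minimal sketch of reading such a file back (assuming the `items.parquet` produced above); the JSON-encoded columns need an explicit `json.loads`, and since the datetime columns are ISO-8601 strings they can still be compared lexicographically for a simple temporal filter:

```python
import json

import geopandas as gpd

gdf = gpd.read_parquet("items.parquet")

# JSON-encoded columns have to be decoded back into Python objects
gdf["assets"] = gdf["assets"].apply(json.loads)

# Simple temporal filter (assumes `datetime` is populated on every item)
recent = gdf[gdf["datetime"] >= "2023-01-01T00:00:00Z"]
```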

I do the same for the collections:

with open("collections.json", "r") as f:
    collections = [json.loads(line.rstrip()) for line in f.readlines()]

# Create A GeoParquet
geometry = []
for collection in collections:
    # Create geometries from extent
    geometry.append(
        # The extent is in form of `global bbox`, `small bbox1`, `small bbox 2` , ....
        shapely.geometry.GeometryCollection(
            [
                shapely.geometry.Polygon.from_bounds(*bbox)
                for bbox in collection["extent"]["spatial"]["bbox"]
            ]
        )
    )

    interval = collection["extent"]["temporal"]["interval"]
    # Create `start_datetime`/`end_datetime` columns
    collection["start_datetime"] = collection["extent"]["temporal"]["interval"][0][0] if interval else None
    collection["end_datetime"] = collection["extent"]["temporal"]["interval"][0][1] if interval else None

    # Convert Dict and List to String
    for k, v in collection.items():
        if isinstance(v, (dict, list)):
            collection[k] = json.dumps(v)

gdf = gpd.GeoDataFrame(collections, geometry=geometry, crs="WGS84")
gdf.to_parquet("collections.parquet")

cc @kylebarron @gadomski

TomAugspurger commented 1 year ago

Can you expand a bit on the reason for JSON encoding the fields like links, assets, properties, etc.?

Working with these nested fields can be a pain, but it seems to me like the tooling around these nested dtypes is improving (https://github.com/pandas-dev/pandas/issues/54938, etc.). Being able to filter on, e.g. .properties.eo:cloud_cover at the parquet level seems pretty useful.
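
To illustrate, here's a rough sketch of that kind of filter with `pyarrow.dataset` (the file name and a struct-typed `properties` column are assumptions here, not what the current code writes):

```python
import pyarrow.compute as pc
import pyarrow.dataset as ds

# Assumes an items.parquet where `properties` is kept as a struct column
dataset = ds.dataset("items.parquet")

# Reference the nested field and apply the filter while reading
low_cloud = dataset.to_table(
    filter=pc.field("properties", "eo:cloud_cover") < 10
)
```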

TomAugspurger commented 1 year ago

BTW: I really like the idea of defining a data model (or multiple?), maybe as a pyarrow or parquet schema, for this type of data, and properly documenting it.

https://arrow.apache.org/blog/2023/04/11/our-journey-at-f5-with-apache-arrow-part-1/ and https://arrow.apache.org/blog/2023/06/26/our-journey-at-f5-with-apache-arrow-part-2/ are some pretty detailed blog posts on making a data model for OpenTelemetry data.

kylebarron commented 1 year ago

BTW: I really like the idea of defining a data model and properly documenting it.

💯 There's a STAC Sprint next week that I think this would fit into well. I'm a fan of having arrow-native/parquet-native types for stac-geoparquet. Maybe I'll try to write up a "mini spec" for this, with a read and write implementation using pyarrow in Python, maybe as a PR here? I think using pyarrow directly is likely to have a lot better control over the exact representation, and especially should make it easier to dictionary-encode specific string columns, which should save a ton of memory.
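
For a sense of what that could look like, a minimal sketch with a pyarrow schema that dictionary-encodes a low-cardinality string column (the column names and sample values are just placeholders):

```python
from datetime import datetime, timezone

import pyarrow as pa
import pyarrow.parquet as pq

# `collection` repeats the same handful of values, so dictionary-encode it
schema = pa.schema(
    [
        ("id", pa.string()),
        ("collection", pa.dictionary(pa.int32(), pa.string())),
        ("datetime", pa.timestamp("us", tz="UTC")),
    ]
)

table = pa.table(
    {
        "id": ["item-1", "item-2"],
        "collection": ["sentinel-2-l2a", "sentinel-2-l2a"],
        "datetime": [
            datetime(2023, 1, 1, tzinfo=timezone.utc),
            datetime(2023, 1, 2, tzinfo=timezone.utc),
        ],
    },
    schema=schema,
)
pq.write_table(table, "items.parquet")
```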

TomAugspurger commented 1 year ago

I think using pyarrow directly is likely to have a lot better control over the exact representation, and especially should make it easier to dictionary-encode specific string columns, which should save a ton of memory

+1 to using arrow directly (and then somehow adding the geoarrow metadata). There are a few places where we have to fix up issues with object-dtype ndarrays we're getting from pandas.

If we do need geopandas for anything, then we can explore pandas' new-ish support for arrow-backed arrays.
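
As a quick sketch of that option (assuming pandas >= 2.0 and an items.parquet written via pyarrow), reading with the pyarrow dtype backend keeps columns arrow-backed instead of object-dtype:

```python
import pandas as pd

# Columns come back as ArrowDtype instead of NumPy object arrays
df = pd.read_parquet("items.parquet", dtype_backend="pyarrow")
print(df.dtypes)
```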