stac-utils / stac-geoparquet

Convert STAC items between JSON, GeoParquet, pgstac, and Delta Lake.
https://stac-utils.github.io/stac-geoparquet/
MIT License

to_geodataframe error: All arrays must be of the same length #76

Open floriandeboissieu opened 3 weeks ago

floriandeboissieu commented 3 weeks ago

The items of an ItemCollection may not all share the same property keys. An example:

import pystac_client
import stac_geoparquet

catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
)

search = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=[6.5425, 47.9044, 6.5548, 47.9091],
    datetime="2024-07-20/2024-08-11",
    query={"eo:cloud_cover": {"lt": 30.}},
    sortby="datetime",
)

coll = search.item_collection()

print(set(coll[0].properties.keys()).symmetric_difference(coll[1].properties.keys()))
# {'s2:dark_features_percentage'} # this property is missing in coll[1:3], due to a different processing baseline (05.11 instead of 05.10)

records = coll.to_dict()["features"]
stac_geoparquet.to_geodataframe(records)
# *** ValueError: All arrays must be of the same length

In stac-geoparquet <= 0.3.2, the geodataframe was built from a list of dicts, which introduced NaN wherever a property was missing. Since commit fb798f4 (included in version 0.4.0+), the geodataframe is built from a dict of lists (presumably for speed), so a property missing from any item makes the operation fail at line 177 with: ValueError: All arrays must be of the same length
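The difference is easy to reproduce with plain pandas (a minimal illustration, not stac-geoparquet code):

import pandas as pd

records = [{"a": 1, "b": 2}, {"a": 3}]   # second record lacks "b"

pd.DataFrame(records)                     # list of dicts: missing key becomes NaN
pd.DataFrame({"a": [1, 3], "b": [2]})     # dict of lists: ValueError: All arrays must be of the same length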

As an ItemCollection cannot guarantee that all properties are shared by all items (or am I wrong about that?):

TomAugspurger commented 3 weeks ago

As an ItemCollection cannot guarantee that all properties are shared by all items (or am I wrong about that?):

That's correct in general. That said, stac-geoparquet is mainly focused on the use case where items in a collection are homogeneous.

Just to confirm: it's stac_geoparquet.to_geodataframe that's raising the error? You might want to try the newer stac_geoparquet.arrow methods, which will probably be more robust:

In [41]: rbr = stac_geoparquet.arrow.parse_stac_items_to_arrow(coll)

In [42]: df = rbr.read_pandas(types_mapper=pd.ArrowDtype)

In [43]: df
Out[43]:
                                              assets                                               bbox      collection  ... s2:water_percentage sat:orbit_state sat:relative_orbit
0  {'AOT': {'gsd': 10.0, 'href': 'https://sentine...  {'xmin': 6.2806415, 'ymin': 47.7341556, 'xmax'...  sentinel-2-l2a  ...            0.171997       ascending                  8
1  {'AOT': {'gsd': 10.0, 'href': 'https://sentine...  {'xmin': 6.2806415, 'ymin': 47.7341556, 'xmax'...  sentinel-2-l2a  ...            0.556677      descending                108
2  {'AOT': {'gsd': 10.0, 'href': 'https://sentine...  {'xmin': 5.6670094, 'ymin': 47.6908511, 'xmax'...  sentinel-2-l2a  ...            0.190585      descending                108

[3 rows x 42 columns]
floriandeboissieu commented 3 weeks ago

Just to confirm, you're using stac_geoparquet.to_geodataframe that's raising the error?

That's right, I just updated the example to be clear.

You might want to try the newer stac_geoparquet.arrow methods, which will probably be more robust

That works well with missing properties (replaced by <NA>) and shows more specific dtypes in the pandas output:

assets                                     struct<AOT: struct<gsd: double, href: string, ...
bbox                                       struct<xmin: double, ymin: double, xmax: doubl...
collection                                                                   string[pyarrow]
geometry                                                                     binary[pyarrow]
id                                                                           string[pyarrow]
links                                      list<item: struct<href: string, rel: string, t...
...
s2:water_percentage                                                          double[pyarrow]
sat:orbit_state                                                              string[pyarrow]
sat:relative_orbit                                                            int64[pyarrow]
tilename                                                                     string[pyarrow]
dtype: object

However, df is a pandas.DataFrame.

In order to get a geopandas.GeoDataFrame:

import geopandas as gpd

gs = gpd.GeoSeries.from_wkb(df.geometry)  # decode the WKB geometry column
gdf = gpd.GeoDataFrame(df, geometry=gs)   # keeps pyarrow types for the other columns
gdf.dtypes

Out[80]:
assets                                     struct<AOT: struct<gsd: double, href: string, ...
bbox                                       struct<xmin: double, ymin: double, xmax: doubl...
collection                                                                   string[pyarrow]
geometry                                                                            geometry
id                                                                           string[pyarrow]

# an alternative that converts the `timestamp` type to `datetime64` and `<NA>` to `NaN`,
# but leaves other pyarrow-specific types (string, struct, list) as generic `object`:
rbr = stac_geoparquet.arrow.parse_stac_items_to_arrow(coll)
gdf = gpd.GeoDataFrame.from_arrow(rbr)
gdf.dtypes

Out[81]:
assets                                                  object
bbox                                                    object
collection                                              object
geometry                                              geometry
id                                                      object
...
datetime                                   datetime64[us, UTC]
...

Is there any reason not to change stac_geoparquet.to_geodataframe to those three or four lines instead of the current code?

That would certainly make it more robust and easier to maintain, wouldn't it?
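Concretely, the replacement could be as small as this (a sketch wrapping the snippet above in a function; the name to_geodataframe_via_arrow is illustrative, not an existing API):

import geopandas as gpd
import pandas as pd
import stac_geoparquet

def to_geodataframe_via_arrow(items):
    # parse the items to Arrow, read into pandas with pyarrow dtypes,
    # then decode the WKB geometry column
    rbr = stac_geoparquet.arrow.parse_stac_items_to_arrow(items)
    df = rbr.read_pandas(types_mapper=pd.ArrowDtype)
    gs = gpd.GeoSeries.from_wkb(df.geometry)
    return gpd.GeoDataFrame(df, geometry=gs)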

TomAugspurger commented 3 weeks ago

Yeah, the implementation in stac_geoparquet.arrow is newer (and better). I just don't think anyone has gotten around to proposing a deprecation path for changing to_geodataframe to use the pyarrow implementation.
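For illustration, such a deprecation path could start with a warning in the old function (a sketch only, nothing concrete has been proposed):

import warnings

def to_geodataframe(records):
    warnings.warn(
        "to_geodataframe will be reimplemented on top of stac_geoparquet.arrow "
        "in a future release; output dtypes may change",
        FutureWarning,
        stacklevel=2,
    )
    ...  # existing implementation unchanged for now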

floriandeboissieu commented 2 weeks ago

For the record, my perf tests show that the old way is about 10x faster than the new one:

import pystac_client
import stac_geoparquet
from time import time
from timeit import timeit
import pandas as pd
import geopandas as gpd

catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
)

search = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=[6.5425, 47.9044, 6.5548, 47.9091],
    datetime="2018-01-01/2024-07-20", # starting from 2024-08-11, 's2:dark_features_percentage' was removed
    sortby="datetime",
)

coll = search.item_collection()

print(set(coll[0].properties.keys()).symmetric_difference(coll[-1].properties.keys()))
# {'s2:dark_features_percentage'} # this property is missing in the most recent items, due to a different processing baseline (05.11 instead of 05.10)

nb_runs = 10
#### stac-geoparquet 0.3.2 fills missing properties with NaN
#### stac-geoparquet > 0.3.2 raises an error if properties are missing in some items
def fun1(coll):
    records = coll.to_dict()["features"]
    gdf = stac_geoparquet.to_geodataframe(records)
    return gdf

#### stac-geoparquet 0.6.0
def fun2(coll):
    rbr = stac_geoparquet.arrow.parse_stac_items_to_arrow(coll)
    df = rbr.read_pandas(types_mapper=pd.ArrowDtype)
    gs = gpd.GeoSeries.from_wkb(df.geometry)
    gdf = gpd.GeoDataFrame(df, geometry=gs) # keeps pyarrow types
    return gdf

duration = timeit("fun1(coll)", globals=globals(), number=nb_runs)
avg_duration = duration/nb_runs
print(f'fun1: {avg_duration:.4f} seconds')

if stac_geoparquet.__version__ >= '0.6.0':
    duration = timeit("fun2(coll)", globals=globals(), number=nb_runs)
    avg_duration = duration/nb_runs
    print(f'fun2: {avg_duration:.4f} seconds')

print(f"Item-Collection size: {len(coll)}")

# fun1: 0.1396 seconds
# fun2: 1.1298 seconds
# Item-Collection size: 1868
kylebarron commented 2 weeks ago

It would be interesting to time just

rbr = stac_geoparquet.arrow.parse_stac_items_to_arrow(coll)
table = rbr.read_all()

I.e., which part of that is slow: the creation of the Arrow data, the conversion to pandas, or the parsing of geometries?

It's also important to note that the newer version is doing a lot more than the original version. For one, it scans the entire input data first to infer a strict schema that represents the full input dataset. It also does a few more transformations to match the latest stac-geoparquet spec.
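If the schema is already known, it should be possible to skip that inference scan by passing it in (a sketch assuming the schema keyword available in recent stac-geoparquet releases; check the signature in your installed version):

import stac_geoparquet

# first conversion: the schema is inferred from the full input
rbr = stac_geoparquet.arrow.parse_stac_items_to_arrow(coll)
schema = rbr.schema

# later conversions of similar items can reuse it and skip the scan
rbr2 = stac_geoparquet.arrow.parse_stac_items_to_arrow(coll, schema=schema)
table = rbr2.read_all()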

floriandeboissieu commented 1 week ago

@kylebarron, adding the following to the if statement of the previous post (fun3 to fun6 split fun2 into its successive steps) shows that most of the processing time is due to parse_stac_items_to_arrow:

    duration = timeit("fun3(coll)", globals=globals(), number=nb_runs)
    avg_duration = duration/nb_runs
    print(f'fun3: {avg_duration:.4f} seconds')

    rbr = fun3(coll)
    duration = timeit("fun4(rbr)", globals=globals(), number=nb_runs)
    avg_duration = duration/nb_runs
    print(f'fun4: {avg_duration:.4f} seconds')

    df = fun4(rbr)
    duration = timeit("fun5(df)", globals=globals(), number=nb_runs)
    avg_duration = duration/nb_runs
    print(f'fun5: {avg_duration:.4f} seconds')

    gs = fun5(df)
    duration = timeit("fun6(df, gs)", globals=globals(), number=nb_runs)
    avg_duration = duration/nb_runs
    print(f'fun6: {avg_duration:.4f} seconds')

# fun1: 0.1043 seconds
# fun2: 0.6489 seconds
# fun3: 0.6396 seconds
# fun4: 0.0013 seconds
# fun5: 0.0001 seconds
# fun6: 0.0002 seconds
# Item-Collection size: 1868