Open floriandeboissieu opened 2 months ago

It may occur that the items of an ItemCollection do not share all the property keys. In stac-geoparquet <= 0.3.2, the geodataframe was built from a list of dicts, which introduced NaN wherever a property was missing. Since commit fb798f4 (included in 0.4.0+), the geodataframe is built from a dict of lists (for speed, I suppose), so a property missing from some items makes the operation fail at L177 with:

All arrays must be of the same length
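(The issue's original example is not preserved in this excerpt; as a stand-in, here is a minimal sketch with made-up data, not the library's actual code path, of the pandas behaviour in question:)

import pandas as pd

# list of dicts: pandas aligns keys and fills the missing ones with NaN
pd.DataFrame([{"a": 1, "b": 2}, {"a": 3}])

# dict of lists: lengths must match, otherwise pandas raises
# ValueError: All arrays must be of the same length
pd.DataFrame({"a": [1, 3], "b": [2]})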
As an ItemCollection cannot guarantee that all properties are shared by all items (or am I wrong about that?):
That's correct in general. That said, stac-geoparquet is mainly focused on the use case where items in a collection are homogeneous.
Just to confirm, you're using stac_geoparquet.to_geodataframe and that's raising the error? You might want to try the newer stac_geoparquet.arrow methods, which will probably be more robust:
In [41]: rbr = stac_geoparquet.arrow.parse_stac_items_to_arrow(coll)
In [42]: df = rbr.read_pandas(types_mapper=pd.ArrowDtype)
In [43]: df
Out[43]:
assets bbox collection ... s2:water_percentage sat:orbit_state sat:relative_orbit
0 {'AOT': {'gsd': 10.0, 'href': 'https://sentine... {'xmin': 6.2806415, 'ymin': 47.7341556, 'xmax'... sentinel-2-l2a ... 0.171997 ascending 8
1 {'AOT': {'gsd': 10.0, 'href': 'https://sentine... {'xmin': 6.2806415, 'ymin': 47.7341556, 'xmax'... sentinel-2-l2a ... 0.556677 descending 108
2 {'AOT': {'gsd': 10.0, 'href': 'https://sentine... {'xmin': 5.6670094, 'ymin': 47.6908511, 'xmax'... sentinel-2-l2a ... 0.190585 descending 108
[3 rows x 42 columns]
Just to confirm, you're using stac_geoparquet.to_geodataframe and that's raising the error?
That's right, I just updated the example to be clear.
You might want to try the newer stac_geoparquet.arrow methods, which will probably be more robust
That works well with missing properties (replaced by <NA>) and shows more specific dtypes in the pandas DataFrame:
assets struct<AOT: struct<gsd: double, href: string, ...
bbox struct<xmin: double, ymin: double, xmax: doubl...
collection string[pyarrow]
geometry binary[pyarrow]
id string[pyarrow]
links list<item: struct<href: string, rel: string, t...
...
s2:water_percentage double[pyarrow]
sat:orbit_state string[pyarrow]
sat:relative_orbit int64[pyarrow]
tilename string[pyarrow]
dtype: object
However, df is a pandas.DataFrame. In order to get a geopandas.GeoDataFrame:
import geopandas as gpd
gs = gpd.GeoSeries.from_wkb(df.geometry)
gdf = gpd.GeoDataFrame(df, geometry=gs) # keeps pyarrow types
gdf.dtypes
Out[80]:
assets struct<AOT: struct<gsd: double, href: string, ...
bbox struct<xmin: double, ymin: double, xmax: doubl...
collection string[pyarrow]
geometry geometry
id string[pyarrow]
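One caveat worth adding here (my note, not from the thread): GeoSeries.from_wkb does not set a CRS, and STAC geometries are WGS 84 per the GeoJSON spec, so you may want to declare it explicitly:

# same conversion, but with the CRS declared up front
gs = gpd.GeoSeries.from_wkb(df.geometry, crs="EPSG:4326")
gdf = gpd.GeoDataFrame(df, geometry=gs)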
# An alternative that converts the `timestamp` type to `datetime64` and `<NA>` to `NaN`,
# but converts the other pyarrow-specific types (string, struct, list) to `object`:
rbr = stac_geoparquet.arrow.parse_stac_items_to_arrow(coll)
gdf = gpd.GeoDataFrame.from_arrow(rbr)
gdf.dtypes
Out[81]:
assets object
bbox object
collection object
geometry geometry
id object
...
datetime datetime64[us, UTC]
...
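Incidentally (my addition, not from the thread), either GeoDataFrame can then be persisted with geopandas' own writer; note this produces plain GeoParquet, not necessarily a spec-compliant stac-geoparquet file:

# writes GeoParquet via pyarrow; nested struct/list columns are preserved
gdf.to_parquet("items.parquet")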
Any reason for not replacing the current implementation of stac_geoparquet.to_geodataframe with those three or four lines? That would certainly make it more robust and easier to maintain, wouldn't it?
Yeah, the implementation in stac_geoparquet.arrow is newer (and better). I just don't think anyone has gotten around to proposing a deprecation path for changing to_geodataframe to use the pyarrow implementation.
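For illustration only (no such shim exists in the library; the _to_geodataframe_legacy name and the use_arrow flag are hypothetical), a deprecation path could look roughly like:

import warnings
import geopandas as gpd
import pandas as pd
import stac_geoparquet

def to_geodataframe(items, use_arrow=False):
    # hypothetical shim: warn until the arrow path becomes the default
    if not use_arrow:
        warnings.warn(
            "to_geodataframe will switch to the stac_geoparquet.arrow "
            "implementation in a future release; pass use_arrow=True to opt in.",
            FutureWarning,
        )
        return _to_geodataframe_legacy(items)  # hypothetical: the current code path
    rbr = stac_geoparquet.arrow.parse_stac_items_to_arrow(items)
    df = rbr.read_pandas(types_mapper=pd.ArrowDtype)
    gs = gpd.GeoSeries.from_wkb(df.geometry)
    return gpd.GeoDataFrame(df, geometry=gs)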
For the record, my perf tests show that the old way is about 10x faster than the new:
import pystac_client
import stac_geoparquet
from timeit import timeit
from packaging.version import Version  # for a proper version comparison below
import pandas as pd
import geopandas as gpd
catalog = pystac_client.Client.open(
"https://planetarycomputer.microsoft.com/api/stac/v1",
)
search = catalog.search(
collections=["sentinel-2-l2a"],
bbox=[6.5425, 47.9044, 6.5548, 47.9091],
datetime="2018-01-01/2024-07-20", # starting from 2024-08-11, 's2:dark_features_percentage' was removed
sortby="datetime",
)
coll = search.item_collection()
print(set(coll[0].properties.keys()).symmetric_difference(coll[-1].properties.keys()))
# {'s2:dark_features_percentage'} # this property is missing in coll[1:3], due to a different processing baseline (05.11 instead of 05.10)
nb_runs = 10
#### stac-geoparquet 0.3.2 fills missing properties with NaN
#### stac-geoparquet > 0.3.2 raises an error if properties are missing from some items
def fun1(coll):
records = coll.to_dict()["features"]
gdf = stac_geoparquet.to_geodataframe(records)
return gdf
#### stac-geoparquet 0.6.0
def fun2(coll):
rbr = stac_geoparquet.arrow.parse_stac_items_to_arrow(coll)
df = rbr.read_pandas(types_mapper=pd.ArrowDtype)
gs = gpd.GeoSeries.from_wkb(df.geometry)
gdf = gpd.GeoDataFrame(df, geometry=gs) # keeps pyarrow types
return gdf
duration = timeit("fun1(coll)", globals=globals(), number=nb_runs)
avg_duration = duration/nb_runs
print(f'fun1: {avg_duration:.4f} seconds')
if Version(stac_geoparquet.__version__) >= Version('0.6.0'):  # a plain string comparison would break for e.g. '0.10.0'
duration = timeit("fun2(coll)", globals=globals(), number=nb_runs)
avg_duration = duration/nb_runs
print(f'fun2: {avg_duration:.4f} seconds')
print(f"Item-Collection size: {len(coll)}")
# fun1: 0.1396 seconds
# fun2: 1.1298 seconds
# Item-Collection size: 1868
It would be interesting to time just
rbr = stac_geoparquet.arrow.parse_stac_items_to_arrow(coll)
table = rbr.read_all()
I.e. which part of that is slow? The creation of Arrow data, the conversion to Pandas, or the parsing of geometries?
It's also important to note that the newer version is doing a lot more than the original version. For one, it scans the entire input data first to infer a strict schema that represents the full input dataset. It also does a few more transformations to match the latest stac-geoparquet spec.
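If schema inference is indeed the bottleneck, one experiment worth trying is to reuse a schema from a first pass. This sketch assumes the schema keyword argument on parse_stac_items_to_arrow (present in recent releases, if I'm not mistaken):

# infer the schema once...
schema = stac_geoparquet.arrow.parse_stac_items_to_arrow(coll).schema
# ...then supply it up front, which should skip the inference scan
rbr = stac_geoparquet.arrow.parse_stac_items_to_arrow(coll, schema=schema)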
@kylebarron, adding the following to the if statement of the previous post shows that most of the processing time is due to parse_stac_items_to_arrow:
duration = timeit("fun3(coll)", globals=globals(), number=nb_runs)
avg_duration = duration/nb_runs
print(f'fun3: {avg_duration:.4f} seconds')
rbr = fun3(coll)
duration = timeit("fun4(rbr)", globals=globals(), number=nb_runs)
avg_duration = duration/nb_runs
print(f'fun4: {avg_duration:.4f} seconds')
df = fun4(rbr)
duration = timeit("fun5(df)", globals=globals(), number=nb_runs)
avg_duration = duration/nb_runs
print(f'fun5: {avg_duration:.4f} seconds')
gs = fun5(df)
duration = timeit("fun6(df, gs)", globals=globals(), number=nb_runs)
avg_duration = duration/nb_runs
print(f'fun6: {avg_duration:.4f} seconds')
# fun1: 0.1043 seconds
# fun2: 0.6489 seconds
# fun3: 0.6396 seconds
# fun4: 0.0013 seconds
# fun5: 0.0001 seconds
# fun6: 0.0002 seconds
# Item-Collection size: 1868
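Another knob that might matter for parse_stac_items_to_arrow is its chunk_size keyword (assuming the 0.6.x signature; whether it helps here is untested):

# untested idea: produce fewer, larger record batches
rbr = stac_geoparquet.arrow.parse_stac_items_to_arrow(coll, chunk_size=len(coll))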