stac-utils / pgstac

Schema, functions and a python library for storing and accessing STAC collections and items in PostgreSQL
MIT License
153 stars 39 forks source link

Spatial partitions for big collections #313

Closed betolink closed 2 weeks ago

betolink commented 1 month ago

Doing a quick look at the documentation I couldn't find information about spatial partitioning, the use case I'm interested in is when we have a few collections but they have a lot of items, say all of Landsat/Sentinel in multiple versions (10M+). Is something like this available? I'm working with time series at sub-continental scale, e.g. "all of the mission for southern Greenland where..." I wonder if global grids could be used to define how to partition each collection.

bitner commented 4 weeks ago

No. There is no spatial partitioning possible and it would be a huge effort to add.

The partitioning options in pgstac (initial partition by collection and optional sub-partitions by month or year) are really optimized for two situations.

1) Data Management -- Being able to quickly drop partitions rather than very slow delete statements (Say you want to remove a collection -- rather than DELETE FROM items WHERE collection='landsat'; which would be very slow, you can just DROP TABLE _items_3)

2) The ability to use constraint exclusion to limit the tables to scan -- we use this in a very targeted manner with the optimizations that pgstac provides for the most common case with STAC where you want say the most recent 100 items that match given filters. In pgstac, rather than evaluating the filters (attributes, temporal, spatial ....) across the entire set of records, we actually step through time chunks until we reach the limit that we asked for. For example if you want to get the most recent records that are in Minnesota and have less than 10% cloud cover, what pgstac will actually do is cycle through queries like this: SELECT * FROM items WHERE st_intersects(geometry, <MN Geometry>) and properties->'eo:cloud_cover' < 10 AND datetime < '2024-10-30 00:00:00Z' and datetime >= '2024-10-01 00:00:00Z' SELECT * FROM items WHERE st_intersects(geometry, <MN Geometry>) and properties->'eo:cloud_cover' < 10 AND datetime < '2024-10-01 00:00:00Z' and datetime >= '2024-09-01 00:00:00Z' ... until we have received 100 records

This actually lets our spatial intersection indexes be much more effective as well as with the smaller time windows, the selectivity of that spatial index is likely to be far higher.

10 million records is really not actually that many, but if you wanted to break things down smaller, adding temporal partitioning would be your only option.

betolink commented 2 weeks ago

Thanks for the detailed answer, we're scanning about 30M records (all landsats+sentinels) and I was looking to further optimize the partitioning to take advantage of parallel execution over the geom index. Just out of curiosity, do you guys have benchmarks we can take a look at? no worries if not, I'm closing this issue, Thanks!

bitner commented 2 weeks ago

Sorry, we don't have any benchmarks, but based on other experience, I would not expect spatial partitioning to make any difference (and actually, there is a good chance it would make things worse) in the pgstac context.