stac-utils / pgstac

Schema, functions and a python library for storing and accessing STAC collections and items in PostgreSQL
MIT License
146 stars 34 forks source link

[Feature request] Adding daily partitions #307

Open j-musial opened 1 week ago

j-musial commented 1 week ago

Next generation of optical satellites e.g. Sentinel-2 generates several thousands product per day at global scale and consequently the monthly PGSTAC partitions are too big. It would be advantageous to have daily partitions.

bitner commented 3 days ago

My general recommendations from my testing is that the ideal balance for partition size is targeting for each partition to be roughly in the single digit millions. Even at a 100,000 products per day (3 million / month) that should still be reasonable with monthly partitions.

There is definitely a balance here -- as you increase the number of partitions, the query planning process in Postgres gets slower and any searches that cover multiple partitions will also get much slower.

It is certainly possible for us to add a daily option for partition, but I have certainly never seen anything that I would expect to perform better with that level of partitioning.

j-musial commented 2 days ago

The use case for daily partitions are catalogues that are frequently queried for the most recent products. In our case the 70% queries are for the last week and 40% for the last day. Furthermore, if you have 3 million/month complex geometries/footprints then your queries will be significantly slower then for 100,000 products. Regarding the multi-temporal queries indeed number of partitions may decrease performance. However, once the 1000/10000 products are found (and this usually requires only few days and only few partitions scans) the query should stop.

Overall daily partitions would be useful option to have in PGSTAC so the user may choose what configuration works the best for him.

jonas-eberle commented 2 days ago

Please consider, as far as I know, if a request queries one specific STAC item (e.g., when calling https://pgstac.demo.cloudferro.com/collections/sentinel-2-l2a/items/S2B_MSIL2A_20240110T100309_N0510_R122_T33UVR_20240110T113053) then all partitions are queried until the scene has been found. I guess this is because postgres does not know in which partition this item is. At least I have seen such SQL queries when requesting a single item. With daily partitions this could lead to a lot of queries in the background.

@bitner Please correct, if this is wrong.

j-musial commented 2 days ago

@jonas-eberle in case of a single product indeed all partitions have to be queried until it is found but this is a simple scan. More troublesome are spatio-temporal queries over large AOI with complex footprints. Then you really would take advantage of daily partitions because only few of them will be queried until the max elements (1000 in our case) are found.

bitner commented 2 days ago

yep @jonas-eberle, you are exactly correct. Any query that is not bound by collection and/or time will need to scan through all the partitions. @j-musial this also means that an endpoint like /items/ is going to be very slow.

@j-musial Yes, it is a simple scan, but even a simple scan done 30X more often with the included additional planning penalty can make a huge difference. The way that pgstac works, if you are ordering your return data by datetime, we have a short cut that actually breaks the passed in query down into smaller chunks of time ranges that we scan serially. This allows the btree index to work very well even on larger tables with complex footprints. To be clear, I'm not opposed to adding a "day" option, I just want to explain that smaller partitions are not a magic bullet and may well likely make some things considerably worse for performance. I'd be happy to be show wrong though and would certainly accept a PR for the change, it's just not going to be a priority for me.

j-musial commented 2 days ago

@bitner thank you for the comprehensive explanation. I guess the optimal partitioning depends on the usage scenario but if PGSTAC does the serial scans indeed the performance gain may not be huge at the cost of additional time required for query planning. The best way to verify this would be to run different types of queries on monthly and daily partitions, but I do understand that the daily partitions may not be the most urgent feature to be added to PGSTAC. Anyway thanks for your great work!