stac-utils / pystac-client

Python client for searching STAC APIs
https://pystac-client.readthedocs.io

Efficiency of item_collection() #659

Closed: ZZMitch closed this issue 5 months ago

ZZMitch commented 5 months ago

Hello,

I use pystac_client as my window into STACs, from which I load satellite imagery items into stackstac (or odc-stac) to create xarray datacubes. I have been trying to make my processing workflows more efficient, since I want to use them for some wider-area, denser-in-time analyses.

One bottleneck I currently have with pystac_client is item_collection(). My typical workflow:

cat = pc.Client.open(...)

items = cat.search(bbox = ..., datetime = ..., collections = ...).item_collection()

stack = ss.stack(items, ...)

With search parameters that find about 300 items, item_collection() takes roughly 50 seconds. If I try to grab more items (e.g., all HLS observations over 5-10 years), this can take several minutes. I just wanted to understand whether I am being as efficient as I can here, or whether there is something I can do to reduce the time it takes to build the collection. Note that I am on a government network, which could be part of the slowdown.
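For reference, a minimal runnable sketch of this pattern with rough timing (the endpoint and collection below are stand-ins, not the HLS catalog/collections I actually query):

```python
# Minimal sketch of the workflow above, with rough timing.
# NOTE: the endpoint and collection here are stand-ins (a public demo
# catalog), not the HLS catalog/collections actually being queried.
import time

import pystac_client

cat = pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")

search = cat.search(
    bbox=[-75.6, 45.3, -75.4, 45.5],          # small example bbox
    datetime="2022-01-01/2022-12-31",
    collections=["sentinel-2-l2a"],
)

t0 = time.perf_counter()
items = search.item_collection()               # pages through results sequentially
print(f"{len(items)} items in {time.perf_counter() - t0:.1f} s")
```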

Thank you for your time!

jsignell commented 5 months ago

I'm going to move this question to the pystac-client repo since it is primarily about the API. But my sense is that the performance will depend on the STAC API that you are accessing. If it is public then it would be helpful to have a functioning example.

TomAugspurger commented 5 months ago

Some providers will offer bulk access to STAC collections through geoparquet datasets (for example https://planetarycomputer.microsoft.com/docs/quickstarts/stac-geoparquet/).
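A rough sketch of that bulk route, assuming you have a geoparquet snapshot of a collection's items on hand (the file path and the `datetime` column are placeholders; see the quickstart above for how to locate the real assets):

```python
# Sketch of the bulk route: read a stac-geoparquet snapshot of a collection's
# items instead of paging through the API. The file path and the assumption
# that the snapshot has a "datetime" column are placeholders.
import geopandas as gpd

items_df = gpd.read_parquet("hls-items-snapshot.parquet")   # placeholder path

# Filter by time (and space, via the geometry column) with ordinary
# (Geo)DataFrame operations rather than per-request paging.
subset = items_df[items_df["datetime"].between("2022-01-01", "2022-12-31")]
print(len(subset), "items in the snapshot for 2022")
```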

Otherwise, you'll be limited by the speed of your network and the server (and pystac / pystac-client is currently sequential and synchronous; see #4 and https://github.com/stac-utils/pystac-client/issues/77). You might be able to speed things up a bit by increasing limit= in your .search() call, which might translate to fewer, larger HTTP requests.
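For example, a sketch of the same kind of search with a larger page size (endpoint and collection are again only stand-ins):

```python
# Same kind of search, but with a larger page size. limit= controls how many
# items come back per HTTP request (the server may cap or ignore it).
import pystac_client

cat = pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")

items = cat.search(
    bbox=[-75.6, 45.3, -75.4, 45.5],
    datetime="2022-01-01/2022-12-31",
    collections=["sentinel-2-l2a"],
    limit=250,                                 # fewer, larger requests
).item_collection()
```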

ZZMitch commented 5 months ago

@TomAugspurger Very useful information, thank you! Your links, especially the one from #77, are close to what spurred me to ask this question.

I can provide some more context...

I have been interested in getting a better understanding of cloud-free data availability through time from HLS across large areas (e.g., making sure there are enough clear observations during a given year/time of year to make a project feasible). To do this across Canada, I created a 120 km fishnet and got points representing the centroid of each square (~900 points). Then I pretty much created a for loop that, for each point:

  1. Uses pystac-client to run a STAC search over a tiny bounding box around the point, with the date range, etc.
  2. Uses stackstac to generate 1 x 1 (pixel) x Fmask x N-timestep xarrays
  3. Runs other processing steps (e.g., merging duplicate observations)
  4. Gets the number of clear observations per month and adds it as a column in a CSV to create figures from

There is also a simple try/except so server issues don't stop the whole run (I can come back to failed points at the end); a rough sketch of the loop is below.
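Everything in this sketch is illustrative: the endpoint, collection id, asset name ("Fmask"), bbox size, and the clear-observation counting are placeholders/simplifications, not my exact code:

```python
# Rough shape of the per-point loop. Endpoint, collection id, asset name,
# and the clear-observation logic are all placeholders.
import csv

import pystac_client
import stackstac

cat = pystac_client.Client.open("https://example.com/stac/v1")   # placeholder endpoint

points = [(-75.5, 45.4), (-110.2, 60.1)]   # fishnet centroids (lon, lat), illustrative
eps = 0.0001                               # half-width of the tiny bbox, in degrees

with open("clear_obs.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for lon, lat in points:
        bbox = [lon - eps, lat - eps, lon + eps, lat + eps]
        try:
            # 1. Search a tiny bounding box around the point.
            items = cat.search(
                bbox=bbox,
                datetime="2018-01-01/2022-12-31",
                collections=["HLSL30.v2.0"],   # placeholder collection id
                limit=250,
            ).item_collection()

            # 2. Build a ~1 x 1 pixel x Fmask x N-timesteps cube.
            cube = stackstac.stack(items, assets=["Fmask"], bounds_latlon=bbox)

            # 3./4. Merge duplicate observations, decode Fmask, and count clear
            # observations per month here (omitted in this sketch).
            writer.writerow([lon, lat, len(items)])
        except Exception as exc:               # come back to failed points later
            print(f"point ({lon}, {lat}) failed: {exc}")
```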

For a large number of points this can be a pretty slow process, taking anywhere from a minute or so (e.g., a southern point in a cloudy area, with a few hundred time-steps to consider) to almost an hour (e.g., a northern Arctic point with lots of orbit overlap, with thousands of time-steps to consider). For ~900 points, it takes several days to complete.

This is usually a one-off thing I would run at the beginning of a project to explore the imagery time series, but it serves as a good opportunity to find bottlenecks in my data access pipeline for when I attempt analyses that involve working with every pixel through time. For this 1 pixel x 1 band x many time-steps x many points example, the pystac-client portion eats up a significant amount of time since so many images need to be considered (e.g., I probably "touched" almost 3 million HLS images to get those Canada-wide numbers).

All this is to say I am still pretty new to working with imagery through STAC and these other tools, and checking out GitHub discussions to learn about best practices for given workflows has been very helpful. I will consider the options you noted above and test out the point-based example you linked to.

If you have any other tips related to this type of approach, feel free to share! I won't be able to do more testing on this until next week at the earliest, so if you have nothing else to add, you can close this and I can come back to it later.

ZZMitch commented 5 months ago

> You might be able to speed things up a bit by increasing limit= in your .search() call, which might translate to fewer, larger HTTP requests.

This simple change (setting limit to 250, the max value allowed) led to a big speed-up by itself! For a year of HLS data over a 60 x 60 km square, I can now build a dask xarray with stackstac in under 10 seconds, whereas with the default limit (100?) it is closer to 30 seconds.

Also, just an FYI that the pystac-client documentation for search is out of date: https://pystac-client.readthedocs.io/en/stable/api.html#pystac_client.Client.search

It still says that max_items defaults to 100.

ZZMitch commented 5 months ago

@TomAugspurger

After a little more testing, I think the current default for limit in .search() is actually around 10, rather than 100 as the documentation says? I ran my tests with limit = 100 and got around the same results (around 10 seconds) as with 250, which would make sense since the L30 search returns 85 items and the S30 search returns 69 in this test case (and I do two separate searches, since I have some specialized code for each constellation).

However, when I set limit to 10, I got similar search times as the default (around 30 seconds).

Also note that I ran these tests 5 times each to confirm the numbers, restarting my kernel each time to make sure they were independent (no cached data).
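For what it's worth, a minimal sketch of the kind of comparison I ran (it doesn't reproduce the kernel restarts; endpoint and collection id are placeholders):

```python
# Quick page-size comparison on one search; endpoint and collection id
# are placeholders, and repeated runs in one session may benefit from caching.
import time

import pystac_client

cat = pystac_client.Client.open("https://example.com/stac/v1")   # placeholder endpoint

for limit in (10, 100, 250):
    search = cat.search(
        bbox=[-75.6, 45.3, -75.4, 45.5],
        datetime="2022-01-01/2022-12-31",
        collections=["HLSL30.v2.0"],           # placeholder collection id
        limit=limit,
    )
    t0 = time.perf_counter()
    items = search.item_collection()
    print(f"limit={limit}: {len(items)} items in {time.perf_counter() - t0:.1f} s")
```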

TomAugspurger commented 5 months ago

There might be some server-side settings that affect the default and maximum value for limit.

ZZMitch commented 5 months ago

Got it, thanks. I will need to play around with this a bit more, since I started getting 502 Bad Gateway errors with limit = 250 when searching over larger grid squares (120 x 120 km in this case), but not with limit = 100.

I may need a little system where I change the limit value based on the size of the area I am pulling data from, to get the best speed while avoiding server-related errors.
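Something like this sketch is what I have in mind (the helper name, endpoint, and collection id are placeholders, and the broad exception catch is deliberate since the 502 can surface as different exception types):

```python
# One possible shape for that "little system": try a large page size first and
# fall back to smaller ones when the server errors out.
import pystac_client

cat = pystac_client.Client.open("https://example.com/stac/v1")   # placeholder endpoint

def robust_item_collection(catalog, limits=(250, 100, 50), **search_kwargs):
    """Hypothetical helper: retry the same search with smaller page sizes."""
    last_exc = None
    for limit in limits:
        try:
            return catalog.search(limit=limit, **search_kwargs).item_collection()
        except Exception as exc:               # e.g. 502 Bad Gateway from the server
            print(f"limit={limit} failed ({exc}); retrying with a smaller page size")
            last_exc = exc
    raise last_exc

items = robust_item_collection(
    cat,
    bbox=[-76.3, 45.0, -74.8, 46.1],           # roughly a 120 x 120 km square
    datetime="2022-01-01/2022-12-31",
    collections=["HLSL30.v2.0"],               # placeholder collection id
)
```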