stac-utils / pgstac

Schema, functions, and a Python library for storing and accessing STAC collections and items in PostgreSQL

search function runtime when working with large items | considerations regarding the --dehydrated flag in pypgstac #305

Closed MathewNWSH closed 1 month ago

MathewNWSH commented 2 months ago

Hello,

I’m facing a challenge in creating a high-performance pgstac database. So far, for small-scale work, the deployment from https://github.com/stac-utils/stac-fastapi-pgstac/blob/main/docker-compose.yml has been more than sufficient, but I noticed that querying the catalog https://stac-cdse.eu/collections/sentinel-2-l2a/items?limit=1000 takes 1 minute and 33 seconds. This shouldn't be an issue for a test database with fewer than 10,000 items deployed on a large machine (set up as described at https://github.com/MathewNWSH/creodias-customisation-script-pgstac?tab=readme-ov-file#customisation-script).

I’ve noticed that the search function in pgstac takes up most of the time, so I tested its performance directly:

1 min 21 s:

select * from search('{"collections": ["sentinel-2-l2a"], "limit": 1000, "conf": {"nohydrate": false}, "fields": {"include": [], "exclude": []}, "filter-lang": "cql-json"}');

2.4 s:

select * from search('{"collections": ["sentinel-2-l2a"], "limit": 1000, "conf": {"nohydrate": true}, "fields": {"include": [], "exclude": []}, "filter-lang": "cql-json"}');

My task is to establish a configuration that serves all the required attributes while keeping performance efficient. I thought pypgstac would offer an option to load full-size items, which would make assembling responses later less time-consuming, something like pypgstac load ... --dehydrated false, but I guess that is already the default, and using the --dehydrated flag results in:

Traceback (most recent call last):
  File "/opt/venv/python3.11/bin/pypgstac", line 8, in <module>
    sys.exit(cli())
             ^^^^^
  File "/opt/venv/python3.11/lib/python3.11/site-packages/pypgstac/pypgstac.py", line 125, in cli
    fire.Fire(PgstacCLI)
  File "/opt/venv/python3.11/lib/python3.11/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/python3.11/lib/python3.11/site-packages/fire/core.py", line 466, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/python3.11/lib/python3.11/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/python3.11/lib/python3.11/site-packages/pypgstac/pypgstac.py", line 76, in load
    loader.load_items(file, method, dehydrated, chunksize)
  File "/opt/venv/python3.11/lib/python3.11/site-packages/pypgstac/load.py", line 609, in load_items
    for chunkin in chunked_iterable(items, chunksize):
  File "/opt/venv/python3.11/lib/python3.11/site-packages/pypgstac/load.py", line 61, in chunked_iterable
    chunk = tuple(itertools.islice(it, size))
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/python3.11/lib/python3.11/site-packages/pypgstac/load.py", line 577, in read_dehydrated
    item[field] = tab_split[i]
                  ~~~~~~~~~^^^
IndexError: list index out of range

So I was hoping for:

  1. loading full items into the database, including attributes that repeat across items
  2. accepting the increased database size
  3. not wasting time on hydration, since items would be hydrated by default

I’m attaching links to my sample data:

https://s3.fra1-2.cloudferro.com/swift/v1/poland-stac/S2B_MSIL2A_20240110T100309_N0510_R122_T33UVR_20240110T113053.json (single item)
https://s3.fra1-2.cloudferro.com/swift/v1/poland-stac/collection-sentinel-2-l2a.json (collection JSON)
https://s3.fra1-2.cloudferro.com/swift/v1/poland-stac/poland-data.json (ndjson file with almost 10k items)

Can I ask for any suggestions? Is it possible to load items at their full size into the database so that building responses later doesn’t take several minutes?

Thanks, Marcin

drnextgis commented 2 months ago

From my understanding, the --dehydrated option for the loader is unrelated to the search's "nohydrate" configuration. Its purpose is to enable the loader to process a dehydrated file (or iterable) containing items, such as those generated by tools like pg_dump or the PostgreSQL COPY command.
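
For illustration, here is a minimal sketch of a regular (hydrated) load through pypgstac's Python API; the DSN is a placeholder, and the Loader/Methods names should be checked against the installed pypgstac version:

# A minimal sketch, assuming pypgstac's Loader API; the DSN is a placeholder.
from pypgstac.db import PgstacDB
from pypgstac.load import Loader, Methods

db = PgstacDB(dsn="postgresql://user:password@localhost:5432/postgis")
loader = Loader(db=db)

# Ordinary ndjson items are loaded without --dehydrated. That flag only tells
# the loader to expect pgstac's internal tab-separated row format (e.g. the
# output of COPY from the items table), which is why it fails with an
# IndexError on regular JSON lines.
loader.load_items("poland-data.json", insert_mode=Methods.insert)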

drnextgis commented 2 months ago

I ran a quick test on one of our collections and can confirm that using {"nohydrate": true} results in faster performance, as expected, though the difference is not that significant.

postgis=> select * from search('{"collections": ["l0"], "limit": 1000, "conf": {"nohydrate": false}, "fields": {"include": [], "exclude": []}, "filter-lang": "cql-json"}');
Time: 545.357 ms

postgis=> select * from search('{"collections": ["l0"], "limit": 1000, "conf": {"nohydrate": true}, "fields": {"include": [], "exclude": []}, "filter-lang": "cql-json"}');
Time: 112.718 ms

drnextgis commented 2 months ago

@MathewNWSH you can perform hydration on the stac-fastapi-pgstac side; see the use_api_hydrate option.
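
As a minimal sketch of what that looks like, assuming stac-fastapi-pgstac's pydantic-based Settings class (verify the module path and field against your installed version):

# A minimal sketch: stac-fastapi-pgstac configuration is a pydantic Settings
# class, so the option can be set in code or via the environment.
from stac_fastapi.pgstac.config import Settings

# With use_api_hydrate enabled, pgstac returns dehydrated rows and the
# application server performs the hydration step itself.
settings = Settings(use_api_hydrate=True)

Because it is a pydantic settings class, exporting USE_API_HYDRATE=true in the service environment should have the same effect.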

drnextgis commented 2 months ago

I tested the process using your data, and here are the results:

postgis=# select * from search('{"collections": ["sentinel-2-l2a"], "limit": 1000, "conf": {"nohydrate": true}, "fields": {"include": [], "exclude": []}, "filter-lang": "cql-json"}');
Time: 1374.108 ms (00:01.374)

postgis=# select * from search('{"collections": ["sentinel-2-l2a"], "limit": 1000, "conf": {"nohydrate": false}, "fields": {"include": [], "exclude": []}, "filter-lang": "cql-json"}');
Time: 47322.193 ms (00:47.322)

To improve search performance, there are at least two possible approaches:

  1. Offload the CPU-intensive task of rehydrating items from the database to the application server, as I suggested earlier.
  2. Remove item_assets from the collection definition before ingestion. This way, items will be ingested into the items table with all their metadata, while base_item will be mostly empty, so dehydration will retain all the metadata (see the sketch after this list).
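
For the second approach, a minimal sketch that strips item_assets from the collection JSON before ingesting it (file name taken from the sample data linked above):

# A minimal sketch: drop item_assets from the collection before loading it,
# so that base_item stays mostly empty and each item keeps its full metadata.
import json

with open("collection-sentinel-2-l2a.json") as f:
    collection = json.load(f)

# Remove the shared asset definitions that would otherwise be folded into
# base_item and stripped from each item during dehydration.
collection.pop("item_assets", None)

with open("collection-sentinel-2-l2a-stripped.json", "w") as f:
    json.dump(collection, f)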

Here are the results for the second approach:

postgis=# select * FROM collection_base_item('sentinel-2-l2a');
                                     collection_base_item                                     
----------------------------------------------------------------------------------------------
 {"type": "Feature", "assets": null, "collection": "sentinel-2-l2a", "stac_version": "1.1.0"}
(1 row)

postgis=# select * from search('{"collections": ["sentinel-2-l2a"], "limit": 1000, "conf": {"nohydrate": false}, "fields": {"include": [], "exclude": []}, "filter-lang": "cql-json"}');
Time: 3105.260 ms (00:03.105)

bitner commented 1 month ago

Unfortunately, right now there is no option to turn off dehydration when loading data. I would definitely recommend trying the use_api_hydrate option with stac-fastapi-pgstac first.

MathewNWSH commented 1 month ago

Okay, thank you. Switching to hydration on the stac-fastapi-pgstac side helped a lot.