From my understanding, the `--dehydrated` option for the loader is unrelated to the search's `"nohydrate"` configuration. Its purpose is to enable the loader to process a dehydrated file (or iterable) of items, such as those generated by tools like `pg_dump` or the PostgreSQL `COPY` command.
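For context, here is a minimal sketch of the two loading modes using pypgstac's Python API; the class and parameter names reflect recent pypgstac releases, and the DSN and file names are placeholders:

```python
# Minimal sketch: the two loading modes in pypgstac's Python API.
# Assumes a recent pypgstac; the DSN and file names are placeholders.
from pypgstac.db import PgstacDB
from pypgstac.load import Loader, Methods

db = PgstacDB(dsn="postgresql://user:pass@localhost:5432/postgis")
loader = Loader(db=db)

# Normal case: the ndjson contains full (hydrated) items; pypgstac
# dehydrates them against the collection's base_item on ingest.
loader.load_items("poland-data.json", insert_mode=Methods.insert)

# --dehydrated case: the input was exported already dehydrated
# (e.g. via pg_dump or COPY), so the loader skips the dehydration step.
loader.load_items("dump.ndjson", insert_mode=Methods.insert, dehydrated=True)
```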
I ran a quick test on one of our collections and can confirm that using `{"nohydrate": true}` results in faster performance, as expected, though for this collection the difference is not that significant:
```
postgis=> select * from search('{"collections": ["l0"], "limit": 1000, "conf": {"nohydrate": false}, "fields": {"include": [], "exclude": []}, "filter-lang": "cql-json"}');
Time: 545.357 ms

postgis=> select * from search('{"collections": ["l0"], "limit": 1000, "conf": {"nohydrate": true}, "fields": {"include": [], "exclude": []}, "filter-lang": "cql-json"}');
Time: 112.718 ms
```
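As a side note, the same comparison can be scripted outside psql; here is a minimal sketch assuming psycopg 3 and a placeholder DSN:

```python
# Minimal sketch: timing pgstac's search() with nohydrate toggled.
# Assumes psycopg 3 and a pgstac-enabled database; the DSN is a placeholder.
import time

import psycopg
from psycopg.types.json import Jsonb

body = {
    "collections": ["l0"],
    "limit": 1000,
    "fields": {"include": [], "exclude": []},
    "filter-lang": "cql-json",
}

with psycopg.connect("postgresql://user:pass@localhost:5432/postgis") as conn:
    with conn.cursor() as cur:
        for nohydrate in (False, True):
            body["conf"] = {"nohydrate": nohydrate}
            start = time.perf_counter()
            cur.execute("SELECT * FROM search(%s);", [Jsonb(body)])
            cur.fetchall()
            print(f"nohydrate={nohydrate}: {time.perf_counter() - start:.3f}s")
```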
@MathewNWSH you can perform hydration on the stac-fastapi-pgstac side instead; see the `use_api_hydrate` option.
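A minimal sketch of enabling that, assuming a recent stac-fastapi-pgstac (the setting can also be supplied via the `USE_API_HYDRATE` environment variable):

```python
# Minimal sketch: enabling API-side hydration in stac-fastapi-pgstac.
# Assumes a recent release; equivalent to exporting USE_API_HYDRATE=true
# in the application's environment.
from stac_fastapi.pgstac.config import Settings

settings = Settings(use_api_hydrate=True)
```

With this enabled, `search()` is called with `nohydrate: true` and the API assembles full items itself, which moves the hydration cost out of the database.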
I tested the process using your data, and here are the results:
```
postgis=# select * from search('{"collections": ["sentinel-2-l2a"], "limit": 1000, "conf": {"nohydrate": true}, "fields": {"include": [], "exclude": []}, "filter-lang": "cql-json"}');
Time: 1374.108 ms (00:01.374)

postgis=# select * from search('{"collections": ["sentinel-2-l2a"], "limit": 1000, "conf": {"nohydrate": false}, "fields": {"include": [], "exclude": []}, "filter-lang": "cql-json"}');
Time: 47322.193 ms (00:47.322)
```
To improve search performance, there are at least two possible approaches:

1. Perform hydration on the application side with the `use_api_hydrate` option in stac-fastapi-pgstac, as described above.
2. Remove `item_assets` from the collection definition before ingestion. This way, items will be ingested into the `items` table with all their metadata, while `base_item` will be mostly empty, and dehydration will retain all the metadata.

Here are the results for the second approach:
```
postgis=# select * FROM collection_base_item('sentinel-2-l2a');
                                     collection_base_item
----------------------------------------------------------------------------------------------
 {"type": "Feature", "assets": null, "collection": "sentinel-2-l2a", "stac_version": "1.1.0"}
(1 row)

postgis=# select * from search('{"collections": ["sentinel-2-l2a"], "limit": 1000, "conf": {"nohydrate": false}, "fields": {"include": [], "exclude": []}, "filter-lang": "cql-json"}');
Time: 3105.260 ms (00:03.105)
```
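For completeness, here is a minimal sketch of the preprocessing step in the second approach, stripping `item_assets` from the collection definition before it is loaded (file names are placeholders):

```python
# Minimal sketch: remove item_assets from a collection definition before
# ingestion, so pgstac's base_item stays mostly empty and dehydrated items
# keep their full metadata. File names are placeholders.
import json

with open("collection-sentinel-2-l2a.json") as f:
    collection = json.load(f)

# base_item is derived largely from item_assets; dropping it means
# dehydration has almost nothing to strip from incoming items.
collection.pop("item_assets", None)

with open("collection-stripped.json", "w") as f:
    json.dump(collection, f)
```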
Unfortunately, right now there is no option to turn off dehydration when loading data. I would definitely recommend the `use_api_hydrate` option with stac-fastapi-pgstac as the first choice.
Okay, thank you. Switching to hydrating on the fastapi-pgstac side helped a lot.
Hello,
I’m facing a challenge in creating a high-performance pgstac database. So far, for small-scale work, the deployment from https://github.com/stac-utils/stac-fastapi-pgstac/blob/main/docker-compose.yml has been more than sufficient, but I noticed that querying the catalog at https://stac-cdse.eu/collections/sentinel-2-l2a/items?limit=1000 takes 1 minute and 33 seconds. That seems far too long for a test database with fewer than 10,000 items deployed on a large machine (set up this way: https://github.com/MathewNWSH/creodias-customisation-script-pgstac?tab=readme-ov-file#customisation-script).
I’ve noticed that the `search` function in pgstac takes up most of the time, so I tested its performance directly:

```
-- 1 min 21 sec:
select * from search('{"collections": ["sentinel-2-l2a"], "limit": 1000, "conf": {"nohydrate": false}, "fields": {"include": [], "exclude": []}, "filter-lang": "cql-json"}');

-- 2.4 sec:
select * from search('{"collections": ["sentinel-2-l2a"], "limit": 1000, "conf": {"nohydrate": true}, "fields": {"include": [], "exclude": []}, "filter-lang": "cql-json"}');
```
My task is to establish a configuration that serves all the required attributes while keeping performance acceptable. I thought that pypgstac would offer an option to load full-size items, which would make assembling responses later less time-consuming, via something like `pypgstac load ... --dehydrated false`, but I guess that is already the default, and using the `--dehydrated` flag does not do what I was hoping for. So I was essentially looking for a way to turn dehydration off at load time.
I’m attaching links to my sample data:

- single item: https://s3.fra1-2.cloudferro.com/swift/v1/poland-stac/S2B_MSIL2A_20240110T100309_N0510_R122_T33UVR_20240110T113053.json
- collection JSON: https://s3.fra1-2.cloudferro.com/swift/v1/poland-stac/collection-sentinel-2-l2a.json
- ndjson file storing almost 10k items: https://s3.fra1-2.cloudferro.com/swift/v1/poland-stac/poland-data.json
Can I ask for any suggestions? Is it possible to load items in their original size into the database so that creating responses later doesn’t take several minutes?
Thanks, Marcin