opendatacube / datacube-core

Open Data Cube analyses continental scale Earth Observation data through time
http://www.opendatacube.org

Postgres performance when calling `find_datasets` #1442

Closed: sotosoul closed this issue 1 year ago

sotosoul commented 1 year ago

Our ODC instance is deployed on AWS Aurora RDS and is populated with approximately 200 million datasets. The Aurora cluster is configured as Serverless v2 (64 - 128 ACUs) and has four instances in total (1 writer + 3 reader replicas).

datacube version = 1.8.12

I am facing performance issues, especially when a `find_datasets` call would return more than a few tens of thousands of datasets.

As an example, a `find_datasets` search generates the following query:

```sql
SELECT agdc.dataset.id, agdc.dataset.metadata_type_ref, agdc.dataset.dataset_type_ref,
       agdc.dataset.metadata, agdc.dataset.archived, agdc.dataset.added, agdc.dataset.added_by,
       array((SELECT selected_dataset_location.uri_scheme || ':' || selected_dataset_location.uri_body AS anon_1
              FROM agdc.dataset_location AS selected_dataset_location
              WHERE selected_dataset_location.dataset_ref = agdc.dataset.id
                AND selected_dataset_location.archived IS NULL
              ORDER BY selected_dataset_location.added DESC, selected_dataset_location.id DESC)) AS uris
FROM agdc.dataset
WHERE agdc.dataset.archived IS NULL
  AND (tstzrange(least(agdc.common_timestamp((agdc.dataset.metadata #>> '{properties, dtr:start_datetime}')),
                       agdc.common_timestamp((agdc.dataset.metadata #>> '{properties, datetime}'))),
                 greatest(agdc.common_timestamp((agdc.dataset.metadata #>> '{properties, dtr:end_datetime}')),
                          agdc.common_timestamp((agdc.dataset.metadata #>> '{properties, datetime}'))),
                 '[]')
       && tstzrange('2017-01-01T00:00:00+00:00'::timestamptz, '2023-05-31T23:59:59.999999+00:00'::timestamptz, '[]'))
  AND agdc.dataset.dataset_type_ref = 72
```

A corresponding `SELECT count(*)` query with the same search parameters shows that about 8 million records would be returned.
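For reference, this is the count query I mean, rebuilt from the same predicates as the generated query above (a sketch of what I ran, not the verbatim statement):

```sql
-- Same filters as the generated query above, but counting rows instead of
-- fetching dataset records and their locations.
SELECT count(*)
FROM agdc.dataset
WHERE agdc.dataset.archived IS NULL
  AND (tstzrange(least(agdc.common_timestamp((agdc.dataset.metadata #>> '{properties, dtr:start_datetime}')),
                       agdc.common_timestamp((agdc.dataset.metadata #>> '{properties, datetime}'))),
                 greatest(agdc.common_timestamp((agdc.dataset.metadata #>> '{properties, dtr:end_datetime}')),
                          agdc.common_timestamp((agdc.dataset.metadata #>> '{properties, datetime}'))),
                 '[]')
       && tstzrange('2017-01-01T00:00:00+00:00'::timestamptz, '2023-05-31T23:59:59.999999+00:00'::timestamptz, '[]'))
  AND agdc.dataset.dataset_type_ref = 72;
```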

The related database session gets an `IO: DataFileRead` wait event and stays there for a long time. I don't know whether it ever completes, since either datacube gets killed or I kill the process myself after a long time (maybe more than an hour).

If I copy-paste the generated query into pgAdmin and execute it directly from my computer, I get a `ClientWrite` wait event instead.
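For anyone trying to reproduce the observation: the wait events above are what I see in the standard `pg_stat_activity` view, roughly like this:

```sql
-- Show what each active backend is currently waiting on.
SELECT pid, state, wait_event_type, wait_event, left(query, 80) AS query_head
FROM pg_stat_activity
WHERE state = 'active';
```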

I'm not sure whether this is a datacube/SQLAlchemy issue or a Postgres issue, so any feedback is appreciated.
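If it helps narrow things down, I can capture a plan with standard `EXPLAIN`. A sketch, where a simplified SELECT stands in for the full generated statement:

```sql
-- Capture the actual plan, timings, and buffer usage.
-- Replace the simplified SELECT with the full generated statement.
EXPLAIN (ANALYZE, BUFFERS)
SELECT agdc.dataset.id
FROM agdc.dataset
WHERE agdc.dataset.archived IS NULL
  AND agdc.dataset.dataset_type_ref = 72;
```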

SpacemanPaul commented 1 year ago

I agree database performance is a huge issue for the ODC, but there is very little that can be done within the confines of the existing architecture without compromising backwards compatibility. Work is underway to address some of the underlying issues in the new postgis driver (labelled "experimental" in 1.8.12), and it should start to bear fruit in release 1.9.0, which is currently in the planning stages. We certainly expect these issues to be fully resolved (at the price of major breaks in backwards compatibility) in release 2.0.0.

But unfortunately, significant improvements are unlikely before these proposed future major releases.

SpacemanPaul commented 1 year ago

duplicate of #542