Extremely Slow Adding New Products to Data Cube

opendatacube / datacube-core

Open Data Cube analyses continental scale Earth Observation data through time

http://www.opendatacube.org

Apache License 2.0

513 stars 178 forks source link

Extremely Slow Adding New Products to Data Cube #560

Open omad opened 6 years ago

omad commented 6 years ago

Expected behaviour

I should be able to add a new product to a Data Cube index and not have to wait.

Actual behaviour

It's very slow (tens of minutes) to add each new product to the production DEA Database.

Steps to reproduce

As an admin on a large Data Cube Index, run:

datacube product add <product-defn.yaml>

Environment information

Which datacube --version are you using? Open Data Cube core, version 1.6rc2+108.g096a26d5
What datacube deployment/enviornment are you running against? Database Server: agdc-db.nci.org.au NCI Module: dea/20180815

whatnick commented 4 years ago

Version:       1.7+0.g98cf9ba3.dirty
Config files:  <ODC config>
Host:          <OWS-DEV-RDS>:5432
Database:     ows
User:          ows
Environment:   None
Index Driver:  default

Valid connection:       YES

Still present in DEA-dev environment. Adding new products is slow. Killing the datacube add process and retrying makes the operation instant.

Kirill888 commented 4 years ago

@whatnick what you describe simply skips index generation step making future queries really slow, particularly on larger databases.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

whatnick commented 4 years ago

Bump!!

whatnick commented 4 years ago

Here is a benchmark

time datacube product add https://raw.githubusercontent.com/GeoscienceAustralia/digitalearthau/develop/digitalearthau/config/products/cemp_insar_radarsat2_displacement.yaml
Adding "cemp_insar_radarsat2_displacement" (this might take a while) DONE

real    13m50.068s
user    0m1.393s
sys     0m0.254s

sotosoul commented 1 year ago

In my case, adding a product to an already populated with $200\times10^6$ Datasets takes well over 20-30 minutes to complete. Exploring the running SQL statement in pgAdmin shows that Postgres is creating an INDEX for that product. The effects of this INDEX are impressive. If querying by dataset label takes about 1 min to return the unique result, querying by product and label takes only 2-3 sec!

omad commented 1 year ago

This is due to design decisions made with the current PostgreSQL database driver, where it creates multiple partial indexes when adding a new product, which scales in line with the number of indexed datasets.

This will be greatly improved, hopefully to the point of being a non issue with the new PostGIS driver coming in v1.9 or v2.