opendatacube / datacube-core

Open Data Cube analyses continental scale Earth Observation data through time
http://www.opendatacube.org
Apache License 2.0

Optimise database load when indexing without lineage verification #909

Closed Kirill888 closed 1 year ago

Kirill888 commented 4 years ago

When adding a dataset with a large number of lineage datasets, we can avoid reading all the lineage datasets from the DB if running in the following mode:

skip_lineage=False            # DO record lineage information
verify_lineage=False          # DO NOT compare DB and YAML versions of lineage docs
fail_on_missing_lineage=True  # DO NOT add lineage docs from YAML,
                              # expect DB to have all the lineage

In this mode the expectation is that all the lineage is already present in the database, and there is no need to verify that the lineage documents present in the DB and in the YAML document are the same.
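For context, these are the flags accepted by Doc2Dataset in datacube/index/hl.py. A minimal sketch of resolving a document in this mode, where doc and uri are placeholders for the parsed YAML document and its location:

from datacube import Datacube
from datacube.index.hl import Doc2Dataset

dc = Datacube()
resolver = Doc2Dataset(dc.index,
                       skip_lineage=False,            # DO record lineage
                       verify_lineage=False,          # DO NOT compare DB and YAML lineage
                       fail_on_missing_lineage=True)  # expect DB to have all the lineage

dataset, err = resolver(doc, uri)  # doc/uri: the YAML document and its location
if dataset is None:
    print("Failed to resolve:", err)
else:
    dc.index.datasets.add(dataset)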

Code here

https://github.com/opendatacube/datacube-core/blob/349dc1be1a65d1e231e3736a24983b42c6c7ba6f/datacube/index/hl.py#L146-L152

reads all the lineage datasets first, in case verification is needed later. Instead, it should use bulk_has to simply verify that the lineage datasets are present in the database. When indexing statistical products, lineage can be very large (hundreds of datasets), and extracting all of them from the DB just to check whether they are present is wasteful.
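A minimal sketch of the proposed change, assuming bulk_has yields booleans in the same order as the supplied ids (lineage_uuids here is a placeholder for the UUIDs collected from the YAML sub-tree):

# Check presence only, instead of fetching every lineage dataset from the DB
present = index.datasets.bulk_has(lineage_uuids)
missing = [str(uuid) for uuid, ok in zip(lineage_uuids, present) if not ok]

if missing and fail_on_missing_lineage:
    return None, "Following lineage datasets are missing from the DB: " + ", ".join(missing)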

The big issue here is that the Dataset(.., sources=..) parameter that captures lineage data expects to see a list of Dataset objects, not a list of mere UUIDs. So one would have to make up a "fake" dataset object with id=<uuid>, sources=[].

Relevant code in index.datasets.add(..):

https://github.com/opendatacube/datacube-core/blob/349dc1be1a65d1e231e3736a24983b42c6c7ba6f/datacube/index/_datasets.py#L163-L171

uchchwhash commented 4 years ago

The big issue here is that the Dataset(.., sources=..) parameter that captures lineage data expects to see a list of Dataset objects, not a list of mere UUIDs. So one would have to make up a "fake" dataset object with id=<uuid>, sources=[].

Does that quite work, though? To create the Dataset we need to resolve what product it belongs to, but how do I do that with just the UUID at my disposal?

Kirill888 commented 4 years ago

Actually you are right: we already have the dataset document, extracted from the YAML sub-tree, so there is no need to query it from the DB. What is missing is the product assignment for each lineage dataset, but since we won't be indexing those, and must verify that they are already present in the database, we can just populate them with some fake product, I guess.
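A rough sketch of that workaround; fake_product and the lineage_docs mapping are assumptions for illustration, not existing datacube helpers:

from datacube.model import Dataset

def stub_lineage_dataset(lineage_doc, fake_product):
    # lineage_doc is the dataset document already extracted from the
    # YAML sub-tree; sources={} stops any further recursion.  The
    # placeholder product is never indexed, it only satisfies the
    # Dataset(.., sources=..) type expectations.
    return Dataset(fake_product, lineage_doc, sources={})

sources = {name: stub_lineage_dataset(doc, fake_product)
           for name, doc in lineage_docs.items()}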

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

omad commented 1 year ago

This has been resolved with the new PostGIS database driver, and is a wontfix for the old PostgreSQL driver.