Closed Kirill888 closed 1 year ago
The big issue here is that the Dataset(..sources=..) parameter that captures lineage data expects to see a list of Dataset objects and not a list of mere UUIDs. So one would have to make up a "fake" dataset object with id=<uuid>, sources=[]
Does that quite work though? Because to create the Dataset we need to resolve what product it belongs to, but how do I do that with just the UUID at my disposal?
Actually you are right, we already have the dataset document, extracted from the YAML sub-tree, so there is no need to query it from the DB. What is missing is the product assignment per lineage dataset, but since we won't be indexing those and only need to verify that they are already present in the database, we can just populate them with some fake product, I guess.
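A minimal sketch of the "fake product" idea, assuming nothing about the real datacube classes. StubProduct, StubDataset, PLACEHOLDER and lineage_stubs are hypothetical names made up for illustration; the stand-in dataset carries only an id, a placeholder product and an empty sources list, as described above.

```python
from dataclasses import dataclass, field
from typing import Iterable, List
from uuid import UUID


@dataclass
class StubProduct:
    """Hypothetical placeholder product; not datacube's product class."""
    name: str


@dataclass
class StubDataset:
    """Hypothetical stand-in holding only what a lineage link needs."""
    id: UUID
    product: StubProduct
    sources: List["StubDataset"] = field(default_factory=list)


# One shared placeholder, since these datasets are never indexed themselves.
PLACEHOLDER = StubProduct(name="placeholder")


def lineage_stubs(lineage_ids: Iterable[UUID]) -> List[StubDataset]:
    """Build one stub per lineage UUID, each with empty sources."""
    return [StubDataset(id=i, product=PLACEHOLDER) for i in lineage_ids]
```

Whether the real Dataset constructor tolerates a placeholder product like this is exactly the open question in this thread, so treat the above as a shape for discussion rather than a working patch.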
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This has been resolved with the new PostGIS database driver, and is a wontfix for the old PostgreSQL driver.
When adding a dataset with a large number of lineage datasets, we can avoid reading all the lineage datasets from the DB if running in the following mode
In this mode the expectation is that all the lineage is already present in the database, so there is no need to verify that the lineage documents in the DB match those in the YAML document.
Code here
https://github.com/opendatacube/datacube-core/blob/349dc1be1a65d1e231e3736a24983b42c6c7ba6f/datacube/index/hl.py#L146-L152
reads all the lineage datasets first, in case verification is needed later. Instead it should use bulk_has to simply verify that the lineage datasets are present in the database. When indexing statistical products, lineage can be very large (hundreds of datasets), and extracting all of them from the DB just to check that they are present is wasteful.

The big issue here is that the Dataset(..sources=..) parameter that captures lineage data expects to see a list of Dataset objects and not a list of mere UUIDs. So one would have to make up a "fake" dataset object with id=<uuid>, sources=[]
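To make the proposal concrete, here is a rough sketch of a presence-only check. It assumes a bulk_has-style call that takes a sequence of UUIDs and returns one boolean per id in the same order. StubDatasetResource and missing_lineage are hypothetical names standing in for index.datasets and for whatever helper hl.py would grow; neither is existing datacube code.

```python
from typing import Iterable, List, Set
from uuid import UUID, uuid4


class StubDatasetResource:
    """Hypothetical stand-in for index.datasets; mimics only a bulk_has-style call."""

    def __init__(self, known: Iterable[UUID]) -> None:
        self._known: Set[UUID] = set(known)

    def bulk_has(self, ids: Iterable[UUID]) -> List[bool]:
        # One membership answer per requested id, in request order.
        return [i in self._known for i in ids]


def missing_lineage(datasets: StubDatasetResource,
                    lineage_ids: Iterable[UUID]) -> List[UUID]:
    """Return lineage UUIDs that are absent, without fetching any documents."""
    ids = list(lineage_ids)
    present = datasets.bulk_has(ids)
    return [i for i, ok in zip(ids, present) if not ok]


if __name__ == "__main__":
    a, b, c = uuid4(), uuid4(), uuid4()
    resource = StubDatasetResource([a, b])
    # Only c is unknown to the stub index, so only c is reported.
    print(missing_lineage(resource, [a, c, b]) == [c])
```

An empty result would mean all lineage is already indexed, so the add could proceed without loading the lineage documents.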
Relevant code in index.datasets.add(..):
https://github.com/opendatacube/datacube-core/blob/349dc1be1a65d1e231e3736a24983b42c6c7ba6f/datacube/index/_datasets.py#L163-L171