I support the initiative and proposal. First questions I have are around uuid integrity:

- check/alert if a uuid is valid
- check/return existing uuids, so a user can check uuids they want to include
- delete uuids, possibly in a recursive manner
- view uuid trees

I suspect you have already thought about these, and the initial comment demonstrates this by specifically touching on a few of these points. The above points don't need to be fully developed before progressing this issue; rather, I contribute them as likely aspects that will need to be considered down the track, so early solutions should be aware of them.
@mpaget we certainly need to improve/extend the UI for dealing with lineage data, see #451 and #398 for example. For GA's needs on the NCI, we would need the following behaviour.

For a dataset to be added to the index, the lineage recorded in its document would need to be "compatible" with the lineage already recorded in the database.
The definition of "compatible" will need to be worked out; options are:

a. Exactly the same
b. Is a superset (i.e. new lineage was added to what is recorded in the DB after the time of computation, but nothing was deleted)
c. Anything goes, i.e. grandparents were removed from a parent's lineage

Probably not (c), but maybe (a) or (b) (sketched below).
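A minimal sketch of what option (b) might look like as a check at index time. Everything here is hypothetical: `document_lineage` and `recorded_lineage` stand for the sets of parent uuids taken from the incoming document and from the index, and I'm assuming "superset" means the DB copy may have grown since the document was computed.

```python
# Hypothetical compatibility check, option (b): the lineage already recorded
# in the DB may contain entries added after the dataset document was written,
# but nothing the document recorded may have been deleted.

def lineage_is_compatible(document_lineage: set, recorded_lineage: set) -> bool:
    # Option (a) would instead require document_lineage == recorded_lineage.
    return document_lineage <= recorded_lineage
```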
Allowing more flexible rules for other users of ODC could certainly be considered. However, this is probably predicated on ODC supporting "external lineage", where uuids are recorded for future reference but do not map to any dataset in this DB, and could possibly be looked up through some external service that aggregates metadata across multiple "well-known" datacubes.
Might be a stupid suggestion, but could the metadata be compressed instead, using zlib? That keeps the lineage in place if the performance overhead isn't too great.
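For what it's worth, a minimal sketch of the kind of compression being suggested, plain zlib over the serialised document; the document contents here are made up:

```python
import json
import zlib

# Toy stand-in for a lineage-heavy metadata document.
doc = {"id": "00000000-0000-0000-0000-000000000000", "lineage": {"source_datasets": {}}}

raw = json.dumps(doc).encode("utf-8")
packed = zlib.compress(raw, 9)                  # store these bytes instead of the raw text
restored = json.loads(zlib.decompress(packed))  # round-trips back to the original dict

print(len(raw), "->", len(packed), "bytes")
```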
@pbranson you're quite right to suggest compression for dataset metadata; datacube-stats has always compressed metadata, and we recently fixed the oversight of not enabling metadata compression in the ingestor and stacker (#452).
But the cost in bytes stored, while important, is not the main driving factor for this proposal. Before this metadata can become compressed bytes it needs to be pulled out of the database into a Python structure, which takes time and memory and puts load on the database. I'm not suggesting that we stop supporting full lineage document embedding; that will always be supported. I'm just proposing a lighter version to speed up new product development.
For every dataset file that eventually gets published there are many more that were generated in the process of tuning algorithm parameters or fixing bugs in the implementation. For complex algorithms that depend on many input sources and perform some kind of temporal reduction, these costs add up, slowing down algorithm iteration and making people wait, not just making computers do more work.
Once the results are good enough to be published, there is enough information to expand into a full lineage document to be embedded in the golden copy.
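A rough illustration of that expansion step. This is a sketch only, assuming the legacy `lineage` -> `source_datasets` document layout; `fetch_doc` is a hypothetical callable standing in for a lookup against the index by uuid.

```python
from typing import Callable, Dict


def expand_lineage(doc: Dict, fetch_doc: Callable[[str], Dict]) -> Dict:
    """Recursively replace uuid-only source references with full documents.

    `doc` is a metadata document whose lineage nodes may contain only an `id`;
    `fetch_doc(uuid)` returns the full stored document for that uuid.
    """
    expanded = dict(doc)
    sources = doc.get("lineage", {}).get("source_datasets", {})
    expanded["lineage"] = {
        "source_datasets": {
            name: expand_lineage(fetch_doc(node["id"]), fetch_doc)
            for name, node in sources.items()
        }
    }
    return expanded
```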
Thanks for the detailed reply, Kirill, makes sense.
Is there any downside to using proper database relationships rather than either ID references without relational key enforcement or the current (not ideal) method of repeating the entire chain of metadata?
@alexgleith we do use "proper database relationships" in the database. We also dump those relationships to disk when generating new datasets. The proposal was to dump just the relations, not the full metadata of the "parent" datasets. With EO3-style metadata we already do what this issue was proposing.
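For illustration only (the keys and values below are indicative, not the exact EO3 schema): with embedded lineage every parent carries its whole document, while id-only lineage records nothing but the parent uuids and leaves resolution to the index.

```python
# Embedded lineage (the current heavyweight form): each named source carries
# its full metadata document, recursively.
embedded = {
    "id": "aaaaaaaa-0000-0000-0000-000000000001",
    "lineage": {
        "source_datasets": {
            "nbar": {
                "id": "bbbbbbbb-0000-0000-0000-000000000002",
                # ...full nbar document, including its own lineage...
            },
        },
    },
}

# Id-only lineage (roughly what EO3-style documents record): just the uuids,
# resolved against the index when the full documents are needed.
id_only = {
    "id": "aaaaaaaa-0000-0000-0000-000000000001",
    "lineage": {"nbar": ["bbbbbbbb-0000-0000-0000-000000000002"]},
}
```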
I would leave this open until we move EO3 support out of dea-proto into datacube-core.
Currently `datacube dataset add` expects that the metadata document for the dataset being indexed includes the full metadata document for every dataset in its direct lineage, and these in turn, recursively, include all the metadata documents for their lineage, and so on. This approach has certain data-integrity and data-distribution advantages and works relatively well for simple transforms, like scene -> nbar_scene -> nbar_albers -> fc. However, it becomes quite cumbersome for full-history statistical products like the geometric median: the metadata document grows really large, quickly.

I propose we allow uuid-only lineage to be accepted as an alternative. This still records the full lineage of a generated dataset at the time of computation, but relies on access to the original database to translate uuids to actual metadata.

Essentially this encodes an N-way tree (constructed from the Directed Acyclic Graph, which is the real lineage dependency structure), where each node has an `id` property and a dictionary of named sources that contain more nodes. A dictionary rather than just a list is used to be consistent with the way we capture this information currently.

When you add a dataset with lightweight lineage data to the datacube, the expectation is that all referenced datasets are already present in the database.
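As a made-up example of the lightweight form described above, each node carries only an `id` and a dictionary of named sources containing further nodes; the field names are illustrative:

```python
# Hypothetical uuid-only lineage tree: an id plus a dict of named source
# nodes, mirroring the existing named-source layout but without any of the
# embedded parent metadata.
lightweight_lineage = {
    "id": "ffffffff-0000-0000-0000-000000000001",   # the dataset being added
    "sources": {
        "nbar": {
            "id": "ffffffff-0000-0000-0000-000000000002",
            "sources": {
                "level1": {
                    "id": "ffffffff-0000-0000-0000-000000000003",
                    "sources": {},
                },
            },
        },
        "pq": {
            "id": "ffffffff-0000-0000-0000-000000000004",
            "sources": {},
        },
    },
}
```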