An idea for "forward" provenance

hgs-sellis commented 2 years ago

This issue is to explain and discuss one possible way of handling what I am calling "forward" provenance. I'm not sure how general this solution is, but it seems to meet the needs of my company right now, and I wanted to get it in front of the STAC community and maybe foster some discussion.

One can think of provenance as going "backward" and "forward." That is, if you have a derived item in hand, you might want to go "backward" to its source items. Conversely, if you have a source item in hand, you might want to go "forward" to the items that were derived from it.

The Processing Extension of the STAC spec already addresses the backward case: "It is recommended to add one or more links with derived_from or via relationships to the source metadata & data used in the processing. They could be used to trace back the processing history of the dataset."

In the examples below, the use case is Change Detection, where Collection1/Item1 and Collection2/Item2 are used as inputs to produce the derived item Collection3/Item3.

So, to support backward provenance, the spec says that derived items should have links that go back to their sources.

[Figure: backward_provenance]

One benefit of this approach is that when you have a derived item in hand, you immediately have links to all of its sources. Another is that it is easy to add the proper backward links when the derived item is created.
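As a rough sketch of the backward case, here is how the derived item from the Change Detection example might carry its derived_from links. The hrefs and the `make_derived_item` helper are made up for illustration, and the dict is trimmed down (no geometry, assets, etc.), so it is not a complete STAC item.

```python
import json


def make_derived_item(item_id, collection_id, source_hrefs):
    """Build a trimmed-down item dict with one derived_from link per source."""
    return {
        "type": "Feature",
        "id": item_id,
        "collection": collection_id,
        "properties": {},
        "links": [{"rel": "derived_from", "href": href} for href in source_hrefs],
    }


# Hypothetical hrefs; a real catalog would use its own URLs.
item3 = make_derived_item(
    "Item3",
    "Collection3",
    [
        "https://example.com/stac/collections/Collection1/items/Item1",
        "https://example.com/stac/collections/Collection2/items/Item2",
    ],
)
print(json.dumps(item3["links"], indent=2))
```

Everything needed to go backward lives in the derived item, and it is written once, at creation time.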

Now, how might we solve the forward case? One option might be to simply turn the existing pattern around, and put forward links in the source items.

[Figure: forward_provenance_links]

In this example, I have invented the new link relation source_for to indicate forward provenance, the opposite of derived_from. This design has some benefits, but also some pretty big problems. The benefit is that when you have a source item in hand, you immediately have links to all of its derived items. One problem is that when you create the derived item, you also have to update the (possibly multiple) source items to add the forward links. This adds a lot of complexity. In a RESTful environment, the STAC server must use ETags and optimistic locking to prevent the lost update problem. So updating an item can be a fairly complex operation that complicates the client and should be avoided when possible. It also means that the provenance information we need to go forward and backward is now spread across all of the STAC items involved.
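To make the update cost concrete, here is a minimal sketch of the read-modify-write loop a client would need for every source item. `InMemoryItemStore` is a made-up stand-in for a server that enforces If-Match on updates; it is not a real STAC API client, and the keys and hrefs are placeholders.

```python
import copy
import hashlib
import json


class PreconditionFailed(Exception):
    """Raised when the caller's ETag is stale (comparable to HTTP 412)."""


class InMemoryItemStore:
    """Hypothetical stand-in for a server that checks If-Match on updates."""

    def __init__(self):
        self._items = {}

    @staticmethod
    def _etag(item):
        # Derive the ETag from the item's content.
        return hashlib.sha256(json.dumps(item, sort_keys=True).encode()).hexdigest()

    def create(self, key, item):
        self._items[key] = copy.deepcopy(item)

    def get(self, key):
        item = copy.deepcopy(self._items[key])
        return item, self._etag(item)

    def put(self, key, item, if_match):
        # Reject the write if the stored item changed since the caller read it.
        if self._etag(self._items[key]) != if_match:
            raise PreconditionFailed(key)
        self._items[key] = copy.deepcopy(item)


def add_forward_link(store, source_key, derived_href, max_attempts=3):
    """Append a source_for link to one source item, retrying on a stale ETag."""
    for _ in range(max_attempts):
        item, etag = store.get(source_key)
        item["links"].append({"rel": "source_for", "href": derived_href})
        try:
            store.put(source_key, item, if_match=etag)
            return
        except PreconditionFailed:
            continue  # someone else updated the item first: re-read and retry
    raise RuntimeError(f"gave up updating {source_key}")


store = InMemoryItemStore()
store.create("1:1", {"id": "Item1", "links": []})
store.create("2:2", {"id": "Item2", "links": []})

# Creating one derived item means one guarded update per source item.
for source_key in ("1:1", "2:2"):
    add_forward_link(store, source_key, "https://example.com/stac/collections/Collection3/items/Item3")
```

Even in this toy form, creating one derived item turns into N guarded updates plus retry handling, which is the complexity I would like to avoid.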

This is why I came up with the following design for forward provenance using a property called "derived_from_items." The links section is not searchable, but the properties are. If we duplicate a small amount of information from the links section into this new property (the collection ids and item ids of the source items), then we can easily find all of the items derived from a given source.

[Figure: forward_provenance_property_search]
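Here is one way the property could be populated when the derived item is created. The helper names and the ":" delimiter are my own choices for illustration; the only real requirement is that the delimiter can be split unambiguously, so this sketch rejects collection ids that contain it.

```python
DELIMITER = ":"  # assumption: never appears inside a collection id


def provenance_key(collection_id, item_id):
    """Encode a source item as <collectionId><delimiter><itemId>."""
    if DELIMITER in collection_id:
        raise ValueError(f"collection id {collection_id!r} contains the delimiter")
    return f"{collection_id}{DELIMITER}{item_id}"


def with_derived_from_items(item, sources):
    """Return a copy of item whose properties list its (collection, item) sources."""
    updated = dict(item)
    updated["properties"] = dict(item.get("properties", {}))
    updated["properties"]["derived_from_items"] = [
        provenance_key(collection_id, item_id) for collection_id, item_id in sources
    ]
    return updated


item3 = with_derived_from_items(
    {"id": "3", "collection": "3", "properties": {}},
    [("1", "1"), ("2", "2")],
)
print(item3["properties"]["derived_from_items"])  # ['1:1', '2:2']
```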

Now, if I have Collection1/Item1 in hand, I can do the following search (using a STACQL query). Each value in the list has the form <collectionId><delimiter><itemId>, here with ":" as the delimiter:

POST /stac/search
{
    "query": {
        "derived_from_items": {
            "in": ["1:1"]
        }
    }
}

That search will return all of the items that have "1:1" in their "derived_from_items" property, and those are all of the items derived from the source item I have in hand. Same for item Collection2/Item2 when I do a search for "2:2". A big benefit here is that we don't have to update the source items at all when a derived item is created. All of the provenance information we need to go forward and backward is isolated just in the derived item.
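A small sketch of both halves: building the search body above, and a local filter that emulates the result I expect back. `forward_search_body` and `find_derived` are illustrative names, and the catalog is toy data. Whether a particular server's "in" operator matches array-valued properties this way is an assumption worth checking against that server.

```python
import json


def forward_search_body(source_key):
    """Build the search body for items derived from the given source key."""
    return {"query": {"derived_from_items": {"in": [source_key]}}}


def find_derived(items, source_key):
    """Local emulation of the intended result: items that list source_key as a source."""
    return [
        item
        for item in items
        if source_key in item.get("properties", {}).get("derived_from_items", [])
    ]


# Toy catalog: two source items and one derived item.
catalog = [
    {"id": "1", "collection": "1", "properties": {}},
    {"id": "2", "collection": "2", "properties": {}},
    {"id": "3", "collection": "3", "properties": {"derived_from_items": ["1:1", "2:2"]}},
]

print(json.dumps(forward_search_body("1:1")))
print([item["id"] for item in find_derived(catalog, "1:1")])
```

Note that the source items in the toy catalog are never touched; only the derived item carries provenance.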

Thoughts on this approach? Maybe this solution is general enough to go in the Processing Extension? Any and all feedback is welcome!

Thanks, Scott

jamesvrt commented 2 years ago

Sounds very useful for automatically building provenance graphs.

One thought: could this be handled as an extension for /search in stac-api? Rather than new STAC properties, instead provide a way to search links by rel, collection and item.

Maybe the answer to that comes down to: Is the preference in the STAC community for extending the STAC itself or extending stac-api capabilities?

hgs-sellis commented 2 years ago

@jamesvrt Thanks for the feedback!

I raised this question at the last bi-weekly STAC meeting and there was general consensus that trying to search the links section for forward provenance is probably too complicated compared to the property search approach described here.

@cholmes mentioned that the derived_from links currently in the spec were not fully thought out. It is nice to have them, but maybe source items should be in a searchable property too? That would provide symmetry/consistency with the design proposed here.

jamesvrt commented 2 years ago

Sounds good then. I know I'd benefit from this property for data pipelining and feature engineering workflows.

m-mohr commented 1 year ago

This is a relation type that should be added to the processing extension. See https://github.com/stac-extensions/processing/issues/27

radiantearth / stac-spec

An idea for "forward" provenance #1179