radiantearth / stac-spec

SpatioTemporal Asset Catalog specification - making geospatial assets openly searchable and crawlable
https://stacspec.org
Apache License 2.0

LINEAGE field for provenance #863

Closed sylar11 closed 3 years ago

sylar11 commented 4 years ago

Hi, I would like to ask whether it would be possible to add LINEAGE as a STAC Item field. Except for raw data, data lineage is useful in every field of research and/or data. It is an understanding of where data comes from and how it has been transformed to reach its present state (e.g. from raw pixel values to NDVI using the following formula...)

Nowadays, considering the magnitude of the analyzed data, it is important to keep track of the analysis behind the numbers that a data object reports.

A potential place where I would look for this item would be this Scientific Extension Specification: https://github.com/radiantearth/stac-spec/tree/master/extensions/scientific.

But, in my opinion, it could also have its own place with greater visibility.

davidraleigh commented 4 years ago

Just for context from our experience at Near Space Labs, we've used three internal fields for lineage, raw and parent STAC item ids:

    /*
    The STAC item id for the image data that this data was derived from. 
    The previous image data in the processing chain, if saved (convenience field, 
    the last item in lineage_data_ids)
    */
    string parent_data_id = 10;

    /*
    Ordered list of all STAC item ids for images (if saved) of steps that led to this image
    */
    repeated string lineage_data_ids = 11;

    /*
    The STAC item id for the raw data id from which this entry was derived (convenience field, 
    the first item in lineage_data_ids).
    */
    string raw_data_id = 12;

It would be great to have this standardized.
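The relationship between the three fields can be sketched in a few lines of Python. This is only an illustration, assuming the ordering described in the comments above (raw item first, direct parent last); the `LineageFields` class and the example ids are hypothetical and are not part of STAC or of the Near Space Labs code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LineageFields:
    """Hypothetical mirror of the three fields above; not a STAC definition."""

    # Ordered oldest-first: the raw item id comes first, the direct parent last.
    lineage_data_ids: List[str] = field(default_factory=list)

    @property
    def raw_data_id(self) -> Optional[str]:
        # Convenience field: the first entry in lineage_data_ids, if any.
        return self.lineage_data_ids[0] if self.lineage_data_ids else None

    @property
    def parent_data_id(self) -> Optional[str]:
        # Convenience field: the last entry in lineage_data_ids, if any.
        return self.lineage_data_ids[-1] if self.lineage_data_ids else None


# Made-up ids for a geo-referenced image derived from two earlier steps.
georeferenced = LineageFields(["raw-001", "color-corrected-001"])
print(georeferenced.raw_data_id, georeferenced.parent_data_id)
```

Deriving the two convenience fields from the list, rather than storing them separately, keeps the three values from drifting apart.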

m-mohr commented 4 years ago

I could see this living in the scientific extension. As it seems this field originally comes from another standard you use, could you share how it is defined there, for reference? In STAC we are a bit wary of free-text fields apart from the general description, as it's often not clear what exactly should be put there. So we want a clear description.

For "derived from", STAC has the corresponding rel type so that you can link to the items derived from.

sylar11 commented 4 years ago

@m-mohr I think that a good description of LINEAGE in the ISO standard can be found here https://wiki.earthdata.nasa.gov/display/NASAISO/Lineage+Information

Please let me know if this is enough or if it would be better to add further information.

davidraleigh commented 4 years ago

@m-mohr when you mention the rel types, do you mean the RFC 5829 predecessor-version and successor-version?

I'm not sure how to apply this in an image processing chain, or maybe it's up to the user. Say I have 3 stages of data in our processing: raw, color corrected, and geo-referenced. When we process an image we save a result of each step for inspection. We run the processing chain once and save each step as v1 results (a v1 color corrected and a v1 geo-referenced). We improve our geo-referencing algorithm and run the processing again and save them as v2 results.

In STAC, for our v2 geo-referenced result, is its predecessor-version the v1 geo-referenced result or the v2 color corrected result?

m-mohr commented 4 years ago

No, I was referring to the "derived_from" here: https://github.com/radiantearth/stac-spec/blob/master/item-spec/item-spec.md#relation-types

Indeed, it's mostly up to the user how it's used. That is because provenance can get really complex, hence the note under the table:

Note regarding the type derived_from: A full provenance model is far beyond the scope of STAC, and the goal is to align with any good independent spec that comes along for that. But the derived_from field is seen as a way to encourage fuller specs and at least start a linking structure that can be used as a jumping off point for more experiments in provenance tracking
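As a rough illustration of that linking structure, the sketch below chains three Item-like dicts with `derived_from` links and walks the chain back to the raw data. It assumes the raw, color corrected, and geo-referenced stages mentioned earlier in the thread; the hrefs, the in-memory `catalog`, and the helper names are hypothetical, and the dicts carry only `id` and `links`, so they are not complete STAC Items.

```python
from typing import Dict, List


def make_item(item_id: str, derived_from: List[str]) -> dict:
    """Build a minimal Item-like dict; only `id` and `links` are filled in."""
    return {
        "id": item_id,
        "links": [{"rel": "derived_from", "href": href} for href in derived_from],
    }


# Hypothetical hrefs, keyed in a dict so they resolve without network access.
catalog: Dict[str, dict] = {
    "./raw.json": make_item("raw", []),
    "./color-corrected.json": make_item("color-corrected", ["./raw.json"]),
    "./geo-referenced.json": make_item("geo-referenced", ["./color-corrected.json"]),
}


def trace_sources(href: str) -> List[str]:
    """Follow the first derived_from link repeatedly; return ids, newest first."""
    chain: List[str] = []
    seen = set()  # guards against accidental link cycles
    while href in catalog and href not in seen:
        seen.add(href)
        item = catalog[href]
        chain.append(item["id"])
        sources = [l["href"] for l in item["links"] if l["rel"] == "derived_from"]
        if not sources:
            break
        href = sources[0]
    return chain


print(trace_sources("./geo-referenced.json"))
```

The links only say which item a result came from. What happened between two stages still has to be described somewhere else.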

Edit: I just remembered that @prateekbudhwar recently asked about Lineage, too.

sylar11 commented 4 years ago

Prateek Budhwar is a colleague of mine. Some weeks ago I asked him why an item like that was missing in STAC. That's why he contacted you, @m-mohr, and I decided to open this issue.

sylar11 commented 4 years ago

@m-mohr in my opinion "derived_from" is something different from the lineage definition. Lineage is a short but comprehensive description of the scientific steps the author took to get the output he/she is describing. By contrast, please correct me if I'm wrong, the "derived_from" relation type is only a URL to the source and does not describe the steps in between that lead to the output. What do you think?

m-mohr commented 4 years ago

@sylar11 Yes, you are correct. STAC never had the aim to specify provenance. Thus, there's no "one way" to put the lineage description, but there are still some options you could use:

If there's broader demand, I could imagine adding a sci:lineage (or similar) field to the scientific extension, too.
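To make the idea concrete, here is a minimal sketch of how such a free-text field might sit in an Item's properties. It assumes the `sci:lineage` name floated above, which is treated here as hypothetical and not as an existing field; the helper name, the datetime, and the lineage text are made up for illustration.

```python
import json


def with_lineage(properties: dict, text: str) -> dict:
    """Return a copy of `properties` with a hypothetical sci:lineage entry."""
    updated = dict(properties)
    # "sci:lineage" is an assumed field name taken from the suggestion above.
    updated["sci:lineage"] = text.strip()
    return updated


# Made-up properties for a derived NDVI product.
properties = with_lineage(
    {"datetime": "2020-07-01T00:00:00Z"},
    "NDVI computed from raw pixel values as (NIR - Red) / (NIR + Red).",
)
print(json.dumps(properties, indent=2))
```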

davidraleigh commented 4 years ago

Thanks for the clarifications, @m-mohr. I'm always impressed by how much STAC knowledge you possess.

m-mohr commented 4 years ago

For now, I made PR #881 to add a bit of best practices for lineage info.

m-mohr commented 3 years ago

Implemented in PR #907.