sylar11 closed this issue 3 years ago
Just for context from our experience at Near Space Labs, we've used three internal fields for lineage, raw and parent STAC item ids:
/*
The STAC item id for the image data that this data was derived from.
The previous image data in the processing chain, if saved (convenience field,
the last item in lineage_data_ids)
*/
string parent_data_id = 10;
/*
Ordered list of all STAC item ids for images (if saved) of steps that led to this image
*/
repeated string lineage_data_ids = 11;
/*
The STAC item id for the raw data id from which this entry was derived (convenience field,
the first item in lineage_data_ids).
*/
string raw_data_id = 12;
It would be great to have this standardized.
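To make the relationship between the three fields concrete, here is a minimal Python sketch. The class name `LineageFields` and the example ids are hypothetical, chosen only for illustration; the sketch just assumes, as the comments above state, that the parent and raw ids are convenience views of the last and first entries of `lineage_data_ids`.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LineageFields:
    """Hypothetical Python mirror of the three protobuf fields above."""

    # Ordered STAC item ids of the saved steps that led to this image.
    lineage_data_ids: List[str] = field(default_factory=list)

    @property
    def raw_data_id(self) -> Optional[str]:
        # Convenience view: the first entry in the lineage, if any.
        return self.lineage_data_ids[0] if self.lineage_data_ids else None

    @property
    def parent_data_id(self) -> Optional[str]:
        # Convenience view: the last entry in the lineage, if any.
        return self.lineage_data_ids[-1] if self.lineage_data_ids else None


# Example ids are made up: a geo-referenced image derived from a raw frame
# via a color corrected intermediate.
geo = LineageFields(lineage_data_ids=["raw-001", "color-corrected-001"])
```

Deriving the two convenience values from the list, rather than storing them separately, is one way to keep them from drifting apart; the protobuf above stores all three explicitly.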
I could see this living in the scientific extension. Since this field seems to come from another standard you use, could you share how it is defined there, for reference? In STAC we are a bit wary of free-text fields other than the general description, because it is often unclear what exactly should go in them. So we would want a clear definition.
For "derived from", STAC has a corresponding rel type, so you can link to the items that an item was derived from.
@m-mohr I think a good description of LINEAGE in the ISO standard can be found here: https://wiki.earthdata.nasa.gov/display/NASAISO/Lineage+Information
Please let me know if it is enough or if it would be better to add further information.
@m-mohr when you mention the rel types, you mean the RFC 5829 predecessor-version and successor-version, correct?
I'm not sure how to apply this in an image processing chain, or maybe it's up to the user. Say I have 3 stages of data in our processing: raw, color corrected, and geo-referenced. When we process an image we save a result of each step for inspection. We run the processing chain once and save each step as v1 results (a v1 color corrected and a v1 geo-referenced). We improve our geo-referencing algorithm and run the processing again and save them as v2 results.
In STAC, is the predecessor-version of our v2 geo-referenced result the v1 geo-referenced result or the v2 color corrected result?
No, I was referring to the "derived_from" here: https://github.com/radiantearth/stac-spec/blob/master/item-spec/item-spec.md#relation-types Indeed, it's mostly up to the user how it's used. It's due to the fact that provenance can really get complex, thus the note under the table:
Note regarding the type derived_from: A full provenance model is far beyond the scope of STAC, and the goal is to align with any good independent spec that comes along for that. But the derived_from field is seen as a way to encourage fuller specs and at least start a linking structure that can be used as a jumping off point for more experiments in provenance tracking
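As an illustration of the linking approach described above, here is a minimal Python sketch that builds `derived_from` link objects for an item. The helper name `derived_from_links`, the item id and the relative href are hypothetical, and the dictionary is only a fragment of an item, not a complete valid STAC item.

```python
from typing import Dict, List


def derived_from_links(source_hrefs: List[str]) -> List[Dict[str, str]]:
    """Build one link object with rel "derived_from" per source item."""
    return [
        {"rel": "derived_from", "href": href, "type": "application/geo+json"}
        for href in source_hrefs
    ]


# Fragment of an item for the geo-referenced v2 result; the href pointing at
# the color corrected v2 item is an assumed example path.
item = {
    "type": "Feature",
    "id": "geo-referenced-v2",
    "links": derived_from_links(["./color-corrected-v2.json"]),
}
```

Which items to list as sources (only the direct parent, or every step back to the raw data) is left to the user, as noted above.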
Edit: I just remembered that @prateekbudhwar recently asked about Lineage, too.
Prateek Budhwar is a colleague of mine. Some weeks ago I asked him why an item like that was missing in STAC. That's why he contacted you, @m-mohr, and why I decided to open this issue.
@m-mohr in my opinion "derived_from" is something different from lineage. Lineage is a short but comprehensive description of the scientific steps the author took to produce the output he/she is describing. The "derived_from" rel type, please correct me if I'm wrong, is only a URL to the source; it does not describe the steps in between that produced the output. What do you think?
@sylar11 Yes, you are correct. STAC never had the aim to specify provenance. Thus, there's no "one way" to put the lineage description, but there are still some options you could use:
If there's broader demand, I could imagine adding a sci:lineage (or similar) field to the scientific extension, too.
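Purely as a sketch of what such a field might look like in practice: the snippet below attaches a free-text lineage description to an item's properties. `sci:lineage` is only the name floated in this thread, not a published field, and the helper `with_lineage`, the datetime and the description text are hypothetical.

```python
from typing import Any, Dict


def with_lineage(properties: Dict[str, Any], lineage: str) -> Dict[str, Any]:
    """Return a copy of the item properties with a free-text lineage field."""
    text = lineage.strip()
    if not text:
        raise ValueError("lineage description must not be empty")
    # "sci:lineage" is the field name proposed in this thread (an assumption,
    # not part of the published scientific extension).
    return {**properties, "sci:lineage": text}


properties = with_lineage(
    {"datetime": "2020-06-01T00:00:00Z"},
    "Raw frame was color corrected, then geo-referenced.",
)
```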
thanks for the clarifications @m-mohr. I'm always impressed at how much STAC knowledge you possess.
For now, I made PR #881 to add a bit of best practices for lineage info.
Implemented in PR #907.
Hi, I would like to ask whether it would be possible to add LINEAGE as a STAC item field. Except for raw data, data lineage is useful in every field of research and/or data. It is an understanding of where data comes from and how it has been transformed to reach its present state (e.g. from raw pixel values to NDVI using the following formula...).
Nowadays, given the volume of data being analyzed, it is important to keep track of the analysis behind the numbers that a given data object contains.
A potential place where I would look for this item would be this Scientific Extension Specification: https://github.com/radiantearth/stac-spec/tree/master/extensions/scientific.
But, in my opinion, it could also have its own place with greater visibility.