microbiomedata / metaG

End-to-end metaG workflow (WIP)

Tracking Automation Changes #5

Open scanon opened 2 years ago

scanon commented 2 years ago
Michal-Babins commented 2 years ago

I think there is also the issue of tracking metadata and having it update automatically as the data goes through its "data processing journey". Before we can even begin to process data, we need to stage it so that we collect the appropriate data files and metadata needed to populate descriptors in the processing steps (e.g. "id", "type", "resource"), and from the time processing begins until it finishes, more metadata is added to the pool: What version of the workflow are we running? What data products were produced? All of this data is eventually interconnected by its labels, such as the original id and the NMDC activity id.

The question I pose is: how do we best manage that data so it reflects these fluid processing mechanics? Do we need a set of runtime APIs that allow us to communicate and update the "data tracking" as it occurs? All of the data we use at the beginning to populate our fields for staging and starting processing is still relevant by the time we ingest, plus all the metadata collected along the way. I hope this helps clarify some of the intrinsic complexities that, if resolved properly, could allow for easier management of updates and versioning.
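To make the "data processing journey" concrete, here is a minimal Python sketch of one way a per-dataset tracking record could accumulate metadata across stages. This is only an illustration of the idea, not an existing NMDC API: apart from the "id", "type", "resource", workflow-version, and activity-id fields mentioned above, every class, field, and method name here is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical tracking record for one dataset. Field names echo the
# descriptors discussed above ("id", "type", "resource", workflow version,
# NMDC activity id); the structure itself is an assumption.
@dataclass
class TrackingRecord:
    original_id: str                   # id assigned at staging time
    resource: str                      # where the raw data was staged from
    type: str = "metagenome"           # descriptor populated before processing
    nmdc_activity_id: str = ""         # assigned once processing starts
    workflow_version: str = ""         # recorded when the workflow runs
    data_products: List[str] = field(default_factory=list)
    history: List[str] = field(default_factory=list)

    def update(self, stage: str, **fields) -> None:
        """Record a processing stage and merge any newly collected metadata."""
        for key, value in fields.items():
            if key == "data_products":
                self.data_products.extend(value)
            else:
                setattr(self, key, value)
        self.history.append(stage)

# Staging populates the initial descriptors...
rec = TrackingRecord(original_id="SRR0000001", resource="example-source")

# ...and each later step adds to the pool while keeping the original labels.
rec.update("processing_started",
           nmdc_activity_id="nmdc:act-0001",
           workflow_version="metaG-v1.0.0")
rec.update("processing_finished",
           data_products=["assembly.fasta", "annotations.gff"])
```

The point of the sketch is that the staging-time metadata is never replaced, only extended, so the record still carries its original id (and everything collected along the way) by ingest time. A set of runtime APIs could expose exactly this kind of `update` operation to the individual workflow steps.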