microbiomedata / nmdc-runtime

Runtime system for NMDC data management and orchestration
https://microbiomedata.github.io/nmdc-runtime/

Mint & Validate versioned `WorkflowExecutionActivity` `nmdc:` IDs for workflows of same type and `OmicsProcessing` #529

Open aclum opened 1 month ago

aclum commented 1 month ago

Endpoints that accept POST requests (/workflows_activities, json:submit) for WorkflowExecutionActivity subclasses should verify that the versioning rules are being followed. The schema validates the syntax of the ID but not that the version was incremented correctly.

Expected behavior: The first time a workflow runs (i.e., for a unique value of was_informed_by), an example workflow identifier would be nmdc:wfrqc-11-abc1d.1. The second time a workflow is run for the same value of was_informed_by, the workflow should keep the identifier through the blade and increment only the ID version (i.e., nmdc:wfrqc-11-abc1d.2).
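To make the expected behavior concrete, here is a minimal Python sketch that splits a versioned ID into its base (everything through the blade) and its version, and computes the next version. The regular expression and the function names `split_versioned_id` and `next_versioned_id` are assumptions made for illustration, not part of nmdc-runtime; the pattern only covers IDs shaped like the example above.

```python
import re
from typing import Tuple

# Assumed shape: nmdc:<typecode>-<shoulder>-<blade>.<version>
# (illustrative pattern only; the authoritative syntax lives in the NMDC schema)
_VERSIONED_ID = re.compile(
    r"^(?P<base>nmdc:[a-z]+-[0-9]+-[a-z0-9]+)\.(?P<version>[1-9][0-9]*)$"
)


def split_versioned_id(workflow_id: str) -> Tuple[str, int]:
    """Return (base ID through the blade, integer version)."""
    match = _VERSIONED_ID.match(workflow_id)
    if match is None:
        raise ValueError(f"not a versioned workflow ID: {workflow_id!r}")
    return match.group("base"), int(match.group("version"))


def next_versioned_id(workflow_id: str) -> str:
    """Keep the base ID and increment only the version suffix."""
    base, version = split_versioned_id(workflow_id)
    return f"{base}.{version + 1}"
```

For example, `next_versioned_id("nmdc:wfrqc-11-abc1d.1")` yields `nmdc:wfrqc-11-abc1d.2`, which is the ID a second run for the same was_informed_by value should carry.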

There is currently no validation on this and we have instances where a second run of a workflow for a value of was_informed_by mints a new ID instead of incrementing the version id.

cc @shreddd to identify someone to work on this.

PeopleMakeCulture commented 1 month ago

ref: https://microbiomedata.github.io/nmdc-schema/identifiers/#ids-minted-for-use-within-nmdc

@aclum could you clarify when a WorkflowExecutionActivity should be attached to a new version of an existing id, vs when it should have a newly minted id? Given two WorkflowExecutionActivity workflows of the same type (eg MagAnalysisActivity) and the same was_informed_by, will it:

  1. Always be the case that the second workflow is a new version of the first?
  2. Sometimes be the case that both workflows should have unique, newly minted version 1 ids?

PeopleMakeCulture commented 1 month ago

Depending on the desired behavior, we can either:

  1. Update the JSON validator in nmdc_runtime/util.py so that whenever a POST request includes one or more WorkflowExecutionActivity docs, the validator looks for existing workflows with the same type and was_informed_by values. If prior versions exist, we can enforce that the request uses the correct version. NOTE: This would require versions to increment in a consistent way (e.g., v1, v2, ..., not v1, v1a, v1b, v2, ...).

  2. Determine the logic for when a workflow should get a newly minted id vs a new version number.
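As a rough sketch of what option 1 might look like, the function below checks one submitted doc against previously stored docs held in a plain Python list. It does not touch nmdc_runtime/util.py or MongoDB; the function name `check_activity_version`, the error messages, and the strict "latest + 1" rule are all assumptions for illustration.

```python
from typing import Iterable, List


def check_activity_version(submitted: dict, existing: Iterable[dict]) -> List[str]:
    """Return a list of versioning errors (empty if the submission looks fine).

    Hypothetical helper: docs are plain dicts with "id", "type", and
    "was_informed_by" keys, and IDs end in ".<integer version>".
    """
    base, _, version = submitted["id"].rpartition(".")
    if not base or not version.isdigit():
        return [f"{submitted['id']}: missing '.<version>' suffix"]

    # Prior runs of the same workflow type for the same was_informed_by value.
    prior = [
        doc for doc in existing
        if doc["type"] == submitted["type"]
        and doc["was_informed_by"] == submitted["was_informed_by"]
    ]
    if not prior:
        return [] if int(version) == 1 else [f"{submitted['id']}: first run should be version 1"]

    errors = []
    prior_bases = {doc["id"].rpartition(".")[0] for doc in prior}
    if base not in prior_bases:
        errors.append(f"{submitted['id']}: new ID minted instead of reusing {sorted(prior_bases)}")
    latest = max(int(doc["id"].rpartition(".")[2]) for doc in prior)
    if int(version) != latest + 1:
        errors.append(f"{submitted['id']}: expected version {latest + 1}")
    return errors
```

In a real validator the `existing` docs would come from a database query on type and was_informed_by; whether the version rule should be "exactly latest + 1" or merely "greater than latest" is a policy choice that this sketch hard-codes as the stricter one.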

aclum commented 1 month ago

Given two WorkflowExecutionActivity workflows of the same type (e.g., MagAnalysisActivity) and the same was_informed_by, it should always increment. Ideally this would be written generically enough to handle the migration to Berkeley, so it would also have to look at WorkflowChain.

PeopleMakeCulture commented 1 month ago

We should consider race conditions here: what happens if a .2 is minted but not used and another request is made, etc.?

@aclum How frequently would you expect this kind of race condition to occur? Would it be acceptable for the minter to give out strictly increasing sequential version ids (1, 2, 3, etc.), and for the versions in mongo to be strictly increasing but non-sequential (1, 3, 7)?
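To illustrate the minter side of this question, here is a small in-memory Python sketch of a minter that hands out strictly increasing, sequential versions per base ID under a lock. The class `VersionMinter` is hypothetical and stands in for whatever atomic counter the real minter would keep in its database; if a minted version is never submitted, the stored versions end up increasing but non-sequential.

```python
import threading
from collections import defaultdict


class VersionMinter:
    """Hypothetical in-memory minter: one counter per base ID."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._last = defaultdict(int)  # base ID -> last version handed out

    def mint(self, base_id: str) -> str:
        # The lock makes read-increment-return atomic, so two concurrent
        # callers can never receive the same version for the same base ID.
        with self._lock:
            self._last[base_id] += 1
            return f"{base_id}.{self._last[base_id]}"
```

If three requests mint versions 1, 2, and 3 for the same base ID but only the first and third are ever submitted, the minter stays sequential while the stored versions read 1, 3.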

scanon commented 1 month ago

Right now I figure this out in the scheduler. It is possible for a gap to occur but that would typically be due to some error. Does this answer the question?

dwinston commented 1 month ago

> It is possible for a gap to occur but that would typically be due to some error. Does this answer the question?

@scanon if the last-submitted version of an activity is N, should the runtime accept submission of a version N+2 and subsequently reject submission of a version N+1 (prompting the submitter to mint a new ID)? That is, should the runtime enforce increment-by-1 order, or just total order?

aclum commented 1 month ago

I think we'll get into trouble if we don't accept N+2. There will be a small percentage of errors, so if the runtime increments every time, some version numbers will be skipped. This would be confusing to someone looking at the identifier because there would be missing version numbers, but I'm not sure how else to do this. The identifiers are going to be embedded in the data headers for assembly and annotation, so you wouldn't want to change these after the fact if the runtime rejects the submission.
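The two policies under discussion (increment-by-1 versus total order) can be compared with a short Python sketch. The function name `accepts_version` and its `strict` flag are assumptions for illustration only; they are not an existing runtime API.

```python
def accepts_version(latest: int, submitted: int, strict: bool) -> bool:
    """Decide whether a submitted version is acceptable.

    latest:    highest version already stored (0 if none yet)
    submitted: version on the incoming activity
    strict:    True  -> enforce increment-by-1 (submitted == latest + 1)
               False -> enforce total order only (submitted > latest)
    """
    if strict:
        return submitted == latest + 1
    return submitted > latest


# Scenario from the thread: the last stored version is N = 1.
N = 1
# Under total order, N+2 is accepted even though N+1 was never submitted...
assert accepts_version(N, N + 2, strict=False)
# ...and once N+2 is stored, a late N+1 is rejected.
assert not accepts_version(N + 2, N + 1, strict=False)
# Under increment-by-1, the N+2 submission itself would be rejected.
assert not accepts_version(N, N + 2, strict=True)
```

Total order tolerates the gaps that failed runs leave behind, at the cost of visibly missing version numbers; increment-by-1 keeps versions dense but would reject IDs that may already be embedded in assembly and annotation data headers.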