Open ijiraq opened 5 years ago
This is an issue that goes beyond the use cases and requirements that have driven CAOM development so far. The IVOA Provenance DM does cover this kind of use case, where multiple activities and entities connect an input (entity) to an output (entity).
We can do an analysis of CAOM vs Provenance DM and figure out if there is something useful we can use and whether that would entail a minor or major version.
There is work in the IVOA to formalise "one-step" or "last-step" provenance, and the provenance used here to connect a Plane to its inputs is definitely "last-step" provenance.
The issue here is really that this happened:
idealised : entity1 > activity1 > entity2 > activity2 > entity3
but since entity2 was not stored/kept, it's more like
actual: entity1 > {activity1,activity2} > entity3
so there is this composite activity. A composite activity (a bunch of s/w bundled and executed together) is what people actually do (vs an idealised provenance sequence).
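To make the idealised-vs-actual distinction concrete, here is a minimal sketch of how a linear chain collapses when an intermediate entity is not kept. The `Step` class and `collapse` function are hypothetical names made up for illustration; they are not part of CAOM or the IVOA Provenance DM, and the sketch only handles a strictly linear chain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One idealised provenance step (hypothetical type, not a CAOM class)."""
    used: str       # input entity
    activity: str   # activity that consumed the input
    generated: str  # output entity


def collapse(steps, stored):
    """Merge consecutive steps of a linear chain whose intermediate
    entity was not stored, yielding (input, activities, output) records.
    Trailing steps that never reach a stored entity are dropped."""
    collapsed = []
    start = None
    activities = []
    for step in steps:
        if start is None:
            start = step.used
        activities.append(step.activity)
        if step.generated in stored:
            collapsed.append((start, tuple(activities), step.generated))
            start, activities = None, []
    return collapsed


idealised = [
    Step("entity1", "activity1", "entity2"),
    Step("entity2", "activity2", "entity3"),
]

# entity2 was never stored, so the two activities merge into one composite.
actual = collapse(idealised, stored={"entity1", "entity3"})
```

With `entity2` missing from `stored`, the result is a single record whose activity part is the pair `("activity1", "activity2")`, which is the composite activity described above.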
So, does "last-step provenance" have to capture the details of that composite activity? The reason for only having the last step is that it is simple and I'm not at all certain we can/should model a composite activity there. Pretty sure that's a bad idea.
Can easily change the cardinality of Plane.provenance.version from [0..1] to [0..*].
Need to reserve a separator that cannot be used in values (probably | like keywords) for use in relational mapping. Need to clarify that this now contains {software name}-{version} strings and that Plane.provenance.name is now more of a logical name.
So for something simple like "used casa-5.2" one could have
name = casa
versions = casa-5.2
(it might have been version = 5.2 in CAOM 2.4)
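A rough sketch of what the relational mapping could look like if `|` is the reserved separator. The function names `encode_versions` and `decode_versions` are hypothetical and do not exist in any CAOM library; the choice of `|` is only the "probably" suggested above.

```python
SEPARATOR = "|"  # assumed reserved character, as proposed for keywords


def encode_versions(versions):
    """Join {software name}-{version} strings into one column value,
    rejecting any value that contains the reserved separator."""
    for value in versions:
        if SEPARATOR in value:
            raise ValueError(f"separator {SEPARATOR!r} not allowed in {value!r}")
    return SEPARATOR.join(versions)


def decode_versions(column):
    """Split a stored column value back into the list of version strings.
    An empty or missing column maps to an empty list (cardinality 0)."""
    return column.split(SEPARATOR) if column else []


single = encode_versions(["casa-5.2"])
multiple = encode_versions(["casa-XXX", "casa-YYY"])
```

The validation step is the reason the separator has to be reserved: without it, a value containing `|` would silently decode into two entries.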
This would NOT by itself fully specify what each s/w was used for, so in the OP
name = casa
versions = casa-XXX|casa-YYY
would not say which was used for step A and which was used for step B. That would require capturing something like {step}:{software} | {step}:{software}
and conveying the order/sequence of steps in the composite process... that immediately falls apart if the sequence is non-linear (fork-merge, scatter-gather, map-reduce... it happens all the time).
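A small sketch of why a flat {step}:{software} list breaks down for non-linear processes. Everything here is hypothetical: the `flatten`, `parse`, and `is_valid_order` helpers, the step names `calibrate` and `split` (taken from the ALMA example in the OP), and the `A`/`B`/`C` graphs are illustrative only.

```python
def flatten(steps):
    """Encode an ordered list of (step, software) pairs as one string."""
    return "|".join(f"{step}:{software}" for step, software in steps)


def parse(encoded):
    """Decode the flat string; assumes step names contain no ':' or '|'."""
    return [tuple(item.split(":", 1)) for item in encoded.split("|")]


def is_valid_order(order, depends_on):
    """True if every step appears after all the steps it depends on."""
    seen = set()
    for step in order:
        if not depends_on[step] <= seen:
            return False
        seen.add(step)
    return True


# Linear case: the flat encoding carries the sequence well enough.
encoded = flatten([("calibrate", "casa-XXX"), ("split", "casa-YYY")])

# Two different processes, each mapping a step to the steps it depends on.
linear = {"A": set(), "B": {"A"}, "C": {"B"}}      # A > B > C
fork_merge = {"A": set(), "B": set(), "C": {"A", "B"}}  # A and B both feed C

# The same flat order is valid for both graphs, so the flat string
# cannot tell a reader which of the two processes actually ran.
order = ["A", "B", "C"]
ambiguous = is_valid_order(order, linear) and is_valid_order(order, fork_merge)
```

The flat list records an order but not the dependencies, so a linear chain and a fork-merge serialise identically; capturing the difference would need edges, which is no longer "last-step" provenance.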
This sort of relates to Issue opencadc/caom2#66 and the question of cardinality.
Planes are often produced from an ensemble of software, not a single application. In the case of ALMA MS data, in particular, we have MS data that is calibrated using CASA XXX and then split using CASA YYY. The provenance of CASA YYY would tell you that you should use YYY to open these files (MS is not a standard format), but the CASA XXX part is needed to tell you what the calibration system was. In particular, the CASA XXX part is what tells you about calibration trust, while the CASA YYY part is more about data form. How (if at all) should this be expressed in the provenance?