Closed gothub closed 7 years ago
Implemented in commit 9afb17209134d5b5e4a3d7061daed333835f86ac
Leaving this issue open, as @mbjones has mentioned possible changes / enhancements to this function.
During the DataONE sem/prov meeting, we discussed improving the method and argument names. Possible new names mentioned for the method:
The current method signature is:
addRunProvenance(x, scriptId, inIds, outIds, resolveURI)
An improved signature would be:
addRunProvenance(x, scriptId, inputIdentifiers, outputIdentifiers, resolveURI)
or
addRunProvenance(x, scriptId, fromIdentifiers, toIdentifiers, resolveURI)
Notes for this discussion can be viewed here
Here is other alternative inspired from our discussion earlier today:
assertDerivation(x, program=character(NA), sources, derivations, resolveURI=D1_CN_Resolve_URL)
Here, I renamed the arguments to be more descriptive and general. If the program
is not provided, (is.missing(program)
) then the method would devolve to calling recordDerivation()
for the sources
and derivations
lists (i.e., no execution would be created in the prov graph).
We also discussed a version that optionally takes datapack::DataObject
instances or identifiers. The signature might be the same, but in this case, before use, each member of sources
, derivations
, and the program
would be tested for its class and, if it is a DataObject
, then we use the identifier from that DataObject.
It might also make sense to reorder the arguments:
assertDerivation(x, sources, program=character(NA), derivations, resolveURI=D1_CN_Resolve_URL)
Also, we should make it so that sources
or derivations
could be missing.
Let's discuss the options.
The method has been changed to insertDerivation(x, sources, program=as.character(NA), derivation)
The resolveURI
argument has been removed, as serializePackage()
adds a
resolveURI to each package member when the resource map is built. This URI can be passed
in when uploadDataPackage()
is called for example.
All the insertDerivation()
behaviours mentioned in the previous post have been implemented, e.g.
the program
argument can be missing so input -> output derivations only are inserted; sources
and derivations
can be either ids or DataObjects; sources
and/or derivations
can be missing as well.
This update was added in commit 7cf2a3c7a050bd383e28a6d982d71c39a64bda0e
Leaving this issue open for a few days for any further comments.
This looks great, @gothub. I reviewed the implementation last night and I think its mostly ready. There may be a few edge cases to check on (e.g., sources and derivations are passed as an empty list and program is NA), but I think you hit most cases. The main issue now is that I think we still need to do some editing work on the documentation -- its not transparent to a new user what the function does. So some wordsmithing is needed.
As a side issue, we may also want to consider what to do with recordDerivation()
, which is a more limited version of what insertDerivation()
does. I think that recordDerivation()
is the same except that sources must be a single identifier, program is not included, and derivations is a list of length 1 or greater. It does no error checking to see if the identifiers exist, etc. So, it may be better to reimplement recordDerivation as a call to insertDerivations
passing through its arguments, and then deprecate recordDerivations()
for removal in a future library release. Thoughts on that?
@mbjones Thanks for the review. I added checks for the edge cases that you mentioned and updated the unit tests. I'll update the vignette to provide a complete explanation of what this method does and include more examples.
Regarding recordDerivation()
- I think that what you have suggested is the best path, certainly
better that trying to explain in docs when to use each of these methods, when their functionality
overlaps. I'll enter a github issue for this.
@mbjones recordDerivation()
allows asserting a relationship between two D1 identifiers that are not in the current DataPackage, as the example for this method shows:
dp <- recordDerivation(dp, "https://cn.dataone.org/cn/v1/object/doi:1234/_030MXTI009R00_20030812.40.1",
"https://cn.dataone.org/cn/v1/object/doi:1234/_030MXTI009R00_20030812.45.1")
Do we want to continue to support this?
Yes, I htink we should support it. And note that I think insertDerivation
should also support multi-package relationships as well. It will be common for a package to get its inputs from another package.
My concern about supporting connecting possibly any two D1 objects with prov:wasDerivedFrom
is that this may not follow the ProvONE model, as the objects connected with this relationship are shown in the model to be connected to either an Execution or a Workflow. We have been using this relationship to connect Execution inputs and outputs - what if a user provides two package pids (metadata pids) to connect two packages. How can they reasonably determine that the pids they supply are the appropriate Execution inputs and outputs?
I'm not advocating that we don't do this at all, rarther how can this be used effectively without breaking the ProvONE model that we rely on to reason over the relationships associated with a D1 package.
I revised and pushed new documentation for the insertDerivation
function and the DataPackage
class.
Now that insertDerivation
supports inserting prov:wasDerivedFrom
relationships between two
DataONE identifier, or between a package member and a DataONE identifier, I think that work
is complete on this method. This capability was added in commit a2ecbdc625fab6e271c0d0e884903bf9d0094112
This feature was originally going to be implemented in the
dataone
package but instead has been implemented in datapack, as explained in thedataone
issue https://github.com/DataONEorg/rdataone/issues/168