ropensci / datapack

An R package to handle data packages
https://docs.ropensci.org/datapack
44 stars 9 forks source link

Add high level function to add run provenance #64

Closed gothub closed 7 years ago

gothub commented 7 years ago

This feature was originally going to be implemented in the dataone package but instead has been implemented in datapack, as explained in the dataone issue https://github.com/DataONEorg/rdataone/issues/168

gothub commented 7 years ago

Implemented in commit 9afb17209134d5b5e4a3d7061daed333835f86ac

Leaving this issue open, as @mbjones has mentioned possible changes / enhancements to this function.

gothub commented 7 years ago

During the DataONE sem/prov meeting, we discussed improving the method and argument names. Possible new names mentioned for the method:

The current method signature is:

addRunProvenance(x, scriptId, inIds, outIds, resolveURI)

An improved signature would be:

addRunProvenance(x, scriptId, inputIdentifiers, outputIdentifiers, resolveURI)

or

addRunProvenance(x, scriptId, fromIdentifiers, toIdentifiers, resolveURI)

Notes for this discussion can be viewed here

mbjones commented 7 years ago

Here is other alternative inspired from our discussion earlier today:

assertDerivation(x, program=character(NA), sources, derivations, resolveURI=D1_CN_Resolve_URL)

Here, I renamed the arguments to be more descriptive and general. If the program is not provided, (is.missing(program)) then the method would devolve to calling recordDerivation() for the sources and derivations lists (i.e., no execution would be created in the prov graph).

We also discussed a version that optionally takes datapack::DataObject instances or identifiers. The signature might be the same, but in this case, before use, each member of sources, derivations, and the program would be tested for its class and, if it is a DataObject, then we use the identifier from that DataObject.

It might also make sense to reorder the arguments:

assertDerivation(x, sources, program=character(NA), derivations, resolveURI=D1_CN_Resolve_URL)

Also, we should make it so that sources or derivations could be missing. Let's discuss the options.

gothub commented 7 years ago

The method has been changed to insertDerivation(x, sources, program=as.character(NA), derivation) The resolveURI argument has been removed, as serializePackage() adds a resolveURI to each package member when the resource map is built. This URI can be passed in when uploadDataPackage() is called for example.

All the insertDerivation() behaviours mentioned in the previous post have been implemented, e.g. the program argument can be missing so input -> output derivations only are inserted; sources and derivations can be either ids or DataObjects; sources and/or derivations can be missing as well.

This update was added in commit 7cf2a3c7a050bd383e28a6d982d71c39a64bda0e

Leaving this issue open for a few days for any further comments.

mbjones commented 7 years ago

This looks great, @gothub. I reviewed the implementation last night and I think its mostly ready. There may be a few edge cases to check on (e.g., sources and derivations are passed as an empty list and program is NA), but I think you hit most cases. The main issue now is that I think we still need to do some editing work on the documentation -- its not transparent to a new user what the function does. So some wordsmithing is needed.

As a side issue, we may also want to consider what to do with recordDerivation(), which is a more limited version of what insertDerivation() does. I think that recordDerivation() is the same except that sources must be a single identifier, program is not included, and derivations is a list of length 1 or greater. It does no error checking to see if the identifiers exist, etc. So, it may be better to reimplement recordDerivation as a call to insertDerivations passing through its arguments, and then deprecate recordDerivations() for removal in a future library release. Thoughts on that?

gothub commented 7 years ago

@mbjones Thanks for the review. I added checks for the edge cases that you mentioned and updated the unit tests. I'll update the vignette to provide a complete explanation of what this method does and include more examples.

Regarding recordDerivation() - I think that what you have suggested is the best path, certainly better that trying to explain in docs when to use each of these methods, when their functionality overlaps. I'll enter a github issue for this.

gothub commented 7 years ago

@mbjones recordDerivation() allows asserting a relationship between two D1 identifiers that are not in the current DataPackage, as the example for this method shows:

dp <- recordDerivation(dp, "https://cn.dataone.org/cn/v1/object/doi:1234/_030MXTI009R00_20030812.40.1",
                     "https://cn.dataone.org/cn/v1/object/doi:1234/_030MXTI009R00_20030812.45.1")

Do we want to continue to support this?

mbjones commented 7 years ago

Yes, I htink we should support it. And note that I think insertDerivation should also support multi-package relationships as well. It will be common for a package to get its inputs from another package.

gothub commented 7 years ago

My concern about supporting connecting possibly any two D1 objects with prov:wasDerivedFrom is that this may not follow the ProvONE model, as the objects connected with this relationship are shown in the model to be connected to either an Execution or a Workflow. We have been using this relationship to connect Execution inputs and outputs - what if a user provides two package pids (metadata pids) to connect two packages. How can they reasonably determine that the pids they supply are the appropriate Execution inputs and outputs?

I'm not advocating that we don't do this at all, rarther how can this be used effectively without breaking the ProvONE model that we rely on to reason over the relationships associated with a D1 package.

mbjones commented 7 years ago

I revised and pushed new documentation for the insertDerivation function and the DataPackage class.

gothub commented 7 years ago

Now that insertDerivation supports inserting prov:wasDerivedFrom relationships between two DataONE identifier, or between a package member and a DataONE identifier, I think that work is complete on this method. This capability was added in commit a2ecbdc625fab6e271c0d0e884903bf9d0094112