ropensci / datapack

An R package to handle data packages
https://docs.ropensci.org/datapack
44 stars 9 forks source link

updating targetPath fails to update relationships #137

Open mbjones opened 1 year ago

mbjones commented 1 year ago

The use of prov:atLocation in packages as implemented in #76 makes use of a targetPath slot on each DataObject. When an DataObject with targetPath set is added to a package with addMember(), the function knows to add a prov:atLocation relationship to the graph for that object. However, if the targetPath of the DO is later updated, the relationships in the package remain untouched, resulting in a conflicting state. Here's some example code showing how targetPath and the relationships table can be out of sync:

library(dataone)
library(datapack)

# Create a package with a DO and a targetPath
data <- charToRaw("1,2,3\n4,5,6\n")
targetPath <- "myData/time-trials/trial_data.csv"
do <- new("DataObject", "id1", dataobj=data, "text/csv", 
          "http://orcid.org/xxxx-yyyy-zzzz", "urn:node:KNB", 
          targetPath=targetPath)
do@targetPath

# Show the targetPath for the DO
dpkg <- new("DataPackage")
dpkg <- addMember(dpkg, do)
existingDO <- getMember(dpkg, "id1")
existingDO@targetPath

# Change the targetPath and show it
dpkg <- setValue(dpkg, name="targetPath", value="newdir/new_trial_data.csv", identifiers=c("id1"))
existingDO <- getMember(dpkg, "id1")
existingDO@targetPath

# Display the relationships table for the package, note the incorrect atLocation
getRelationships(dpkg)

The fix to this might be to synchronize targetPath and relationships, such that a change to one results in an update in state on the other. For example, if a prov:atLocation relationship is added via insertRelationship() (or removed), then the appropriate targetPath should be updated (assuming it can be found -- would it be an error if that entity is not in the package?). And, if targetPath is updated using setValue() then the relationships should be updated. The challenge lies in consistently determining when one or the other has been updated, especially if they get changed through other methods than those two, or if the slots are manipulated directly.

An alternate solution might be adding an updatePath(pkg, id, path), and use that in preference to the other methods discussed above. cc @jeanetteclark

jeanetteclark commented 1 year ago

Thanks for the summary @mbjones. I see the same thing you do.

I like the updatePath method idea but we could also probably modify setValue such that if the targetPath is manipulated the relationships are updated. I like the way the function call would look for updatePath, but I'm not convinced we need a new method.

We should stop using insertRelationship I think - it is too likely to end up inserting a bogus relationship that makes the resource map invalid or otherwise non-sensical.

And I do think that it should error if the entity is not in the package - I saw an example of that yesterday, and it fails sort of silently (bagit doesn't download correctly, but package otherwise looks fine).