ropensci / datapack

An R package to handle data packages
https://docs.ropensci.org/datapack

Visualize provenance relationships in a DataPackage #65

Open gothub opened 7 years ago

gothub commented 7 years ago

A user has requested the ability to create a graphic (i.e., a DAG) of the provenance relationships in a DataPackage. This would be useful while a DataPackage is being constructed, to verify that the prov relationships are correct. It is currently possible to inspect the relationships in the data frame returned by getRelationships(), but a graph would be easier to read.

gothub commented 7 years ago

The output should present the relationships in a transformed form that uses RDF namespace prefixes instead of full predicate URIs, and filenames instead of PIDs. For example, this:

sev.1.file1 cito:isDocumentedBy metadata.xml

instead of this:

urn:uuid:5f8f72b2-40b1-4da9-ba5a-b3dccf0b526f http://purl.org/spar/cito/isDocumentedBy urn:uuid:5f8f72b2-40b1-4da9-ba5a-b3dccf0b526f

gothub commented 7 years ago

Added a condense parameter to getRelationships(). With condense=TRUE, it returns a version of the package relationships that uses namespace prefixes for known namespaces and, when possible, the filename of a DataObject instead of its identifier. For a sample DataPackage, the full relationships look like this:

> getRelationships(dp)
                                        subject                                predicate                                        object subjectType
4                                    execution1           http://www.w3.org/ns/prov#used                                     scidataId        <NA>
1                                     scimetaId      http://purl.org/spar/cito/documents urn:uuid:4305b0e7-eb75-4e90-a6c3-fe103feccfb5        <NA>
2 urn:uuid:4305b0e7-eb75-4e90-a6c3-fe103feccfb5 http://purl.org/spar/cito/isDocumentedBy                                     scimetaId        <NA>
3 urn:uuid:4305b0e7-eb75-4e90-a6c3-fe103feccfb5 http://www.w3.org/ns/prov#wasDerivedFrom                                     scidataId        <NA>
5 urn:uuid:4305b0e7-eb75-4e90-a6c3-fe103feccfb5 http://www.w3.org/ns/prov#wasGeneratedBy                                    execution1        <NA>
6                                 urn:uuid:abcd      http://www.w3.org/ns/prov#startedAt                  Wed Mar 18 06:26:44 PDT 2015         uri
  objectType                             dataTypeURI
4       <NA>                                    <NA>
1       <NA>                                    <NA>
2       <NA>                                    <NA>
3       <NA>                                    <NA>
5       <NA>                                    <NA>
6    literal http://www.w3.org/2001/XMLSchema#string

and the condensed relationships would be:

  subject                predicate             object                        
4 "execution1"           "prov:used"           "scidataId"                   
1 "scimetaId"            "cito:documents"      "file3509163e759c.csv"        
2 "file3509163e759c.csv" "cito:isDocumentedBy" "scimetaId"                   
3 "file3509163e759c.csv" "prov:wasDerivedFrom" "scidataId"                   
5 "file3509163e759c.csv" "prov:wasGeneratedBy" "execution1"                  
6 "urn:uuid:abcd"        "prov:startedAt"      "Wed Mar 18 06:26:44 PDT 2015"

Added in commit 0cce119eb13a65b21241976149a94fef03ecee5a

While this is not a 'visualization' of the prov relationships, per se, it does make them easier to view and understand.
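
The condensing described above can be sketched in base R. This is a minimal illustration, not the code from the commit: the prefix table, the filename lookup, and the helper names condense_uri and condense_id are all hypothetical.

```r
# Hypothetical prefix table; only the two namespaces seen in this thread.
known_ns <- c(
  prov = "http://www.w3.org/ns/prov#",
  cito = "http://purl.org/spar/cito/"
)

# Replace a leading known namespace URI with its prefix ("prov:", "cito:").
# Values that match no known namespace are returned unchanged.
condense_uri <- function(x, ns = known_ns) {
  for (prefix in names(ns)) {
    hit <- !is.na(x) & startsWith(x, ns[[prefix]])
    if (any(hit)) {
      local_name <- substring(x[hit], nchar(ns[[prefix]]) + 1)
      x[hit] <- paste0(prefix, ":", local_name)
    }
  }
  x
}

# Hypothetical identifier-to-filename lookup for the DataObjects in a package.
filenames <- c(
  "urn:uuid:4305b0e7-eb75-4e90-a6c3-fe103feccfb5" = "file3509163e759c.csv"
)

# Swap an identifier for its filename when one is known; keep it otherwise.
condense_id <- function(x, lookup = filenames) {
  hit <- x %in% names(lookup)
  x[hit] <- unname(lookup[x[hit]])
  x
}

condense_uri("http://www.w3.org/ns/prov#wasDerivedFrom")
condense_id(c("scimetaId", "urn:uuid:4305b0e7-eb75-4e90-a6c3-fe103feccfb5"))
```

Identifiers without a known filename, such as scimetaId, pass through untouched, which matches the mixed output in the condensed table above.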

mbjones commented 7 years ago

The simplest way to do this visualization would be to convert the relationships into a data set that can be read by igraph, network, or dot packages. There's a nice tutorial online (http://kateto.net/network-visualization).
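
As a rough sketch of the conversion step, the condensed relationships can be written out as a Graphviz DOT description using base R only. The data frame below is typed in by hand from the sample output earlier in this thread, and relationships_to_dot is a hypothetical helper name, not part of datapack.

```r
# Condensed relationships copied from the sample output earlier in this thread.
rels <- data.frame(
  subject   = c("execution1", "scimetaId", "file3509163e759c.csv",
                "file3509163e759c.csv", "file3509163e759c.csv"),
  predicate = c("prov:used", "cito:documents", "cito:isDocumentedBy",
                "prov:wasDerivedFrom", "prov:wasGeneratedBy"),
  object    = c("scidataId", "file3509163e759c.csv", "scimetaId",
                "scidataId", "execution1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: build a DOT digraph with one labelled edge per row.
# Assumes the values contain no double quotes, so no escaping is done.
relationships_to_dot <- function(rels, name = "prov") {
  quote_id <- function(x) paste0('"', x, '"')
  edges <- sprintf("  %s -> %s [label=%s];",
                   quote_id(rels$subject),
                   quote_id(rels$object),
                   quote_id(rels$predicate))
  paste(c(sprintf("digraph %s {", name), edges, "}"), collapse = "\n")
}

dot <- relationships_to_dot(rels)
cat(dot, "\n")
```

The resulting text can be saved to a file and rendered with the Graphviz command-line tools, which keeps the package itself free of a plotting dependency.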

taddallas commented 7 years ago

I've made a first pass at adding the functionality to the getRelationships function by adding a plot argument, which defaults to FALSE. Please feel free to suggest changes or tweak as needed. I've added the igraph package as an import, and changed the vignette to generate the prov graph (line 272 of datapack-overview.Rmd).

See forked package

provGraph.pdf

The vertex and edge labels sometimes overlap, but I'm not sure how to programmatically solve that issue. Any help would be appreciated.

gothub commented 6 years ago

Hi @taddallas, thanks for the contribution! The graph you generated looks good. One suggestion: have you considered putting this plotting code into a separate function called something like 'plotRelationships'? It does make sense to put it in 'getRelationships', because the plot you return is a representation of the package relationships. But if it were in a separate function, there could be arguments to control plotting parameters, and you could have the option to plot the graph immediately, or to return it or write it out in a standard graphics format.

taddallas commented 6 years ago

I've separated the plotting function, but have still included an argument in getRelationships for plotting. Feel free to remove this. Also feel free to edit the functionality and the documentation. I'm just learning the S4 object referencing (setMethod, signature, etc.) system, so there may be some mistakes in how I've set things up. See files in pull request #90

mbjones commented 6 years ago

Thanks so much! I agree with @gothub that the plotting should be its own function, and that getRelationships should only return its data, without side effects. So let's please remove the plot argument from getRelationships in favor of using the separate function.

taddallas commented 6 years ago

Sounds good. I've removed the plotting functionality and argument from getRelationships, and now plotRelationships takes a DataPackage object, runs getRelationships with condense=TRUE, and then visualizes the relationships using igraph.
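
The shape of such a function might look roughly like the following. This is a sketch under stated assumptions, not the code from the pull request: plot_relationships_sketch is a hypothetical name, and it takes an already condensed data frame rather than a DataPackage so that it runs with igraph alone. The optional file argument reflects the earlier suggestion to either plot immediately or write the graph out.

```r
library(igraph)

# Condensed relationships copied from the sample output earlier in this thread.
rels <- data.frame(
  subject   = c("execution1", "scimetaId", "file3509163e759c.csv",
                "file3509163e759c.csv", "file3509163e759c.csv"),
  predicate = c("prov:used", "cito:documents", "cito:isDocumentedBy",
                "prov:wasDerivedFrom", "prov:wasGeneratedBy"),
  object    = c("scidataId", "file3509163e759c.csv", "scimetaId",
                "scidataId", "execution1"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for plotRelationships().
# Plots on the current device, or into a PDF when `file` is given,
# and returns the igraph object invisibly so callers can restyle it.
plot_relationships_sketch <- function(rels, file = NULL, ...) {
  # graph_from_data_frame() reads the first two columns as source and target;
  # remaining columns (here the predicate) become edge attributes.
  edges <- rels[, c("subject", "object", "predicate")]
  g <- graph_from_data_frame(edges, directed = TRUE)

  if (!is.null(file)) {
    pdf(file)
    on.exit(dev.off(), add = TRUE)
  }
  plot(g, edge.label = E(g)$predicate, ...)
  invisible(g)
}

g <- plot_relationships_sketch(rels, file = file.path(tempdir(), "provGraph.pdf"))
```

Extra arguments are passed through to igraph's plot method, so layout and label settings can be adjusted per call when vertex and edge labels overlap.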