ropensci / datapack

An R package to handle data packages
https://docs.ropensci.org/datapack

Handle dc:creator in resource map properly #116

Closed gothub closed 3 years ago

gothub commented 4 years ago

When a resource map is updated via uploadDataPackage, the RDF triple containing dc:creator is not handled properly. Here is an example, with PISCO resourceMap_marine_ltm.9.2 (created by the DataONE Java client library) as the existing resource map and resourceMap_marine_ltm.9.3 (created by R dataone) as the improperly serialized map.

resourceMap_marine_ltm.9.2:

  <rdf:Description rdf:about="https://cn.dataone.org/cn/v1/resolve/resourceMap_marine_ltm.9.2">
    <dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2018-03-01T12:39:18.598-08:00</dcterms:modified>
    <ore:describes rdf:resource="https://cn.dataone.org/cn/v1/resolve/resourceMap_marine_ltm.9.2#aggregation"/>
    <rdf:type rdf:resource="http://www.openarchives.org/ore/terms/ResourceMap"/>
    <dc:creator rdf:nodeID="A0"/>
    <dcterms:identifier rdf:datatype="http://www.w3.org/2001/XMLSchema#string">resourceMap_marine_ltm.9.2</dcterms:identifier>
  </rdf:Description>

resourceMap_marine_ltm.9.3 (just the dc:creator triple):

  <rdf:Description rdf:about="resourceMap_marine_ltm.9.2">
    <dc:creator rdf:nodeID="r1593203265r10816r1"/>
  </rdf:Description>

This latter triple causes DataONE indexing (Java Jena) to throw an exception because the subject must be a URI, but the DataONE resolve URL has been improperly stripped out. The triple should instead be:

  <rdf:Description rdf:about="https://cn.dataone.org/cn/v1/resolve/resourceMap_marine_ltm.9.3">
    <dc:creator rdf:nodeID="r1593203265r10816r1"/>
  </rdf:Description>
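The check that trips the indexer can be approximated outside Jena. The following is a minimal sketch, not the indexer's actual logic: it assumes the resource map is serialized as rdf:Description elements and treats any rdf:about value without a URI scheme as suspect. The function name relative_subjects and the SAMPLE document are hypothetical, built from the two snippets above.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def relative_subjects(rdf_xml):
    """Return the rdf:about values that have no URI scheme (not absolute URIs)."""
    root = ET.fromstring(rdf_xml)
    bad = []
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        about = desc.get(f"{{{RDF_NS}}}about")
        # Blank-node descriptions use rdf:nodeID and have no rdf:about; skip them.
        if about is not None and not urlparse(about).scheme:
            bad.append(about)
    return bad


# Hypothetical document combining the broken and the corrected triple.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="resourceMap_marine_ltm.9.2">
    <dc:creator rdf:nodeID="r1593203265r10816r1"/>
  </rdf:Description>
  <rdf:Description rdf:about="https://cn.dataone.org/cn/v1/resolve/resourceMap_marine_ltm.9.3">
    <dc:creator rdf:nodeID="r1593203265r10816r1"/>
  </rdf:Description>
</rdf:RDF>"""

print(relative_subjects(SAMPLE))
```

Only the bare identifier is reported; the subject carrying the full resolve URL passes. A parser that is given a base URI may resolve a relative rdf:about instead of rejecting it, so this is a lint-style check rather than a statement about RDF/XML validity.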

First of all, the Java client library is using 'dc:creator', which should be 'dcterms:creator'. The best solution is to remove the triple with 'dc:creator', as the R client already writes its own 'dcterms:creator' triple.

The R client replaces the blank node elements from the original dc:creator when it creates the triples for dcterms:creator. Here are the original and the new.

resourceMap_marine_ltm.9.2 (from Java client):

  <rdf:Description rdf:nodeID="A0">
    <foaf:name rdf:datatype="http://www.w3.org/2001/XMLSchema#string">DataONE Java Client Library</foaf:name>
    <rdf:type rdf:resource="http://purl.org/dc/terms/Agent"/>
  </rdf:Description>

resourceMap_marine_ltm.9.3 (from R dataone):

  <rdf:Description rdf:about="https://cn.dataone.org/cn/v2/resolve/resourceMap_marine_ltm.9.3">
    <dcterms:creator rdf:nodeID="_287a3ffd-f1db-46e0-840e-7625eb96918b"/>
  </rdf:Description>
  <rdf:Description rdf:nodeID="_287a3ffd-f1db-46e0-840e-7625eb96918b">
    <rdf:type rdf:resource="http://purl.org/dc/terms/Agent"/>
  </rdf:Description>
  <rdf:Description rdf:nodeID="_287a3ffd-f1db-46e0-840e-7625eb96918b">
    <foaf:name rdf:datatype="http://www.w3.org/2001/XMLSchema#string">DataONE R Client</foaf:name>
  </rdf:Description>

... so this doesn't need to change.

All that needs to happen is for the R dataone client (datapack) to drop the dc:creator triple.
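To illustrate the intended effect (this is not the datapack implementation, which is in R and landed in the commit referenced below), here is a minimal Python sketch that removes dc:creator property elements from an RDF/XML resource map and leaves dcterms:creator alone. It assumes the flat rdf:Description serialization shown in this issue; drop_dc11_creator and SAMPLE are hypothetical names.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC11_NS = "http://purl.org/dc/elements/1.1/"
DCTERMS_NS = "http://purl.org/dc/terms/"


def drop_dc11_creator(rdf_xml):
    """Parse RDF/XML and remove every dc:creator (elements/1.1) property element."""
    root = ET.fromstring(rdf_xml)
    for desc in list(root):
        removed = False
        for prop in list(desc):
            if prop.tag == f"{{{DC11_NS}}}creator":
                desc.remove(prop)
                removed = True
        # Drop a description that held nothing but the removed triple.
        if removed and len(desc) == 0:
            root.remove(desc)
    return root


# Hypothetical input: the stale dc:creator triple plus the R client's dcterms:creator.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:dcterms="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="resourceMap_marine_ltm.9.2">
    <dc:creator rdf:nodeID="r1593203265r10816r1"/>
  </rdf:Description>
  <rdf:Description rdf:about="https://cn.dataone.org/cn/v2/resolve/resourceMap_marine_ltm.9.3">
    <dcterms:creator rdf:nodeID="_287a3ffd-f1db-46e0-840e-7625eb96918b"/>
  </rdf:Description>
</rdf:RDF>"""

cleaned = drop_dc11_creator(SAMPLE)
print(ET.tostring(cleaned, encoding="unicode"))
```

The sketch does not remove the blank-node description that the dropped triple pointed to (the Agent and foaf:name statements); a real fix would likely want to clean those up as well.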

gothub commented 3 years ago

Fixed in commit eed92cc82cd5fd3cb6835b578120af44190bb455

mbjones commented 3 years ago

For posterity, we should be using the dcterms vocabulary defined by the http://purl.org/dc/terms/ namespace, and not use the historical element set namespace (aka dc11, defined at http://purl.org/dc/elements/1.1/) at all. This is because 1) the dcterms terms are the more modern definition and include all of the elements and more, and 2) where there are identical concepts in terms and elements, the terms concept is defined as a subproperty of the element concept. So, for example, dcterms:creator rdfs:subPropertyOf dc11:creator. Inferencing agents can therefore use terms concepts anywhere a dc11:creator is expected, and queries for dc11:creator will resolve both. The reverse is not true: a query for dcterms:creator will not match a dc11:creator triple. StackOverflow has a great summary of these two namespaces: https://stackoverflow.com/a/47523514/4200841
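The one-directional behaviour of rdfs:subPropertyOf can be shown with a toy matcher. This is a minimal sketch in plain Python, not a real inferencing engine: the SUB_PROPERTY_OF table encodes only the dcterms:creator relationship mentioned above, and the helper names and sample triples are hypothetical.

```python
DC11_CREATOR = "http://purl.org/dc/elements/1.1/creator"
DCTERMS_CREATOR = "http://purl.org/dc/terms/creator"

# Subproperty -> superproperty, i.e. one rdfs:subPropertyOf statement.
SUB_PROPERTY_OF = {DCTERMS_CREATOR: DC11_CREATOR}


def superproperties(prop):
    """Yield prop and every property it is (transitively) a subproperty of."""
    seen = set()
    while prop is not None and prop not in seen:
        seen.add(prop)
        yield prop
        prop = SUB_PROPERTY_OF.get(prop)


def match(triples, predicate):
    """Return triples whose predicate is `predicate` or a subproperty of it."""
    return [t for t in triples if predicate in superproperties(t[1])]


# Hypothetical triples: one written with each vocabulary.
triples = [
    ("resourceMap_A", DCTERMS_CREATOR, "_:agent1"),
    ("resourceMap_B", DC11_CREATOR, "_:agent2"),
]

print(len(match(triples, DC11_CREATOR)))     # matches both triples
print(len(match(triples, DCTERMS_CREATOR)))  # matches only the dcterms triple
```

Asking for dc11:creator finds statements made with either property, while asking for dcterms:creator finds only the dcterms statement, which is why writing dcterms terms is the safer choice.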

That said, our parsers should be robust enough not to balk when additional properties from any other namespace are encountered. So there is likely an indexer bug here as well, in that it is too sensitive to the presence of extra information.