ropensci / datapack

An R package to handle data packages
https://docs.ropensci.org/datapack

Remove dependency on redland `getNextResult` #110

Closed: gothub closed this issue 4 years ago

gothub commented 4 years ago

The getNextResult() function has been removed from the redland R package. This function is used by the datapack getTriples() function.

To remove this dependency, redland getResults() will be called and the resulting RDF/XML document will be parsed.

Implementation Note: returning a CSV result and converting it to a data.frame is far more convenient, but the conversion to CSV by the redland C library drops a lot of the RDF typing information. For example, the RDF/XML version allows a program to distinguish between an RDF blank node and an RDF URI node, e.g.

rdf:nodeID="r1570556175r60221r1"

vs

rdf:resource="https://cn.dataone.org/cn/v2/resolve/urn%3Auuid%3A615206e1-e172-43e7-99ec-3de618690460"

Typing the nodes correctly affects both parsing and the relationships that are output.
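To illustrate why the RDF/XML form keeps the typing information, here is a minimal sketch in Python using only the standard library. It is not the datapack implementation (which is written in R), and it only handles the simple flat `rdf:Description` layout. The sample document, the `ex:` namespace, the subject URI, and the helper name `typed_objects` are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical sample document: the ex: namespace and the subject URI are
# invented for this sketch; only the two attribute values come from the issue.
SAMPLE = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://example.org/terms#">
  <rdf:Description rdf:about="http://example.org/subject">
    <ex:relatedTo rdf:nodeID="r1570556175r60221r1"/>
    <ex:relatedTo rdf:resource="https://cn.dataone.org/cn/v2/resolve/urn%3Auuid%3A615206e1-e172-43e7-99ec-3de618690460"/>
    <ex:title>plain literal</ex:title>
  </rdf:Description>
</rdf:RDF>
"""


def typed_objects(rdf_xml):
    """Return (node_type, value) for every property element in the document."""
    node_id_attr = "{%s}nodeID" % RDF_NS
    resource_attr = "{%s}resource" % RDF_NS
    results = []
    root = ET.fromstring(rdf_xml)
    for description in root.iter("{%s}Description" % RDF_NS):
        for prop in description:
            if node_id_attr in prop.attrib:
                # rdf:nodeID marks the object as a blank node
                results.append(("blank", prop.attrib[node_id_attr]))
            elif resource_attr in prop.attrib:
                # rdf:resource marks the object as a URI node
                results.append(("uri", prop.attrib[resource_attr]))
            else:
                # Anything else is treated as a literal in this sketch
                results.append(("literal", (prop.text or "").strip()))
    return results


for node_type, value in typed_objects(SAMPLE):
    print(node_type, value)
```

In a CSV rendering both object values would arrive as plain strings, so the `blank` versus `uri` distinction made by the attribute name would have to be guessed from the value itself.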

gothub commented 4 years ago

Dependency on redland::getNextResult() removed in commit de4ad5a4fcd8849b7971d36fda22f8da4e9fb922

The getTriples() function was tested via the test.ResourceMap.R unit test. Other tests included parsing large DataONE resource maps with getTriples() and comparing the resulting triple sets (via output .csv files) between datapack versions 1.3.1 and 1.3.2. The only difference was that the redland C library call used in redland::getResults() removes the leading _ in blank node names, for reasons that are unclear. It does this consistently, so the resulting RDF/XML files are valid.
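A comparison of this kind can be sketched as follows, again in Python with only the standard library. This is not the script used for the tests described above. The three-column CSV layout (subject, predicate, object, no header row) and the helper names `normalize` and `load_triples` are assumptions, and stripping a single leading underscore from every term is a simplification that would also alter a literal starting with `_`.

```python
import csv
import io


def normalize(term):
    # Drop one leading "_" so blank node names from both versions line up.
    # Assumption: no URI or literal of interest starts with "_".
    return term[1:] if term.startswith("_") else term


def load_triples(csv_text):
    """Read subject,predicate,object rows into a set of normalized tuples."""
    reader = csv.reader(io.StringIO(csv_text))
    return {tuple(normalize(cell) for cell in row) for row in reader if row}


# Hypothetical output from the two versions; the values are made up.
old_csv = "_r1r2r3,http://example.org/p,http://example.org/o\n"
new_csv = "r1r2r3,http://example.org/p,http://example.org/o\n"

difference = load_triples(old_csv) ^ load_triples(new_csv)
print(len(difference))
```

An empty symmetric difference means the two triple sets agree once the blank node naming is accounted for. Any tuple left in `difference` appears in only one of the two versions.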