traitecoevo / APD

The Australian Plant Traits Dictionary
https://traitecoevo.github.io/APD/

minor patches for RDF compliance #3

Closed cboettig closed 1 year ago

cboettig commented 1 year ago

Nice work @ehwenk, this looks great. I tweaked the R code lightly to fix some minor RDF issues.

Instead of writing to CSV, we now write the three columns in the N-Quads serialization. (OK, technically four columns: "quads" adds an optional "graph" column. We leave it empty, so each line simply ends in ".", meaning all these triples are part of the same default graph.)
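
To make the layout concrete, here is a minimal base-R sketch of turning a three-column table into N-Quads lines. The data frame, its column names, the `to_nquads()` helper and the example.org IRIs are all placeholders for illustration, not the code in this PR; it also assumes each term is already formatted (IRIs in angle brackets, literals in double quotes).

```r
# Hypothetical triples; column names and example.org IRIs are placeholders.
triples <- data.frame(
  subject   = c("<http://example.org/trait_1>", "<http://example.org/trait_1>"),
  predicate = c("<http://www.w3.org/2000/01/rdf-schema#label>",
                "<http://purl.org/datacite/v4.4/IsReviewedBy>"),
  object    = c('"leaf area"', "<http://example.org/reviewer_1>"),
  stringsAsFactors = FALSE
)

# One statement per line: subject, predicate, object, then the closing ".".
# No graph label is written, so every statement lands in the default graph.
to_nquads <- function(df) {
  paste(df$subject, df$predicate, df$object, ".")
}

nq_lines <- to_nquads(triples)
writeLines(nq_lines, file.path(tempdir(), "example.nq"))
```

A real writer would also need to escape quotes and backslashes inside literals, which this sketch skips.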

This is nearly there. One remaining thing is that N-Quads, being a trivially simple format, doesn't support prefixes, so it looks like the <xsd:string> URIs will have to be expanded to absolute URLs instead. Not sure if there are any other prefixes.
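
One possible way to do that expansion, sketched in base R. The `expand_prefixes()` helper and the prefix table are hypothetical; the only namespace filled in is the standard XML Schema one for `xsd`, and any other prefixes in the data would need to be added by hand.

```r
# Prefix table: names are the prefixes, values the namespace IRIs.
# Only xsd is listed; other prefixes (if any) are an open question.
prefixes <- c(xsd = "http://www.w3.org/2001/XMLSchema#")

# Rewrite every <prefix:local> term as <namespace + local>.
expand_prefixes <- function(x, prefixes) {
  for (p in names(prefixes)) {
    pattern     <- paste0("<", p, ":([^<>]*)>")
    replacement <- paste0("<", prefixes[[p]], "\\1>")
    x <- gsub(pattern, replacement, x)
  }
  x
}

line <- paste(
  "<http://example.org/trait_1>",
  "<http://www.w3.org/2000/01/rdf-schema#label>",
  '"leaf area"^^<xsd:string>',
  "."
)
expanded <- expand_prefixes(line, prefixes)
```

Terms that are already absolute, such as the subject and predicate above, do not start with `<xsd:` and are left untouched.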

I added a few 'smoke test' SPARQL queries at the end. SPARQL is kinda like SQL, but supports this cool trick where you can write queries that let you walk the graph. You probably won't use it, but it can be kinda cool; see the examples at the end of the R file.

cboettig commented 1 year ago

@ehwenk I updated the PR above to handle UTF-8 chars via Unicode escapes, which is maybe a bit ugly but lossless, e.g.

library(rdflib)  # read_nquads(), rdf_query()
library(dplyr)   # %>% and mutate()

true_triples <- read_nquads("data/ADP.nq")

# Turn "<U+00E9>"-style escapes into "\u00E9", then let stringi decode them
unescape_unicode <- function(x) {
  stringi::stri_unescape_unicode(gsub("<U\\+(....)>", "\\\\u\\1", x))
}

# example query: two triple patterns joined on ?orcid
sparql <-
'SELECT DISTINCT ?orcid ?label
 WHERE { ?s <http://purl.org/datacite/v4.4/IsReviewedBy> ?orcid .
         ?orcid <http://www.w3.org/2000/01/rdf-schema#label> ?label
       }
'
rdf_query(true_triples, sparql) %>%
  mutate(label = unescape_unicode(label)) # replace unicode with proper accented characters
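
To show what the unescaping step does on its own, here is a rough base-R equivalent that needs no extra packages. This is a swapped-in alternative for illustration, not the helper used in the PR; the function name is made up, and like the regex above it only handles the four-hex-digit `<U+XXXX>` form.

```r
# Base-R sketch: replace each "<U+XXXX>" token with the character it encodes.
unescape_unicode_base <- function(x) {
  m <- gregexpr("<U\\+[0-9A-Fa-f]{4}>", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(tokens) {
    hex <- substr(tokens, 4, 7)  # the four hex digits between "<U+" and ">"
    vapply(hex, function(h) intToUtf8(strtoi(h, 16L)),
           character(1), USE.NAMES = FALSE)
  })
  x
}

# "<U+00E9>" is the escape for a lower-case e with an acute accent
decoded <- unescape_unicode_base(c("Jos<U+00E9>", "plain ascii"))
```

Strings without any escapes pass through unchanged, so it should be safe to apply to a whole column.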

dfalster commented 1 year ago

Thanks @cboettig - great input!

@ehwenk I'll leave you to merge PRs when ready!