Tools to facilitate serializing/parsing of RDFa in additionalMetadata

cboettig commented 10 years ago

This issue builds on the ideas discussed in #5. We can broadly separate out three use-cases for RDFa annotations:

1. Extending EML. If we want to provide additional machine-readable / structured metadata that cannot be expressed in EML, we can always add this information in an additionalMetadata section. (Need not be semantic, could just be XML). This is already illustrated in #50.
2. Providing semantic versions of EML terms that can be understood by more generic tools. We could duplicate information that is already specified elsewhere in the EML, but express it here in RDFa (e.g. dc:title, dc:creator, etc). This would allow a generic RDFa distiller to extract the metadata into a triples library where it could be easily queried with SPARQL tools. On the other hand, this feels a bit like a hack -- perhaps an XSLT conversion of (some subset of?) EML to RDF would make more sense?
3. Adding semantic meaning to EML metadata fields that are currently expressed only in free-form text. This is the real use case from #5 and #8, allowing us to provide semantic definitions of units and measurements such that we can reason with them, e.g.

<additionalMetadata>
  <describes>1838</describes> <!-- reference the attribute's id -->
  <metadata>
    <subject about="http://some.namespace#1838" xmlns:o="http:/oboe-core#">
      <meta property="o:entity" content="Air" datatype="xsd:string"/>
      <meta property="o:characteristic" content="Temperature" datatype="xsd:string"/>
      <meta property="o:unit" content="Celsius" datatype="xsd:string"/>
    </subject>
  </metadata>
</additionalMetadata>
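
As a rough illustration of the "just read it as XML" route, here is a minimal Python sketch using only the standard library. It is not the reml implementation: the helper name `read_annotations` is hypothetical, and the embedded snippet simply repeats the annotation above, placeholder namespace included.

```python
import xml.etree.ElementTree as ET

# The annotation from the example above, embedded as a string for illustration.
SNIPPET = """
<additionalMetadata>
  <describes>1838</describes>
  <metadata>
    <subject about="http://some.namespace#1838" xmlns:o="http:/oboe-core#">
      <meta property="o:entity" content="Air" datatype="xsd:string"/>
      <meta property="o:characteristic" content="Temperature" datatype="xsd:string"/>
      <meta property="o:unit" content="Celsius" datatype="xsd:string"/>
    </subject>
  </metadata>
</additionalMetadata>
"""

def read_annotations(xml_text):
    """Return {described attribute id: {property: content}}.

    Hypothetical helper: it treats the RDFa attributes as ordinary XML
    attributes and does no RDF processing at all.
    """
    root = ET.fromstring(xml_text.strip())
    # Accept a bare <additionalMetadata> or a document containing several.
    if root.tag == "additionalMetadata":
        blocks = [root]
    else:
        blocks = list(root.iter("additionalMetadata"))
    result = {}
    for block in blocks:
        described = block.findtext("describes")
        result[described] = {
            meta.get("property"): meta.get("content")
            for meta in block.iter("meta")
        }
    return result

annotations = read_annotations(SNIPPET)
print(annotations["1838"]["o:unit"])
```

Keying the result on the `describes` value keeps the link back to the EML attribute id, which is the piece a plain RDFa view of the `subject` element would not carry.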

emhart commented 10 years ago

Is the idea here that we embed the semantics in the meta tags? Or would we be able to reference the semantics elsewhere in a separate file?

cboettig commented 10 years ago

@emhart Both. The above gets embedded in the EML in an additionalMetadata element. Then we can parse it as XML content when we read.eml, or we can pipe the whole darned EML file through an RDFa distiller, and out will come an RDF version of this metadata (in whatever format we want: N3, Turtle, etc., but we'll use RDF/XML to illustrate). We can then explore that with whatever semantic tools we have handy to chew on RDF. For instance, I illustrate both XML parsing and SPARQL queries over the RDF on a minimal EML file in this example/test:

https://github.com/ropensci/reml/blob/0ac91203026779f3a89d5cc42470faaab879a82a/inst/tests/test_semantics.R (updated link, fixed first xpath query)

enjoy!
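
For readers without an RDF toolchain at hand, the distill-then-query idea can be approximated in a few lines of standard-library Python. This is only a toy stand-in: `distill` and `match` are hypothetical helpers, not a real RDFa processor or SPARQL engine, and the namespace is the placeholder from the example at the top of the thread.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """
<metadata>
  <subject about="http://some.namespace#1838" xmlns:o="http:/oboe-core#">
    <meta property="o:entity" content="Air"/>
    <meta property="o:characteristic" content="Temperature"/>
    <meta property="o:unit" content="Celsius"/>
  </subject>
</metadata>
"""

def distill(xml_text, prefixes):
    """Toy 'distiller': return (subject, predicate, object) tuples.

    Assumes every <meta> sits under a <subject about="..."> and that
    the caller supplies the prefix-to-namespace map.
    """
    triples = []
    root = ET.fromstring(xml_text.strip())
    for subject in root.iter("subject"):
        about = subject.get("about")
        for meta in subject.iter("meta"):
            prefix, _, local = meta.get("property").partition(":")
            triples.append((about, prefixes[prefix] + local, meta.get("content")))
    return triples

def match(triples, s=None, p=None, o=None):
    """Return triples fitting a pattern; None plays the role of a variable."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

OBOE = "http:/oboe-core#"  # placeholder namespace copied from the example
triples = distill(FRAGMENT, {"o": OBOE})

# Roughly: SELECT ?attribute WHERE { ?attribute o:unit "Celsius" }
celsius = [s for s, _, _ in match(triples, p=OBOE + "unit", o="Celsius")]
print(celsius)
```

A real distiller also handles nested subjects, datatypes, and language tags, so this is meant only to show the shape of the data that comes out.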

mbjones commented 10 years ago

@emhart The semantics are implied due to the namespace association. The "o:" prefix is linked to the OBOE namespace, which defines the semantics of the properties.
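
To make the prefix-to-namespace link concrete, here is a small Python sketch (standard library only) that reads the `xmlns:` declarations while parsing and expands each prefixed property into a full IRI. The function name `expand_properties` is hypothetical, and the namespace value is the placeholder from the example above rather than the real OBOE IRI.

```python
import io
import xml.etree.ElementTree as ET

SUBJECT = """<subject about="http://some.namespace#1838" xmlns:o="http:/oboe-core#">
  <meta property="o:entity" content="Air"/>
  <meta property="o:unit" content="Celsius"/>
</subject>"""

def expand_properties(xml_text):
    """Map each prefixed property (e.g. 'o:entity') to its full IRI.

    Simplification: prefixes are collected into one flat map, so a
    prefix that is redeclared deeper in the tree is not scoped properly.
    """
    prefixes = {}
    expanded = {}
    source = io.BytesIO(xml_text.encode("utf-8"))
    # 'start-ns' events arrive before the 'start' event of the element
    # that declares the namespace, so the map is ready when needed.
    for event, payload in ET.iterparse(source, events=("start-ns", "start")):
        if event == "start-ns":
            prefix, uri = payload
            prefixes[prefix] = uri
        elif payload.tag == "meta":
            curie = payload.get("property")
            prefix, _, local = curie.partition(":")
            expanded[curie] = prefixes[prefix] + local
    return expanded

expanded = expand_properties(SUBJECT)
print(expanded["o:entity"])
```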

@cboettig Regarding your second point, we have an XSLT as part of Metacat that can do a minimal EML->RDF translation, which was used in some early work on supporting LSIDs. It's minimal, but a decent starting point.

mbjones commented 10 years ago

@emhart After re-reading your comment, I think I misinterpreted your question.

cboettig commented 10 years ago

@mbjones awesome. Yeah, minimal is fine; it might be a nice proof of concept to include along with our other stylesheets to see what users find most useful. Can you link me to the XSLT? With an increasing number of resources being available in RDF, that particular case becomes more compelling. Doesn't DataONE have an associated triplestore? At this stage I imagine the Solr queries are more useful, but who knows.

Also, @emhart might have mentioned to you that he and I have had some discussions about his work at NEON and in providing EML for some of their data products. It sounds like a semantically enhanced EML could be particularly promising in that case.

I'm still getting my head wrapped around SPARQL queries but the ability to do these from R, as in my example linked above, is a nice touch.

mbjones commented 10 years ago

@cboettig As I said, it's rudimentary and is no longer maintained, so it's out of date with respect to the current EML version. But you can find the XSLT here: https://code.ecoinformatics.org/code/metacat/trunk/lib/lsid_conf/eml-2.0.1.xslt

cboettig commented 10 years ago

Okay, decided I might learn some really basic XSLT by writing a stylesheet to pull some standard Dublin Core-type terms from /eml/dataset and provide them as RDF. No idea if this implementation would really be best practice, but it's a fun learning exercise anyhow.

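library(Sxslt)  # XSLT bindings for R
# Example EML file shipped with the reml package
infile <- system.file("examples", "hf205.xml", package = "reml")
# Apply the EML-to-RDF stylesheet (path is relative to the package source root)
xsltApplyStyleSheet(infile, "inst/xsl/eml211_to_rdf.xsl")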

[ ] Not clear how I can get the dc: namespace definition to appear in one of the parent nodes; XSLT adds it to each dc:-prefixed element explicitly.
[ ] Not sure why my XSL isn't getting the packageId for the about attribute.

Could easily be extended and could no doubt be improved upon.
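
For comparison, the same mapping can be sketched outside XSLT. The Python below (standard library only) is an illustration under stated assumptions, not a port of the stylesheet: the EML document is made up and far smaller than a real file, and `eml_to_dc_rdf` is a hypothetical helper. It does touch both open items: prefixes registered with ElementTree are declared once on the root element, and packageId is read from the root element, which sits in the EML namespace while `dataset` does not. Whether that namespace difference is what tripped up the stylesheet is only a guess.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
# Registered prefixes are declared once, on the serialized root element.
ET.register_namespace("rdf", RDF)
ET.register_namespace("dc", DC)

# Made-up minimal EML document, for illustration only.
EML = """<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"
         packageId="example.1.1" system="example">
  <dataset>
    <title>Example dataset title</title>
    <creator>
      <individualName><givenName>Ada</givenName><surName>Example</surName></individualName>
    </creator>
  </dataset>
</eml:eml>"""

def eml_to_dc_rdf(eml_text):
    """Build a small RDF/XML description with dc:title and dc:creator."""
    eml = ET.fromstring(eml_text)
    rdf = ET.Element(f"{{{RDF}}}RDF")
    # packageId is an attribute of the root element, not of <dataset>.
    # Note: a bare packageId is not an absolute IRI; a real mapping
    # would likely resolve it against some base.
    desc = ET.SubElement(
        rdf, f"{{{RDF}}}Description", {f"{{{RDF}}}about": eml.get("packageId")}
    )
    dataset = eml.find("dataset")
    ET.SubElement(desc, f"{{{DC}}}title").text = dataset.findtext("title")
    for creator in dataset.findall("creator"):
        given = creator.findtext("individualName/givenName", default="")
        sur = creator.findtext("individualName/surName", default="")
        ET.SubElement(desc, f"{{{DC}}}creator").text = f"{given} {sur}".strip()
    return ET.tostring(rdf, encoding="unicode")

rdf_xml = eml_to_dc_rdf(EML)
print(rdf_xml)
```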

mbjones commented 10 years ago

FYI, we also have a minimal EML -> DC XSLT that is used to produce DC RDF for our Metacat OAI-PMH implementation. See https://code.ecoinformatics.org/code/metacat/trunk/lib/oaipmh/

Might be useful to you for comparison.

cboettig commented 10 years ago

@mbjones those are beautiful! Yes, nice to see how that's done. I think these stylesheets could be useful to reml users seeking to do some triples extraction and manipulation of a bunch of XML with the rrdf package, for instance, so I'll include them in the xsl collection.

Ultimately it would be nice to include things beyond the Dublin Core so that we can do more with the semantics (I realize that's outside of the OAI-PMH use case for the XSLT you link, but just as an extension). For instance, it might be nice to express taxonomicCoverage using terms from a vocabulary such as the VTO. Then one could construct a SPARQL query to say things like "give me all datasets covering frogs".
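
The "all datasets covering frogs" query depends on walking the taxonomy: a dataset annotated with a genus should match a query for the order that contains it. A minimal Python sketch of that reasoning follows. Everything in it is hypothetical: the identifiers are invented rather than real VTO terms, the hierarchy is simplified, and `ex:taxonomicCoverage` is an assumed predicate name.

```python
# Hypothetical triples; identifiers are invented and the hierarchy is simplified.
SUBCLASS = "rdfs:subClassOf"
COVERS = "ex:taxonomicCoverage"  # assumed predicate name

TRIPLES = [
    ("taxon:Hyla", SUBCLASS, "taxon:Hylidae"),
    ("taxon:Hylidae", SUBCLASS, "taxon:Anura"),
    ("taxon:Anura", SUBCLASS, "taxon:Amphibia"),
    ("taxon:Ambystoma", SUBCLASS, "taxon:Caudata"),
    ("taxon:Caudata", SUBCLASS, "taxon:Amphibia"),
    ("dataset:A", COVERS, "taxon:Hyla"),
    ("dataset:B", COVERS, "taxon:Ambystoma"),
    ("dataset:C", COVERS, "taxon:Anura"),
]

def descendants(triples, taxon):
    """Return the taxon plus everything below it via subClassOf."""
    found = {taxon}
    frontier = [taxon]
    while frontier:
        parent = frontier.pop()
        for s, p, o in triples:
            if p == SUBCLASS and o == parent and s not in found:
                found.add(s)
                frontier.append(s)
    return found

def datasets_covering(triples, taxon):
    """Datasets whose coverage is the taxon or any of its descendants.

    Roughly the SPARQL 1.1 pattern:
      ?d ex:taxonomicCoverage ?t . ?t rdfs:subClassOf* taxon:Anura
    """
    taxa = descendants(triples, taxon)
    return sorted({s for s, p, o in triples if p == COVERS and o in taxa})

frogs = datasets_covering(TRIPLES, "taxon:Anura")
print(frogs)
```

Here the frog query returns datasets A and C but not the salamander dataset B, which is the kind of result a free-text taxonomicCoverage field cannot give without string matching.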

ropensci / EML

Tools to facilitate serializing/parsing of RDFa in additionalMetadata #60