Integrate two datatables based on EML spec and ontology

cboettig commented 11 years ago

This is the holy grail of metadata infrastructure and ostensibly the primary purpose of EML (see Jones et al. 2006). Despite that, integration is not actually possible without semantic definitions as well (see Michener & Jones 2012), from which we adapt the minimal example below.

This example provides minimal, and in places missing, semantics, which may make it unresolvable. A complete semantic solution is diagrammed in the figure from Michener & Jones 2012.

Dataset 1

 dat = data.frame(river=c("SAC", "SAC", "AM"), 
                   spp = c("king", "king", "ccho"), 
                   stg = c("smolt", "parr", "smolt"),
                   ct =  c(293L, 410L, 210L))

 col_metadata = c(river = "http://dbpedia.org/ontology/River",
                  spp = "http://dbpedia.org/ontology/Species", 
                  stg = "Life history stage",
                  ct = "count")

 unit_metadata = 
  list(river = c(SAC = "The Sacramento River", AM = "The American River"),
       spp = c(king = "King Salmon", ccho = "Coho Salmon"),
       stg = c(parr = "third life stage", smolt = "fourth life stage"),
       ct = "number")

Dataset 2

 dat = data.frame(site = c("SAC", "AM", "AM"), 
                   species = c("Chinook", "Chinook", "Silver"), 
                   smct = c(245L, 511L, 199L),
                   pcnt =  c(290L, 408L, 212L))

 col_metadata = c(site = "http://dbpedia.org/ontology/River",
                  species = "http://dbpedia.org/ontology/Species", 
                  smct = "Smolt count",
                  pcnt = "Parr count")

 unit_metadata = 
  list(site = c(SAC = "The Sacramento River", AM = "The American River"),
       species = c(Chinook = "King Salmon", Silver = "Coho Salmon"),
       smct = "number",
       pcnt = "number")
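A minimal sketch of what integrating the two tables involves, assuming the objects are renamed (`dat1`, `dat2`, and so on) so the second dataset does not overwrite the first. The river columns share a URI, and the species codes can be reconciled through their free-text labels, but nothing in the metadata says that `smct` and `pcnt` correspond to the `smolt` and `parr` values of `stg`. The `stage_of_column` mapping below is therefore a hand-written assumption standing in for the missing semantics, and matching species by label string is a stand-in for real ontology-based resolution.

```r
# Dataset 1 is "long": one row per river x species x life stage.
dat1 <- data.frame(river = c("SAC", "SAC", "AM"),
                   spp   = c("king", "king", "ccho"),
                   stg   = c("smolt", "parr", "smolt"),
                   ct    = c(293L, 410L, 210L),
                   stringsAsFactors = FALSE)
spp_labels1 <- c(king = "King Salmon", ccho = "Coho Salmon")

# Dataset 2 is "wide": the life stage is hidden in the column names.
dat2 <- data.frame(site    = c("SAC", "AM", "AM"),
                   species = c("Chinook", "Chinook", "Silver"),
                   smct    = c(245L, 511L, 199L),
                   pcnt    = c(290L, 408L, 212L),
                   stringsAsFactors = FALSE)
spp_labels2 <- c(Chinook = "King Salmon", Silver = "Coho Salmon")

# ASSUMPTION (not in the metadata): which count column is which life stage.
stage_of_column <- c(smct = "smolt", pcnt = "parr")

# Translate dataset 1 into a shared vocabulary.
long1 <- data.frame(river   = dat1$river,
                    species = unname(spp_labels1[dat1$spp]),
                    stage   = dat1$stg,
                    count   = dat1$ct,
                    source  = "dataset1",
                    stringsAsFactors = FALSE)

# Reshape dataset 2 from wide to long, one block of rows per count column.
long2 <- do.call(rbind, lapply(names(stage_of_column), function(col) {
  data.frame(river   = dat2$site,
             species = unname(spp_labels2[dat2$species]),
             stage   = unname(stage_of_column[col]),
             count   = dat2[[col]],
             source  = "dataset2",
             stringsAsFactors = FALSE)
}))

# Both tables now share column names and value vocabularies.
integrated <- rbind(long1, long2)
```

The hand-written pieces here (the renaming of `site` to `river`, the label lookup, and `stage_of_column`) are exactly what a machine would need ontological definitions to infer on its own.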

Figure

ontology_synthesis2

mbjones commented 11 years ago

See comments on this in issue #5 .

cboettig commented 11 years ago

The approaches to semantics you outline in #5 sound promising. I see in principle how annotations (e.g. this one) map onto the EML (e.g. this one), though I can't spot the corresponding lines in the EML that the annotation maps to.

More generally, I am curious about how the user would specify semantics; it seems there will always be a variety of ways to actually implement them. My understanding of semantics is pretty limited, but I imagine the general idea would be to provide URIs in place of definitions. In an ideal linked data world, it wouldn't matter whether we all used the same URIs for species (e.g. ITIS TaxanomicTypeName), since that could be resolved... If the user could manage to specify the URIs, we could then figure out behind the scenes whether to write them to a definition node directly or do something more intelligent like the examples you point to.

Of course, I imagine the first problem is both having URIs for the terms users want and helping users discover them.
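As a rough illustration of "URIs in place of definitions", the sketch below reuses the `col_metadata` vector from Dataset 1 and flags which columns carry a URI and which only have free text. The helper `is_uri` is hypothetical (it is not part of reml or EML) and only checks the string's shape, not whether the URI resolves.

```r
col_metadata <- c(river = "http://dbpedia.org/ontology/River",
                  spp   = "http://dbpedia.org/ontology/Species",
                  stg   = "Life history stage",
                  ct    = "count")

# Hypothetical helper: TRUE where a definition looks like an http(s) URI.
is_uri <- function(x) grepl("^https?://[^[:space:]]+$", x)

has_semantics <- is_uri(col_metadata)

# Columns that still need a URI before any automated integration.
needs_uri <- names(col_metadata)[!has_semantics]
print(needs_uri)  # "stg" "ct"
```

A package could use a check like this to decide whether to write a plain definition node or to emit a richer annotation.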

I suppose for the moment this might be out of the scope of reml...

cboettig commented 11 years ago

I keep thinking about this integration question and trying to wrap my head around the different kinds of semantics involved here. Let's see if I got this right:

We have a vocabulary defined by the EML Schema, which can give semantic meaning to things (e.g. we have a precise notion of the term "genus", the units "gram" and "kilogram", etc), but it is not an ontology (e.g. OWL), so we don't have access to the richer reasoning tools and infrastructure thereof.

We can use the schema definitions (e.g. 'coverage' nodes) to annotate attributes using id and <reference> as described in #9, but this is not commonly done. This would also be weaker than providing ontological definitions of terms, ultimately needed to do the synthesis described at the top of this issue.

So instead, we can annotate EML with the approach @mbjones describes in #5, in which an external XML file provides ontological descriptions of the nodes, as illustrated in some examples. It seems like this is the way forward, given the current EML schema.

@mbjones Are you familiar with the NeXML standard, e.g. as described by Vos et al. 2012?

All elements in a NeXML document—branches and nodes in trees, cells in a matrix, OTUs, and so on—can be identified and given annotations using a generalized system that allows for simple values as well as complex, structured information such as geo-references, taxon concepts, or character-state descriptions. Moreover, data elements can be declared as instances of a class defined in an ontology, making the semantics of the data themselves computable.

@mbjones It seems like they have a more direct way of accomplishing this goal, e.g. a tighter correspondence between the schema's vocabulary and available ontologies? Is this at all instructive for us? There are a lot of shared objectives here (e.g. attaching geo-references and taxon concepts to nodes), so it seems like a common approach would be good. Perhaps your working groups are already talking to each other?

mbjones commented 11 years ago

I have a passing familiarity with NexML, having had to deal with it in Kepler, but what you describe sounds useful. The key in all of these is to have a solid, well-defined identifier for anything that you want to apply an annotation to. The EML id attribute is one of these, and is how we implemented the semtools annotations that I described in #5. Although I said that people don't often apply geospatial and taxon constraints to particular attributes, that is how we intended for the system to work. The additionalMetadata <describes> element provides a general purpose way of annotating any subtree in an EML document. I think this is parallel to what NexML provides through their <Annotated> complex type, with its <about> attribute pointing at a URI. So, the reason for us to do annotations separately in Semtools is exactly this: there are many metadata standards, and each has its own way of describing entities and attributes. The external annotation schema that I cited in #5 provides a mechanism to link annotations to ontologies that is flexible enough to apply to multiple different metadata schemas. It could be inlined inside of an EML additionalMetadata element for ease of use, or it could stand alone as its own independent document. My impression is that NexML assumes these will always be external, as the <Annotated> element uses the about URI attribute for the pointer.

We have not talked with the NexML folks, but it sounds like we should. Do you know that there will be a concentrated emphasis on this via a biodiversity semantics workshop at TDWG this year that Mark Schildhauer is organizing? We also have some work on this coming up via our Semtools project, so hopefully we can all come to an acceptable shared approach.

cboettig commented 11 years ago

@mbjones From @rvosa I understand that NeXML provides this kind of annotation in <meta> nodes as child nodes of the elements they annotate, e.g. here's an excerpt from a list of <node> elements where one has such an annotation:

            <node id="tree2n2" label="n2" otu="t1"/>
            <node id="tree2n3" label="n3"/>
            <node id="tree2n4" about="#tree2n4" label="n4">
                <meta 
                    id="tree2dict1" 
                    property="cdao:has_tag" 
                    content="true" 
                    xsi:type="nex:LiteralMeta"
                    datatype="xsd:boolean"/>
            </node>
            <node id="tree2n5" label="n5" otu="t3"/>
            <node id="tree2n6" label="n6" otu="t2"/>

(Or see richer examples here)

One clever thing about this is that the <meta> nodes use RDFa syntax, so that the data can be extracted by any generic RDFa tool, and can leverage ontologies directly. The external Semtools annotation examples you linked (like this one) look like a powerful way to go about this. Curious if they could exploit the same RDFa trick?

I see there's already a beta schema for the Semtools annotation (sms-semannot.xsd); perhaps you could point me to the documentation for this? I guess we can already generate the annotations for some attributes programmatically (e.g. the standardUnits).
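To make the RDFa idea concrete, here is a minimal sketch that pulls subject/predicate/object values out of the <meta> node in the excerpt above, assuming the xml2 R package is installed. It is not a real RDFa processor: it only reads the attributes as written and does not expand prefixes such as `cdao:` or resolve `#tree2n4` against a base URI. The wrapping `<trees>` element and its `xmlns:xsi` declaration are assumptions added so the fragment parses on its own; a full NeXML file declares a default namespace, which would need namespace-aware XPath.

```r
library(xml2)

# Fragment of the excerpt, wrapped in an assumed root element so it parses.
nexml_fragment <- paste0(
  '<trees xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">',
  '<node id="tree2n3" label="n3"/>',
  '<node id="tree2n4" about="#tree2n4" label="n4">',
  '<meta id="tree2dict1" property="cdao:has_tag" content="true" ',
  'xsi:type="nex:LiteralMeta" datatype="xsd:boolean"/>',
  '</node>',
  '</trees>')

doc <- read_xml(nexml_fragment)

# Every <meta> element is one statement about its parent element.
meta_nodes <- xml_find_all(doc, "//meta")
subjects   <- xml_find_first(meta_nodes, "parent::*")

triples <- data.frame(
  subject   = xml_attr(subjects, "about"),       # what is being described
  predicate = xml_attr(meta_nodes, "property"),  # ontology term (as a prefix:name pair)
  object    = xml_attr(meta_nodes, "content"),   # literal value
  datatype  = xml_attr(meta_nodes, "datatype"),
  stringsAsFactors = FALSE)

print(triples)
```

A generic RDFa tool does the same walk but also expands the prefixes into full URIs, which is what lets the result be queried against an ontology.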

rvosa commented 11 years ago

Hi Matt, Carl,

nice to be in touch about this, and interesting to see how you guys are dealing with the same challenges. To give a more complete, applied example of the RDFa annotations, have a look at this TreeBASE study dump: https://github.com/rvosa/supertreebase/blob/master/data/treebase/S100.xml

(Carl, this directory holds all of TreeBASE, as you asked.)

What we're trying to do is embed the metadata about the study (publication data, GUIDs for taxa) inside the data file so that it can be extracted with generic tools. To wit, here are the triples that are thus generated:

http://www.w3.org/2012/pyRdfa/extract?uri=https%3A%2F%2Fraw.github.com%2Frvosa%2Fsupertreebase%2Fmaster%2Fdata%2Ftreebase%2FS100.xml&format=turtle&rdfagraph=output&vocab_expansion=false&rdfa_lite=false&embedded_rdf=true&space_preserve=true&vocab_cache=true&vocab_cache_report=false&vocab_cache_refresh=false

Cheers,

Rutger

cboettig commented 11 years ago

@mbjones Guess I should learn to use a computer. I see there's already a lot of information about the semtools approach here: https://code.ecoinformatics.org/code/semtools/trunk/dev/sms/README.txt

I suppose we can follow a similar approach to morpho of providing a semtools R package that could be used as a 'plugin' with reml.

ropensci / EML