Certain metadata is getting lost upon reading NeXML

hlapp commented 5 years ago

I found this from looking at top-level metadata; it's possible that other metadata elsewhere is getting lost too:

f <- system.file("examples", "ontotrace-result.xml", package="RNeXML")
nex <- read.nexml(f)
get_metadata(nex, level = "nexml")
#   LiteralMeta       property datatype content    xsi.type
# 1          NA     dc:creator       NA      NA LiteralMeta
# 2          NA dc:description       NA      NA LiteralMeta

dc:creator is indeed empty in the file, but dc:description is not:

  <meta xsi:type="LiteralMeta" property="dc:creator" />
  <meta xsi:type="LiteralMeta" property="dc:description">Generated from the Phenoscape Knowledgebase on 2015-10-21 by Ontotrace query:
* taxa: &lt;http://purl.obolibrary.org/obo/VTO_0036217&gt;
* entities: &lt;http://purl.obolibrary.org/obo/BFO_0000050&gt; some &lt;http://purl.obolibrary.org/obo/UBERON_0008897&gt;</meta>

The problem isn't in the get_metadata() function; the content is already missing from the parsed object:

nex@meta[[2]]@property
# [1] "dc:description"
nex@meta[[2]]@content
# character(0)

Other metadata in the file are coming back fine (though these are not LiteralMeta, if that's got something to do with it):

get_metadata(nex, level = "otus/otu")[1:3,]
#   ResourceMeta             rel                                       href     xsi.type
# 1           NA     dwc:taxonID http://purl.obolibrary.org/obo/VTO_0036225 ResourceMeta
# 2           NA rdfs:subClassOf http://purl.obolibrary.org/obo/VTO_0036217 ResourceMeta
# 3           NA     dwc:taxonID http://purl.obolibrary.org/obo/VTO_0061498 ResourceMeta
#           otu                                  otus
# 1 VTO_0036225 t0d4df580-2d92-4166-8518-a76116df5295
# 2 VTO_0036225 t0d4df580-2d92-4166-8518-a76116df5295
# 3 VTO_0061498 t0d4df580-2d92-4166-8518-a76116df5295

Any ideas @cboettig?

cboettig commented 5 years ago

Looks like RNeXML is assuming that literal metadata nodes always carry their value in the content attribute, and isn't parsing the text content of the node itself. The attribute form is what we see in TreeBASE, e.g.:

  <meta content="Mycologia" datatype="xsd:string" id="meta17" property="dc:publisher" xsi:type="nex:LiteralMeta"/>
  <meta content="Mycologia" datatype="xsd:string" id="meta16" property="prism:publicationName" xsi:type="nex:LiteralMeta"/>

Guess it shouldn't be assuming that. I guess only resource meta nodes can be nested(?), so it should be okay to just parse the contents of any literal meta node? Presumably the schema doesn't allow both a content attribute and literal text content in the same node?

hlapp commented 5 years ago

The LiteralMeta schema definition says the following:

If the @content attribute is used, then the element should contain no children.

So I guess it's not enforced that it's one way or the other, but it seems it would be fair to just go by the guidance.
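
To make that guidance concrete, here is a rough sketch of the rule in base R: use the content attribute when it is present, otherwise fall back to the text of the node. The helper name literal_meta_content() and its arguments are hypothetical, not anything in RNeXML. It assumes the attributes arrive as a named character vector and the node text as a single string, roughly what XML::xmlAttrs() and XML::xmlValue() would give you.

```r
# Hypothetical helper, not part of RNeXML: pick the value of a LiteralMeta node.
# `attrs` is a named character vector of the node's attributes;
# `text` is the node's text content (NULL or "" for an empty element).
literal_meta_content <- function(attrs, text = NULL) {
  if ("content" %in% names(attrs)) {
    # Schema guidance: a node that uses @content should have no children,
    # so the attribute takes precedence when it is present.
    return(unname(attrs[["content"]]))
  }
  if (is.null(text) || !nzchar(trimws(text))) {
    # Neither an attribute nor any text: nothing to report.
    return(NA_character_)
  }
  text
}

# TreeBASE style: value stored in the content attribute
literal_meta_content(c(property = "dc:publisher", content = "Mycologia"))

# Ontotrace style: value stored as the text of the element
literal_meta_content(c(property = "dc:description"),
                     "Generated from the Phenoscape Knowledgebase")

# Empty element, like dc:creator in the example file
literal_meta_content(c(property = "dc:creator"), "")
```

Letting the attribute win is just one reading of the schema note; a node that has both would arguably deserve a warning instead.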

cboettig commented 5 years ago

See #193

ropensci / RNeXML

Certain metadata is getting lost upon reading NeXML #190