ropensci / EML

Ecological Metadata Language interface for R: synthesis and integration of heterogenous data
https://docs.ropensci.org/EML

read_eml errors when reading a valid EML file #227

Closed clnsmth closed 5 years ago

clnsmth commented 6 years ago

Hello everyone,

I am using read_eml from the EML library (1.0.3) to import EML from the Environmental Data Initiative (EDI) data archive as part of a workflow, and I occasionally get errors when reading what appear to be valid EML files. The issue seems to arise in emlToS4, but that's as far as I've been able to track it. Below are my troubleshooting notes. Any help resolving this issue is much appreciated. Thanks!

# read_eml errors out when inputting otherwise valid EML documents. 
# Validity has been confirmed by 3 sources:
# 1. The EDI data archive quality checker.
# 2. The Oxygen XML editor validation check.
# 3. The XML R library validation check (results below).

# Three valid EML files that result in error:
file_in <- "http://pasta.lternet.edu/package/metadata/eml/knb-lter-mcr/6/55"
file_in <- "http://pasta.lternet.edu/package/metadata/eml/knb-lter-mcr/7/29"
file_in <- "http://pasta.lternet.edu/package/metadata/eml/knb-lter-mcr/12/16"

# Three valid EML files that do not result in error:
file_in <- "http://pasta.lternet.edu/package/metadata/eml/knb-lter-mcr/33/25"
file_in <- "http://pasta.lternet.edu/package/metadata/eml/knb-lter-mcr/1036/6"
file_in <- "http://pasta.lternet.edu/package/metadata/eml/knb-lter-mcr/5001/10"

# Example error
library(EML)
file_in <- "http://pasta.lternet.edu/package/metadata/eml/knb-lter-mcr/6/55"
eml <- read_eml(file_in)

Error in checkSlotAssignment(object, name, value) : 
  assignment of an object of class “eml:language” is not valid for slot ‘language’ in an object of class “checkConstraint”; is(value, "xml_attribute") is not TRUE

# Use XML library to read in XML and validate relative to the EML schema.
library(XML)
xml <- xmlParse(file_in)
xsd <- xmlParse("http://nis.lternet.edu/schemas/EML/eml-2.1.1/eml.xsd", isSchema = TRUE)
xmlSchemaValidate(xsd, xml)

$status
[1] 0

$errors
list()
attr(,"class")
[1] "XMLStructuredErrorList"

attr(,"class")
[1] "XMLSchemaValidationResults"

# NOTE: status == 0 indicates valid xml
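
The notes above reassign file_in one URL at a time. A minimal sketch of how the same comparison could be run in one pass is given here, assuming a hypothetical helper named check_reads that is not part of the EML package. It applies a reader function to each input and records whether the call errored. A stand-in reader and shortened placeholder inputs are used so the sketch runs offline; in the workflow above the reader would be read_eml and the inputs would be the full URLs.

```r
# Hypothetical helper (not part of EML): apply `reader` to each input and
# record whether it succeeded, plus the error message if it did not.
check_reads <- function(inputs, reader) {
  results <- lapply(inputs, function(x) {
    tryCatch({
      reader(x)
      data.frame(input = x, ok = TRUE, message = NA_character_,
                 stringsAsFactors = FALSE)
    }, error = function(e) {
      data.frame(input = x, ok = FALSE, message = conditionMessage(e),
                 stringsAsFactors = FALSE)
    })
  })
  do.call(rbind, results)
}

# Stand-in reader that simulates a failure for one input, so the sketch
# needs neither the network nor the EML package.
fake_reader <- function(x) {
  if (grepl("/6/55$", x)) stop("simulated slot assignment error")
  invisible(x)
}

# Placeholder inputs standing in for the full archive URLs listed above.
inputs <- c("knb-lter-mcr/6/55", "knb-lter-mcr/33/25")
report <- check_reads(inputs, fake_reader)
print(report)
```

Collecting the messages in a data frame makes it easy to see whether every failing file reports the same slot assignment error.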

amoeba commented 6 years ago

Very interesting. I'm still looking into this but I can confirm the error on my end. Here's the stack trace after a read_eml that produces this error to save anyone else time while debugging:

> traceback()
31: stop(gettextf("assignment of an object of class %s is not valid for slot %s in an object of class %s; is(value, \"%s\") is not TRUE", 
        dQuote(valueClass), sQuote(name), dQuote(cl), slotClass), 
        domain = NA)
30: checkSlotAssignment(object, name, value)
29: `slot<-`(`*tmp*`, "language", value = character(0))
28: .local(.Object, ...)
27: initialize(value, ...)
26: initialize(value, ...)
25: new("checkConstraint")
24: .class1(object)
23: as(checkConstraint, "checkConstraint")
22: .local(.Object, ...)
21: initialize(value, ...)
20: initialize(value, ...)
19: new(node_name)
18: FUN(X[[i]], ...)
17: lapply(children[i], xml_to_s4)
16: initialize(value, ...)
15: initialize(value, ...)
14: new(listclass, lapply(children[i], xml_to_s4))
13: listof(children, child, i)
12: parse_xml(child, children, cls)
11: FUN(X[[i]], ...)
10: lapply(children[i], xml_to_s4)
9: initialize(value, ...)
8: initialize(value, ...)
7: new(listclass, lapply(children[i], xml_to_s4))
6: listof(children, child, i)
5: parse_xml(child, children, cls)
4: xml_to_s4(children[[i]])
3: parse_xml(child, children, cls)
2: emlToS4(node)
1: read_eml(path)

clnsmth commented 6 years ago

Hi @amoeba. Thanks for your prompt engagement with this issue. After some brainstorming with the MCR information manager Gastil Gastil-Buhl and @cgries, we have further isolated the issue to the constraint node located at /eml/dataset/dataTable/constraint. Files that contain this node produce the read_eml error listed above; files without it do not. I hope this helps!
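
One way to check a file for this node before calling read_eml is sketched here with the XML library already used in the notes above. The helper name has_constraint and the two tiny documents are made up for illustration and are not valid EML; the sketch only assumes that the dataset, dataTable and constraint elements carry no namespace prefix, as in the path given above.

```r
library(XML)

# Hypothetical helper: TRUE if any dataTable in the parsed document
# has a constraint child element.
has_constraint <- function(doc) {
  length(getNodeSet(doc, "//dataset/dataTable/constraint")) > 0
}

# Minimal made-up documents for illustration only (not valid EML).
with_constraint <- xmlParse(
  '<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1">
     <dataset><dataTable><constraint/></dataTable></dataset>
   </eml:eml>',
  asText = TRUE
)
without_constraint <- xmlParse(
  '<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1">
     <dataset><dataTable/></dataset>
   </eml:eml>',
  asText = TRUE
)

print(has_constraint(with_constraint))
print(has_constraint(without_constraint))
```

For a real file, the same helper could be applied to the result of xmlParse(file_in).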

cboettig commented 6 years ago

@clnsmth Thanks for the bug report and the narrowing down, that definitely helps.

Error can be reproduced with:

new("constraint")

so something is indeed wrong with this class. The S4 initialization method for this class seems to be confusing the lang node attribute (<node lang="en">) with the language node element (<language>) in EML. Will need to dig deeper.
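
For readers unfamiliar with S4 slot checks, here is a minimal sketch of how this kind of message arises. The class names demo_attribute, demo_language and demo_constraint are hypothetical stand-ins, not the EML package's actual definitions. A slot declared to hold an attribute-like class rejects an object of an unrelated element-like class when it is assigned through `slot<-`, which is the call shown in the traceback above.

```r
library(methods)

# Hypothetical stand-ins for illustration; not EML's real class definitions.
setClass("demo_attribute", contains = "character")
setClass("demo_language", representation(value = "character"))
setClass("demo_constraint", representation(language = "demo_attribute"))

obj <- new("demo_constraint")

# Assigning an object of the declared slot class passes the check.
slot(obj, "language") <- new("demo_attribute", "en")

# Assigning an object of the unrelated element class fails the check;
# the error message is captured instead of stopping the script.
msg <- tryCatch({
  slot(obj, "language") <- new("demo_language", value = "english")
  NA_character_
}, error = function(e) conditionMessage(e))

print(msg)
```

The captured message has the same "is not TRUE" form as the one reported at the top of this issue, with the stand-in class names in place of the EML ones.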

cboettig commented 5 years ago

Looks like these parsing issues are resolved in the candidate 2.0 release of EML, which reads the above file without error.