ropensci / EML

Ecological Metadata Language interface for R: synthesis and integration of heterogenous data
https://docs.ropensci.org/EML

EML::eml_validate conflicts with knb.ecoinformatics.org parser & appears to introduce invalid xml into valid files #348

Open RobLBaker opened 1 year ago

RobLBaker commented 1 year ago

This is fairly odd behavior and may be specific to older EML files. I suspect it has to do with the eml_validate() function not being backwards-compatible with EML schema 2.1.1. However, specifying schema 2.1.1 changes the problem without solving it.

I downloaded an older data package with metadata built under EML 2.1.1. I checked the validity of the EML file using https://knb.ecoinformatics.org/emlparser and found that it passed both the XML-specific and EML-specific tests. I then read the file into R, where EML::eml_validate() reported that it contained invalid EML. When I wrote it back to .xml, some aspects of the original EML file had been rearranged. When I re-ran the parser tests at knb.ecoinformatics.org, the newly exported file failed the XML-specific tests. Is the EML package introducing invalid XML into (valid?) EML-formatted .xml files?

I downloaded the following data package: https://portal.edirepository.org/nis/mapbrowse?packageid=knb-lter-and.4780.4 and ran the file, "knb-lter-and.4780.4.xml" through the EML parser at https://knb.ecoinformatics.org/emlparser/. The file passed both XML-specific and EML-specific tests.

I read the file into R using EML::read_eml(), checked whether it validated using EML::eml_validate() (with schemas 2.1.1 and 2.2.0), and then wrote it back to XML using EML::write_eml():

mymeta<-EML::read_eml("knb-lter-and.4780.4.xml", from="xml")

The EML does not validate using schema 2.2.0. Perhaps this is not unexpected, given that it was created under 2.1.1:

EML::eml_validate(mymeta)
[1] FALSE
attr(,"errors")
[1] "Element 'boundingCoordinates': This element is not expected. Expected is one of ( geographicDescription, references )."
(and 17 additional identical errors are listed)

Switched to schema 2.1.1:

emld::eml_version("eml-2.1.1")
[1] "eml-2.1.1"
EML::eml_validate(mymeta)
[1] FALSE
attr(,"errors")
[1] "Element '{https://eml.ecoinformatics.org/eml-2.2.0}eml': The attribute 'packageId' is required but missing."
[2] "Element '{https://eml.ecoinformatics.org/eml-2.2.0}eml': The attribute 'system' is required but missing."
[3] "Element '{https://eml.ecoinformatics.org/eml-2.2.0}eml': Missing child element(s). Expected is one of ( access, dataset, citation, software, protocol )."

In this case the EML still doesn't validate, but the errors are different. The namespace in the messages suggests that, despite switching to schema 2.1.1, the eml_validate function is still checking against version 2.2.0. At the same time, it no longer reports problems with the geography elements (or is it simply not reporting them?).
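One way to check which schema namespace a document actually declares is to look at its root element with xml2. This is a minimal sketch, assuming the xml2 package is installed. The inline document is a trimmed stand-in for the real file, and eml_namespace() is a hypothetical helper, not part of EML or emld:

```r
library(xml2)

# Hypothetical helper: return the URI bound to the "eml" prefix on a parsed document.
eml_namespace <- function(doc) {
  ns <- xml_ns(doc)                          # named vector of prefix -> URI
  as.character(ns)[names(ns) == "eml"]
}

# Trimmed stand-in for an EML 2.1.1 file (only the root element and namespace).
doc <- read_xml(
  '<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"><dataset/></eml:eml>'
)

print(eml_namespace(doc))
```

Running the same helper on read_xml("knb-lter-and.4780.4.xml") and read_xml("exportedEML.xml") should show whether the namespace changes during the round trip.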

In any case, I can then write the object back to .xml:

EML::write_eml(mymeta, "exportedEML.xml")
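For anyone trying to reproduce this, the steps so far can be collected into a single script. This is only a sketch: it assumes the EML and emld packages are installed and that the downloaded file is in the working directory. roundtrip_eml() is a hypothetical wrapper written for this report, and passing a file path to eml_validate() is my reading of its documentation rather than something tested above:

```r
# Hypothetical wrapper around the calls used in this report.
roundtrip_eml <- function(infile, outfile, version = "eml-2.1.1") {
  meta <- EML::read_eml(infile, from = "xml")
  emld::eml_version(version)                 # select the schema version, as in the report
  parsed_ok <- EML::eml_validate(meta)       # validate the object held in R
  EML::write_eml(meta, outfile)
  written_ok <- EML::eml_validate(outfile)   # validate the file written back out (assumed to accept a path)
  list(parsed = parsed_ok, written = written_ok)
}

infile <- "knb-lter-and.4780.4.xml"
if (file.exists(infile)) {
  res <- roundtrip_eml(infile, "exportedEML.xml")
  print(res$parsed)
  print(attr(res$written, "errors"))
}
```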

The newly exported "exportedEML.xml" file now contains the namespace conflicts described in issue #347, even though I specified the EML 2.1.1 schema before calling EML::write_eml().

When I now check the exportedEML.xml file using the EML parser at https://knb.ecoinformatics.org/emlparser/ I find that although it passes EML-specific tests, it fails XML-specific tests:

XML specific tests: Failed
The following errors were found:
cvc-complex-type.2.4.a: Invalid content was found starting with element 'boundingCoordinates'. One of '{geographicDescription, references}' is expected.

Has the EML package introduced invalid XML into the file?

Further comparison of the .xml files indicates that various elements within the original knb-lter-and.4780.4.xml have been rearranged in exportedEML.xml. Specifically, in the original knb file, there are 18 elements listed under <spatialSamplingUnits> with the following general format:

<spatialSamplingUnits>
     <coverage>
          <geographicDescription>HJA Phenology Sites</geographicDescription>
          <boundingCoordinates>
               <westBoundingCoordinate>-122.26083000</westBoundingCoordinate>
               <eastBoundingCoordinate>-122.11159208</eastBoundingCoordinate>
               <northBoundingCoordinate>44.28199677</northBoundingCoordinate>
               <southBoundingCoordinate>44.20198189</southBoundingCoordinate>
               <boundingAltitudes>
                    <altitudeMinimum>1314</altitudeMinimum>
                    <altitudeMaximum>1314</altitudeMaximum>
                    <altitudeUnits>meter</altitudeUnits>
               </boundingAltitudes>
          </boundingCoordinates>
     </coverage>

Whereas in the exportedEML.xml file, the corresponding elements have the following arrangement:

<spatialSamplingUnits>
     <coverage>
          <boundingCoordinates>
               <westBoundingCoordinate>-122.26083000</westBoundingCoordinate>
               <eastBoundingCoordinate>-122.11159208</eastBoundingCoordinate>
               <northBoundingCoordinate>44.28199677</northBoundingCoordinate>
               <southBoundingCoordinate>44.20198189</southBoundingCoordinate>
               <boundingAltitudes>
                    <altitudeMinimum>1314</altitudeMinimum>
                    <altitudeMaximum>1314</altitudeMaximum>
                    <altitudeUnits>meter</altitudeUnits>
               </boundingAltitudes>
          </boundingCoordinates>
          <geographicDescription>HJA Phenology Sites</geographicDescription>
     </coverage>

As you can see, the children of <coverage> have been rearranged into alphabetical order, which seems to be the default approach for EML::write_eml when handling invalid EML. But was the EML invalid in this case? knb's EML parser says it was valid. If the original file was valid EML, then the EML package appears to be taking valid EML and turning it into an invalid format that passes neither the XML tests nor the EML::eml_validate test. Either way, I would not expect reading and then writing a (valid?) EML file to introduce these sorts of changes.
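The reordering can also be checked programmatically instead of by eye. Below is a minimal sketch using xml2 on two inline fragments that mirror the excerpts above (coordinates trimmed for brevity). description_first() is a hypothetical helper, and the XPath assumes the <coverage> elements are not namespace-qualified, as in these excerpts:

```r
library(xml2)

# Hypothetical helper: for each <coverage>, is <geographicDescription> the first child?
description_first <- function(doc) {
  cov <- xml_find_all(doc, "//coverage")
  vapply(seq_along(cov), function(i) {
    kids <- xml_name(xml_children(cov[[i]]))
    length(kids) > 0 && kids[1] == "geographicDescription"
  }, logical(1))
}

# Order found in the original file.
original <- read_xml('<spatialSamplingUnits><coverage>
  <geographicDescription>HJA Phenology Sites</geographicDescription>
  <boundingCoordinates><westBoundingCoordinate>-122.26083000</westBoundingCoordinate></boundingCoordinates>
</coverage></spatialSamplingUnits>')

# Order found in the exported file.
exported <- read_xml('<spatialSamplingUnits><coverage>
  <boundingCoordinates><westBoundingCoordinate>-122.26083000</westBoundingCoordinate></boundingCoordinates>
  <geographicDescription>HJA Phenology Sites</geographicDescription>
</coverage></spatialSamplingUnits>')

print(description_first(original))
print(description_first(exported))
```

Pointing read_xml() at the two real files instead of the inline fragments should return one value per <coverage> element, which makes it easy to count how many were reordered.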

jeanetteclark commented 1 year ago

Hi @RobLBaker, I tracked down the problem and wrote up an issue over in the sister repository emld. Thanks for the report.