ropensci / EML

Ecological Metadata Language interface for R: synthesis and integration of heterogenous data
https://docs.ropensci.org/EML

namespace conflict introduced when importing/exporting EML generated under older schema #347

Open RobLBaker opened 1 year ago

RobLBaker commented 1 year ago

I ran across an interesting issue with an older EML file. I downloaded the file, imported it using EML::read_eml(), and then wrote it back to .xml using EML::write_eml(). The result was a corrupted EML file with conflicting namespace declarations that nevertheless passes the EML::eml_validate() validation check.

I downloaded a data package from EDI: https://portal.edirepository.org/nis/mapbrowse?packageid=knb-lter-and.4780.4

The file knb-lter-and.4780.4.xml is an EML formatted file. Upon download, the initial eml tag in knb-lter-and.4780.4.xml looks like so:

<eml:eml xmlns:ds="eml://ecoinformatics.org/dataset-2.1.1" xmlns:eml="eml://ecoinformatics.org/eml-2.1.1" xmlns:stmml="http://www.xml-cml.org/schema/stmml-1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" packageId="knb-lter-and.4780.4" system="https://pasta.edirepository.org/" xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 http://nis.lternet.edu/schemas/EML/eml-2.1.1/eml.xsd">

I then imported to R with EML::read_eml and wrote it back to .xml:

mymeta <- EML::read_eml("knb-lter-and.4780.4.xml", from = "xml")
View(mymeta)
EML::write_eml(mymeta, "exportedEML.xml")

And when I open the new "exportedEML.xml" file I see:

<eml:eml xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:stmml="http://www.xml-cml.org/schema/stmml-1.2" xmlns:ds="eml://ecoinformatics.org/dataset-2.1.1" packageId="knb-lter-and.4780.4" xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 http://nis.lternet.edu/schemas/EML/eml-2.1.1/eml.xsd" system="https://pasta.edirepository.org">

It appears that even though the xmlns:eml attribute is now eml-2.2.0, the schema location (xsi:schemaLocation=) and xmlns:ds both still indicate the original EML 2.1.1.
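The mismatch described above can be detected mechanically. The following is a minimal base R sketch, assuming the root tag is available as a single string; `eml_versions_in_root()` is a hypothetical helper written for this illustration and is not part of the EML package. It pulls the first x.y.z version number out of `xmlns:eml`, `xmlns:ds`, and `xsi:schemaLocation` and flags the tag when they disagree.

```r
# Hypothetical helper (not part of the EML package): report the x.y.z version
# found in selected attributes of a root <eml:eml> tag supplied as a string.
eml_versions_in_root <- function(root_tag,
                                 attrs = c("xmlns:eml", "xmlns:ds",
                                           "xsi:schemaLocation")) {
  vapply(attrs, function(a) {
    # Quoted attribute, e.g. xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"
    m <- regmatches(root_tag, regexpr(paste0(a, '="[^"]*"'), root_tag))
    if (length(m) == 0) return(NA_character_)
    # First x.y.z version number inside the attribute value
    v <- regmatches(m, regexpr("[0-9]+\\.[0-9]+\\.[0-9]+", m))
    if (length(v) == 0) NA_character_ else v
  }, character(1))
}

# Root tag as written by write_eml() in this report, abbreviated to the
# three attributes of interest.
exported <- paste0(
  '<eml:eml xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0" ',
  'xmlns:ds="eml://ecoinformatics.org/dataset-2.1.1" ',
  'xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 ',
  'http://nis.lternet.edu/schemas/EML/eml-2.1.1/eml.xsd">'
)

versions <- eml_versions_in_root(exported)
# A conflict exists when more than one distinct version is declared
has_conflict <- length(unique(versions[!is.na(versions)])) > 1
print(versions)
print(has_conflict)
```

This is a string-matching check only; a real implementation would more likely parse the document with an XML parser rather than use regular expressions on the raw tag.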

Both files validate using EML::eml_validate(). I assume this is because the EML package does not actually use the namespace declared within the EML file to identify the schema to validate against, but instead has that namespace hardcoded elsewhere.

I understand it is possible to tell EML to switch between schema versions, but I still think this qualifies as a potential bug. I can see users generating an EML file under one schema and (perhaps years later) updating it under a second schema. In that scenario, this namespace conflict is easily introduced. If the default is to update everything to the latest schema, that should be done consistently.

On a side note, it would be nice to preserve the evolution of an EML file if it is edited under multiple different schemas during its lifetime (for instance, as a data package is incrementally extended and versioned). But I think there is likely a better place to systematically implement that version history than the eml namespace.