ropensci / emld

:package: JSON-LD representation of EML
https://docs.ropensci.org/emld
Other
13 stars 6 forks source link

`as_eml_document`doesn't account for multiple types of `coverage` element #71

Open jeanetteclark opened 1 year ago

jeanetteclark commented 1 year ago

Over in EML we got a bug report that the EML package wasn't recognizing valid EML as being valid.

The problem stemmed from the element:

<sampling>
    <spatialSamplingUnits>
          <coverage>
            <geographicDescription>
...

After digging into the eml-2.x.x.json file, which is the backbone of the serialization process within as_eml_document, it seems that there is only one flavor of coverage included, of the eml-dataset:coverage variety, which requires a geographicCoverage prior to the geographicDescription element.

In the above example, however, the coverage shown above is eml-methods:coverage, which does not require the geographicCoverage child, since it has a geographicCoverage type.

This is definitely a bug, since the document was indeed valid, though I'm unsure as to how to proceed with fixing it. Any input from @cboettig or @mbjones would be most welcome

cboettig commented 1 year ago

Thanks @jeanetteclark , really appreciate the help here. I'm still a bit confused as the minimal reprex of the problem though. Is this error really coming from something emld does or is the XML file in question considered invalid by xml2::xml_validate ?

cboettig commented 1 year ago

ok never mind, reading more closely now. I see your point about

https://github.com/ropensci/emld/blob/master/data-raw/eml-2.2.0.json#L20

and your point about different flavors of the element which are obviously neglected under the data model there (which essentially ignores the namespace). XML is so much richer than JSON, it's hard to program against!

On a conceptual / philosophical level, is there a good reason for spatialSamplingUnits to use "coverage" to refer to a complexNode that appears to correspond to a node we would call "geographicCoverage" elsewhere? I know this is how it is in the schema and so we're stuck with it, but it feels a bit like a schema bug to me.

Anyway, we can probably hack around it in various ways, but probably none of them very elegant.

My intuition is that we should change:

https://github.com/ropensci/emld/blob/master/data-raw/eml-2.2.0.json#L127

to specify that the child element of spatialSamplingUnits is a geographicCoverage element as far as emld's json file is concerned. We would then need some additional code to monkey-patch around this issue when serializing / parsing from XML -- i.e. literally renaming any node spatialSamplingUnits/coverage to spatialSamplingUnits/geographicCoverage on parsing from XML and the reverse when serializing back to XML.

@mbjones / @jeanetteclark lemme know if that sounds reasonable or if I totally missed the boat here.

mbjones commented 1 year ago

Sounds reasonable to me. And it sounds like the problem would be solved by your solution, despite it being a bit ugly. I don't expect the EML schema to change for something like this, as it is perfectly valid (and unambiguous) in XML, and there's a lot of history to these schemas in use. So, I support whatever patching is needed in emld to force the incompatible json-ld model to match the actual eml.xsd model for validation.

jeanetteclark commented 1 year ago

Seems like a reasonable solution to me as well

cboettig commented 1 year ago

Thanks @mbjones , sounds like a plan!

@jeanetteclark any chance you want to take a stab at this? (I think it should just be an xpath query wrapped in after reading from xml or writing to xml. could be done after the xml -> list, but xpath is probably the easiest way to search for any occurrence of a spatialSamplingUnits/coverage node.)

jeanetteclark commented 1 year ago

Yeah I'm definitely willing to take a stab - I might be a bit slow as I work it in with other duties. I agree that xpath is probably easiest (and most reliable). I'll do a little digging to see what other areas of the schema might be problematic in this way too (hopefully this part isn't too big of a can of worms). While looking around for this I noticed a lot of other tucked away corners of the schema the use coverage element, and have no idea if this situation could arise elsewhere.