ropensci / emld

:package: JSON-LD representation of EML
https://docs.ropensci.org/emld
Other
13 stars 6 forks source link

validate_units edge case #34

Closed amoeba closed 5 years ago

amoeba commented 5 years ago

One of our EML tools generated some seemingly valid (though unusual) EML but validate_units didn't think so. I think the document is technically valid but validation fails:

> emld:::eml_validate("~/Downloads/science_metadata (1).xml")
[1] FALSE
attr(,"errors")
[1] "not all 'custom units are defined."
Warning message:
In validate_units(doc, encoding = encoding) :
  Document is invalid. Found the following errors:
 not all 'custom units are defined.

But the XML has:

...
<metadata>
  <stmml:unitList xmlns:stmml="http://www.xml-cml.org/schema/stmml_1.1">
    <unit id="micromolesPerGram" abbreviation="umol/g" name="micromolesPerGram" parentSI="molesPerGram" unitType="amountOfSubstanceWeight" multiplierToSI="1e-06">
...

Which looks fine but you can see the mixing of namespaces there. The unitList elements uses a local namespace and the unit element uses the global namespace. The XPath validate_units uses to extract the custom units doesn't like this situation:

> c(
+     xml2::xml_attr(xml2::xml_find_all(
+         doc,
+         "//stmml:unitList/stmml:unit"
+     ), "id"),
+     xml2::xml_attr(
+         xml2::xml_find_all(doc, "//unitList/unit"),
+         "id"
+     )
+ )
character(0)

Two alternatives that would work in this case are:

xml2::xml_find_all( doc, "//stmml:unitList/unit")
# or
xml2::xml_find_all(doc, "//*[local-name()='unitList']/*[local-name()='unit']")

Questions I have and could use some feedback on are:

Example EML (science_metadata (1).xml above) ```xml title creator contact someentity ... ... micromolesPerGram real micromoles per gram ```
devtools::session_info() ``` > devtools::session_info() ─ Session info ─────────────────────────────────────────────────────────────────────────────────────── setting value version R version 3.5.2 (2018-12-20) os macOS Mojave 10.14.4 system x86_64, darwin15.6.0 ui RStudio language (EN) collate en_US.UTF-8 ctype en_US.UTF-8 tz America/Juneau date 2019-03-04 ─ Packages ─────────────────────────────────────────────────────────────────────────────────────────── package * version date lib source assertthat 0.2.0 2017-04-11 [1] CRAN (R 3.5.0) backports 1.1.3 2018-12-14 [1] CRAN (R 3.5.0) callr 3.1.1 2018-12-21 [1] CRAN (R 3.5.0) cli 1.0.1 2018-09-25 [1] CRAN (R 3.5.0) crayon 1.3.4 2017-09-16 [1] CRAN (R 3.5.0) curl 3.3 2019-01-10 [1] CRAN (R 3.5.2) desc 1.2.0 2018-05-01 [1] CRAN (R 3.5.0) devtools 2.0.1 2018-10-26 [1] CRAN (R 3.5.2) digest 0.6.18 2018-10-10 [1] CRAN (R 3.5.0) emld * 0.2.0 2019-02-27 [1] local fs 1.2.6 2018-08-23 [1] CRAN (R 3.5.0) glue 1.3.0 2018-07-17 [1] CRAN (R 3.5.0) jsonld 2.1 2019-02-05 [1] CRAN (R 3.5.2) jsonlite 1.6 2018-12-07 [1] CRAN (R 3.5.0) magrittr 1.5 2014-11-22 [1] CRAN (R 3.5.0) memoise 1.1.0 2017-04-21 [1] CRAN (R 3.5.0) pkgbuild 1.0.2 2018-10-16 [1] CRAN (R 3.5.0) pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.5.0) prettyunits 1.0.2 2015-07-13 [1] CRAN (R 3.5.0) processx 3.2.1 2018-12-05 [1] CRAN (R 3.5.0) ps 1.3.0 2018-12-21 [1] CRAN (R 3.5.0) R6 2.4.0 2019-02-14 [1] CRAN (R 3.5.2) Rcpp 1.0.0 2018-11-07 [1] CRAN (R 3.5.0) remotes 2.0.2 2018-10-30 [1] CRAN (R 3.5.0) rlang 0.3.1 2019-01-08 [1] CRAN (R 3.5.2) rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.5.0) rstudioapi 0.9.0 2019-01-09 [1] CRAN (R 3.5.2) sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.5.0) testthat 2.0.1 2018-10-13 [1] CRAN (R 3.5.0) usethis 1.4.0 2018-08-14 [1] CRAN (R 3.5.0) V8 2.0 2019-02-07 [1] CRAN (R 3.5.2) withr 2.1.2 2018-03-15 [1] CRAN (R 3.5.0) xml2 1.2.0 2018-01-24 [1] CRAN (R 3.5.0) yaml 2.2.0 2018-07-25 [1] CRAN (R 3.5.0) [1] /Library/Frameworks/R.framework/Versions/3.5/Resources/library ```
cboettig commented 5 years ago

@amoeba Thanks for delving into this; much appreciated. I'd definitely welcome a PR against the validate_units() routine to better handle the namespaces. I don't have a strong intuition for what approach is better; you emphasize your alternatives would work in the this particular case; presumably they'd also cover the current case?

Is this case included in the test suite of valid files? (I thought I'd covered all those, but maybe this one is newly added?) The unit tests currently haven't implemented checks over all the invalid_EML in the test suite, but I think they do cover the valid ones.

Regarding validation of the metadata element, I haven't quite wrapped my head around what the quoted section means with respect to that conditional validation -- is it saying we'd do schema validation whenever some embedded XML metadata declares a schema (using a URL we can access?) Does this mean that validity of an EML file is conditional on the network access status/ability of processing application?

amoeba commented 5 years ago

you emphasize your alternatives would work in the this particular case; presumably they'd also cover the current case?

yeah

Is this case included in the test suite of valid files?

I don't think so. It's an odd way to construct a custom unit. I think that hard-coding the namespace prefix name of 'stmml' is too brittle. For example, it's valid to define the stmml namespace as xmlns:foo="http://www.xml-cml.org/schema/stmml_1.1" which would mean the appropriate XPath to grab the unitLists would be //foo:unitList. Given that, I'd probably go with a local-name based approach which disregards the namespace prefix names.

Does this mean that validity of an EML file is conditional on the network access status/ability of processing application?

That's how I read it. I think it's pretty problematic. The customUnit definition in the schema says:

This field identifies the unit of measurement for this attribute. It is a choice of either a standard unit, or a custom unit. If it is a custom unit, the definition of the unit must be provided in the document using the STMML syntax, and the name provided in the customUnit element must reference the id of its associated STMML definition precisely. For further information on STMML (http://www.xml-cml.org/stmml/) or see stmml.xsd which is included with the EML 2.0 distribution for details.

I read that as, customUnits must be described in a schema-valid STMML sub-tree. I think it may also be fair to say that the customUnit defs should be valid according to the stmml.xsd we ship with the release. That means that EML 2.1.1 and earlier get validated with STMML 1.1 and 2.2.0 and get validated with STMML 1.2. I don't think this is workable because EML has the backwards compatibility goal that, if you merely change the namespace to a more recent version, your document should validate. An alternative would be to ship multiple versions of the STMML XSDs.

cboettig commented 5 years ago

Thanks. Yeah, that sounds like a convincing case for the local-name() approach -- want to send a PR for that? IIUC, this avoids hardwiring the stmml version? (i.e. I believe xml2's xpath expression still understands prefixes in the context of the namespace and not literally as hardwired prefixes, at least if we pass the namespace argument appropriately...)

Not sure there's anything actionable on the details of separate STMML validation. From what I understand, it's not clear that STMML is actively used beyond EML for some time now, and it's not obvious to me that it shouldn't be rolled in to EML rather than handled separately. (as we've seen, the real challenges with EML units and customUnits in particular don't seem to be those that would be addressed by schema validation anyway).

amoeba commented 5 years ago

I agree this is probably good enough for now. Thanks!