Closed amoeba closed 5 years ago
@amoeba Thanks for delving into this; much appreciated. I'd definitely welcome a PR against the validate_units()
routine to better handle the namespaces. I don't have a strong intuition for what approach is better; you emphasize your alternatives would work in the this particular case; presumably they'd also cover the current case?
Is this case included in the test suite of valid files? (I thought I'd covered all those, but maybe this one is newly added?) The unit tests currently haven't implemented checks over all the invalid_EML in the test suite, but I think they do cover the valid ones.
Regarding validation of the metadata
element, I haven't quite wrapped my head around what the quoted section means with respect to that conditional validation -- is it saying we'd do schema validation whenever some embedded XML metadata declares a schema (using a URL we can access?) Does this mean that validity of an EML file is conditional on the network access status/ability of processing application?
you emphasize your alternatives would work in the this particular case; presumably they'd also cover the current case?
yeah
Is this case included in the test suite of valid files?
I don't think so. It's an odd way to construct a custom unit. I think that hard-coding the namespace prefix name of 'stmml' is too brittle. For example, it's valid to define the stmml
namespace as xmlns:foo="http://www.xml-cml.org/schema/stmml_1.1"
which would mean the appropriate XPath to grab the unitLists would be //foo:unitList
. Given that, I'd probably go with a local-name
based approach which disregards the namespace prefix names.
Does this mean that validity of an EML file is conditional on the network access status/ability of processing application?
That's how I read it. I think it's pretty problematic. The customUnit
definition in the schema says:
This field identifies the unit of measurement for this attribute. It is a choice of either a standard unit, or a custom unit. If it is a custom unit, the definition of the unit must be provided in the document using the STMML syntax, and the name provided in the customUnit element must reference the id of its associated STMML definition precisely. For further information on STMML (http://www.xml-cml.org/stmml/) or see stmml.xsd which is included with the EML 2.0 distribution for details.
I read that as, customUnits must be described in a schema-valid STMML sub-tree. I think it may also be fair to say that the customUnit
defs should be valid according to the stmml.xsd
we ship with the release. That means that EML 2.1.1 and earlier get validated with STMML 1.1 and 2.2.0 and get validated with STMML 1.2. I don't think this is workable because EML has the backwards compatibility goal that, if you merely change the namespace to a more recent version, your document should validate. An alternative would be to ship multiple versions of the STMML XSDs.
Thanks. Yeah, that sounds like a convincing case for the local-name()
approach -- want to send a PR for that? IIUC, this avoids hardwiring the stmml version? (i.e. I believe xml2
's xpath
expression still understands prefixes in the context of the namespace and not literally as hardwired prefixes, at least if we pass the namespace argument appropriately...)
Not sure there's anything actionable on the details of separate STMML validation. From what I understand, it's not clear that STMML is actively used beyond EML for some time now, and it's not obvious to me that it shouldn't be rolled in to EML rather than handled separately. (as we've seen, the real challenges with EML units and customUnits in particular don't seem to be those that would be addressed by schema validation anyway).
I agree this is probably good enough for now. Thanks!
One of our EML tools generated some seemingly valid (though unusual) EML but
validate_units
didn't think so. I think the document is technically valid but validation fails:But the XML has:
Which looks fine but you can see the mixing of namespaces there. The
unitList
elements uses a local namespace and theunit
element uses the global namespace. The XPathvalidate_units
uses to extract the custom units doesn't like this situation:Two alternatives that would work in this case are:
Questions I have and could use some feedback on are:
Should
eml_validate
be enforcing schema validity as per the schema? Themetadata
element says:I think this is a bit of an odd rule but I think the intent was to be helpful to clients that want to be schema valid which makes sense.
Example EML (science_metadata (1).xml above)
```xmldevtools::session_info()
``` > devtools::session_info() ─ Session info ─────────────────────────────────────────────────────────────────────────────────────── setting value version R version 3.5.2 (2018-12-20) os macOS Mojave 10.14.4 system x86_64, darwin15.6.0 ui RStudio language (EN) collate en_US.UTF-8 ctype en_US.UTF-8 tz America/Juneau date 2019-03-04 ─ Packages ─────────────────────────────────────────────────────────────────────────────────────────── package * version date lib source assertthat 0.2.0 2017-04-11 [1] CRAN (R 3.5.0) backports 1.1.3 2018-12-14 [1] CRAN (R 3.5.0) callr 3.1.1 2018-12-21 [1] CRAN (R 3.5.0) cli 1.0.1 2018-09-25 [1] CRAN (R 3.5.0) crayon 1.3.4 2017-09-16 [1] CRAN (R 3.5.0) curl 3.3 2019-01-10 [1] CRAN (R 3.5.2) desc 1.2.0 2018-05-01 [1] CRAN (R 3.5.0) devtools 2.0.1 2018-10-26 [1] CRAN (R 3.5.2) digest 0.6.18 2018-10-10 [1] CRAN (R 3.5.0) emld * 0.2.0 2019-02-27 [1] local fs 1.2.6 2018-08-23 [1] CRAN (R 3.5.0) glue 1.3.0 2018-07-17 [1] CRAN (R 3.5.0) jsonld 2.1 2019-02-05 [1] CRAN (R 3.5.2) jsonlite 1.6 2018-12-07 [1] CRAN (R 3.5.0) magrittr 1.5 2014-11-22 [1] CRAN (R 3.5.0) memoise 1.1.0 2017-04-21 [1] CRAN (R 3.5.0) pkgbuild 1.0.2 2018-10-16 [1] CRAN (R 3.5.0) pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.5.0) prettyunits 1.0.2 2015-07-13 [1] CRAN (R 3.5.0) processx 3.2.1 2018-12-05 [1] CRAN (R 3.5.0) ps 1.3.0 2018-12-21 [1] CRAN (R 3.5.0) R6 2.4.0 2019-02-14 [1] CRAN (R 3.5.2) Rcpp 1.0.0 2018-11-07 [1] CRAN (R 3.5.0) remotes 2.0.2 2018-10-30 [1] CRAN (R 3.5.0) rlang 0.3.1 2019-01-08 [1] CRAN (R 3.5.2) rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.5.0) rstudioapi 0.9.0 2019-01-09 [1] CRAN (R 3.5.2) sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.5.0) testthat 2.0.1 2018-10-13 [1] CRAN (R 3.5.0) usethis 1.4.0 2018-08-14 [1] CRAN (R 3.5.0) V8 2.0 2019-02-07 [1] CRAN (R 3.5.2) withr 2.1.2 2018-03-15 [1] CRAN (R 3.5.0) xml2 1.2.0 2018-01-24 [1] CRAN (R 3.5.0) yaml 2.2.0 2018-07-25 [1] CRAN (R 3.5.0) [1] /Library/Frameworks/R.framework/Versions/3.5/Resources/library ```