eml_validate doesn't check all EML validity rules

mbjones commented 6 years ago

The eml_validate function only checks schema validity. The EML specification mandates several other validity requirements which are not covered by the XSD files, but which are enforced by the EMLParser. As a result, there are EML documents created in R which are invalid and get rejected by repositories even after passing the eml_validate check. To fix this, add the other validity checks to eml_validate so it is compliant with the specification. Details follow. @maier-m may be willing to help with these changes.

The additional rules beyond schema validation are written in section 3.3 Reusable Content in the EML spec: https://knb.ecoinformatics.org/#external//emlparser/docs/index.html#reusableContent I list them here for convenience:

An ID is required on the eml root element (packageId)
IDs are optional on all other elements.
If an ID is not provided, that content must be interpreted as representing a distinct object.
If an ID is provided for content then that content is distinct from all other content except for that content that references its ID.
If a user wants to reuse content to indicate the repetition of an object, a reference must be used. Two identical ids with the same system attribute cannot exist in a single document.
- document scope is defined as identifiers unique only to a single instance document (if a document does not have a system attribute or if scope is set to 'document' then all IDs are defined as distinct content).
- system scope is defined as identifiers unique to an entire data management system (if two documents share a system string, then any IDs in those two documents that are identical refer to the same object).
If an element references another element, it must not have an ID itself. The system attribute must have the same value in both the target and referencing elements or it must be absent in both.
All EML packages must have the 'eml' module as the root.
The system and scope attributes are always optional except at the 'eml' module, where the scope attribute is fixed as 'system'. The scope attribute defaults to 'document' for all other modules.

What would be great is if we wrote a function that could check all of these issues as an XML document is being parsed, and then call that after the xml2::xml_validate call is made. Both must be valid for the EML document to be considered valid.

The current EMLParser is slow in part because it tries to do these checks in memory by loading the XML document as a DOM, and then querying the DOM for matches. A better algorithm is planned to fix the Java EMLParser (Issue https://github.com/NCEAS/eml/issues/1). In this approach, we would 1) use a SAX parser to parse the EML document, 2) record all id, reference, and element details in a data structure as they are encountered, and 3) once the whole document is parsed, do the id/ref comparisons to check uniqueness and compliance with the rules. The same approach could be implemented in R, but I'm not sure if xml2 supports SAX parsing. If not, the loaded XML document might be usable to directly query for rule checking. Let's discuss.
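To make the three steps concrete, here is a minimal sketch of the event-based approach. It is written in Python with the standard-library `xml.sax` module purely for illustration; it is not the planned Java implementation and not R/xml2 code. The names `IdRefCollector`, `collect`, and `duplicate_ids` are hypothetical, and the sketch assumes that ids live in an `id` attribute and that a reference is the text of an unprefixed `<references>` element.

```python
import xml.sax
from collections import Counter


class IdRefCollector(xml.sax.ContentHandler):
    """Step 2: record ids and references as SAX events arrive (no DOM is built)."""

    def __init__(self):
        super().__init__()
        self.ids = []         # (id, system, element name) for each id attribute
        self.references = []  # (target id, system) for each <references> element
        self._in_references = False
        self._text = []
        self._ref_system = None

    def startElement(self, name, attrs):
        if name == "references":
            self._in_references = True
            self._text = []
            self._ref_system = attrs.get("system")
        elif "id" in attrs:
            self.ids.append((attrs["id"], attrs.get("system"), name))

    def characters(self, content):
        # Text may arrive in several chunks, so buffer it until the end tag.
        if self._in_references:
            self._text.append(content)

    def endElement(self, name):
        if name == "references":
            target = "".join(self._text).strip()
            self.references.append((target, self._ref_system))
            self._in_references = False


def collect(xml_bytes):
    """Step 1: stream the document through the handler."""
    handler = IdRefCollector()
    xml.sax.parseString(xml_bytes, handler)
    return handler


def duplicate_ids(handler):
    """Step 3 (one rule only): id+system pairs that occur more than once."""
    counts = Counter((i, s) for i, s, _ in handler.ids)
    return [key for key, n in counts.items() if n > 1]
```

The handler keeps only small tuples rather than the whole tree, which is the point of the lighter-weight model; the remaining rules would be further functions over the same `ids` and `references` lists.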

This request originated from a discussion by our data team and recorded in https://github.com/NCEAS/datamgmt/issues/133

cboettig commented 6 years ago

Thanks, we should definitely do this. Right, xml2 doesn't support SAX / event parsing at this time (see https://github.com/r-lib/xml2/issues/10), though the underlying libxml2 does. The R package probably has other bottlenecks anyway, but can you point to some good large EML files for benchmarking against?

The trivial way (which we actually had in EML 1.0) would be to just upload to your online Java validator, but a local validation would be nicer.

Trying to distill this list a bit to think about how to test:

[ ] combination of id+system attributes on any element should be unique.
[ ] an element cannot have both child element <references> and an attribute called id.
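As a rough illustration of how these two checklist items could be tested against an already-loaded document, here is a small Python sketch using the standard-library `xml.etree.ElementTree`. It is not the eml_validate implementation; the function name `checklist_violations` is hypothetical, and it assumes `<references>` appears as an unprefixed direct child element.

```python
import xml.etree.ElementTree as ET


def checklist_violations(xml_text):
    """Return a list of human-readable problems for the two checklist items."""
    root = ET.fromstring(xml_text)
    problems = []
    seen = set()  # (id, system) pairs encountered so far
    for el in root.iter():
        el_id = el.get("id")
        if el_id is None:
            continue
        # Item 1: the combination of id + system must be unique.
        key = (el_id, el.get("system"))
        if key in seen:
            problems.append(f"duplicate id '{el_id}' on <{el.tag}>")
        seen.add(key)
        # Item 2: an element with an id must not also have a <references> child.
        if el.find("references") is not None:
            problems.append(f"<{el.tag}> has both an id and a <references> child")
    return problems
```

A test suite could then pair small valid and invalid fixture documents with the expected list of problems.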

After that, I get a little more fuzzy. I get the idea that we cannot repeat an object that has an id (we should use references instead) but that's covered by ensuring the id is unique. It sounds like it would also be technically correct to have, say:

<creator>
  <individualName>
    <givenName>M</givenName>
    <surName>Jones</surName>
  </individualName>
</creator>

appear twice in the same EML document (e.g. as both creator of metadata and in the author list of some paper cited in methods sections, etc)? Is that correct? It seems like it would be basically impossible to detect and enforce that provision though -- without an id we cannot be sure these are the same object (i.e. same person), right? Of course if it is repeated with an id then the test for unique ids catches it.

other things on the above list:

packageId is required on eml -- I believe the schema enforces this? Seems like saying attribute x is required for element y is a standard schema thing, right?
all other ids are optional -- doesn't seem like this imposes any additional test

One thing you didn't mention -- isn't it necessary to make sure that an object with an id appears before it is referenced?

I think this raises some very interesting larger questions though too. For one, the R package is probably a liability for creating duplicates when references should be used instead (though this can be elegantly solved in eml2).

But there's also a deeper question to my mind about the use of <references>, particularly because the desired behavior falls outside the scope of typical XML operations. For instance, it's not obvious that it would be wise to de-duplicate any occurrence of an author in a long list of references (clearly not a problem for the bibtex based format since it's not XML elements). Do most XSLT-based approaches for rendering a web page, say, of EML data handle resolving references well?

Lastly, I just want to note that, possibly in contrast to some XML tooling, I think JSON-LD really excels at this use of references. Duplicating objects with the same id is permitted in JSON-LD, but compacting or framing can be used to replace all but one of these with simple references, consistent with the EML rule of no duplicates (alternately, you can ask it to embed all referenced objects explicitly, which can be nicer for software dev, since things like contact.address and creator.address can both resolve out-of-the-box without programming around the reference block). If two objects have the same id, JSON-LD will simply merge properties. For instance, if both creator and contact have the same property for surName but contact also has a property for electronicMailAddress, compacting will give you a single creator object that also has the email address, and refer to the contact by the id.

mbjones commented 6 years ago

I think your analysis is right on @cboettig. Last week we discussed clarifying these rules in the spec, so I filed https://github.com/NCEAS/eml/issues/306 to cover that. However, where you say:

One thing you didn't mention -- isn't it necessary to make sure that an object with an id appears before it is referenced?

That is not a requirement, and elements can and do get referenced before they are defined. That is one thing that makes validating them impossible within the XSD world, which has a similar feature with key/keyref pairs. The eml-dev archives have a series of threads on why we as a community decided not to use key/keyref. So, in our case, one must accumulate a list of all IDs and all references and then compare them. In the Java parser that accumulation is done via a DOM model, which is why it is so slow on large documents, and why we want to switch to using SAX and a much lighter-weight model. There is an example doc that is slow to process attached to https://github.com/NCEAS/eml/issues/1.
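The accumulate-then-compare step can be sketched as follows, again in Python with only the standard library and only as an illustration. The function name `unresolved_references` is hypothetical. The sketch makes two assumptions that should be checked against the spec: that a reference with no matching id is an error, and that the system value to compare is the one on the `<references>` element itself. The tree walk here is only a convenient way to gather the two collections; a streaming parser could fill them just as well.

```python
import xml.etree.ElementTree as ET


def unresolved_references(xml_text):
    """Compare all references against all ids, independent of document order."""
    root = ET.fromstring(xml_text)

    # Pass 1: accumulate every id and every reference before judging anything.
    ids = {}   # id -> system attribute of the element that defines it
    refs = []  # (target id, system attribute) per <references> element
    for el in root.iter():
        if el.tag == "references":
            refs.append(((el.text or "").strip(), el.get("system")))
        elif el.get("id") is not None:
            ids.setdefault(el.get("id"), el.get("system"))

    # Pass 2: only now compare, so a reference may precede its definition.
    problems = []
    for target, system in refs:
        if target not in ids:
            problems.append(f"reference to unknown id '{target}'")
        elif ids[target] != system:
            problems.append(f"system mismatch for reference to '{target}'")
    return problems
```

Because nothing is checked until both collections are complete, a `<references>` that appears before the element carrying the id is accepted, which matches the behavior described above.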

The XSLT we created for EML and that we use in various repositories supports resolution of references, but it is not straightforward and I've seen other sites that just ignore references. It's a useful feature, but complex enough that some implementations don't deal well with it.

cboettig commented 5 years ago

Migrating this over to emld where the eml_validate code actually lives now (though it's re-exported here).

mbjones commented 5 years ago

Note validation rules have been written up here: https://github.com/NCEAS/eml/blob/BRANCH_EML_2_2/docs/eml-validation-refs.md

And that will be included in the next EML release (2.2.0).

ropensci / EML

eml_validate doesn't check all EML validity rules #244