ropensci / EML

Ecological Metadata Language interface for R: synthesis and integration of heterogenous data
https://docs.ropensci.org/EML

eml_validate doesn't check all EML validity rules #244

Closed mbjones closed 5 years ago

mbjones commented 6 years ago

The eml_validate function only checks schema validity. The EML specification mandates several other validity requirements that are not covered by the XSD files but are enforced by the EMLParser. As a result, some EML documents created in R are invalid and get rejected by repositories even after passing the eml_validate check. To fix this, add the other validity checks to eml_validate so that it complies with the specification. Details follow. @maier-m may be willing to help with these changes.

The additional rules beyond schema validation are described in section 3.3, Reusable Content, of the EML spec: https://knb.ecoinformatics.org/#external//emlparser/docs/index.html#reusableContent I list them here for convenience:

It would be great to write a function that checks all of these rules as an XML document is being parsed, and then call it after the xml2::xml_validate call is made. Both checks must pass for the EML document to be considered valid.

The current EMLParser is slow in part because it tries to do these checks in memory by loading the XML document as a DOM and then querying the DOM for matches. A better algorithm is planned to fix the Java EMLParser (Issue https://github.com/NCEAS/eml/issues/1). In this approach, we would 1) use a SAX parser to parse the EML document, 2) record all id, reference, and element details in a data structure as they are encountered, and 3) once the whole document is parsed, compare the ids and references to check uniqueness and the other rules. The same approach could be implemented in R, but I'm not sure if xml2 supports SAX parsing. If not, the loaded XML document might be usable to directly query for rule checking. Let's discuss.
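
To make the three steps concrete, here is a minimal sketch in Python using the standard library's xml.sax. It is only an illustration of the single-pass idea, not the planned EMLParser change or an R implementation. It assumes ids live in an `id` attribute and references in a `<references>` element, and it covers only two checks (unique ids, and references that resolve to an id), not the full rule set in the spec. The names `IdRefCollector` and `check_ids_and_references` are hypothetical.

```python
import xml.sax


class IdRefCollector(xml.sax.ContentHandler):
    """Record every id attribute and every <references> value in one pass."""

    def __init__(self):
        super().__init__()
        self.ids = []          # id attribute values, in document order
        self.references = []   # text content of each <references> element
        self._in_references = False
        self._buffer = []

    def startElement(self, name, attrs):
        if "id" in attrs:
            self.ids.append(attrs["id"])
        if name == "references":
            self._in_references = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so buffer until the end tag.
        if self._in_references:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "references":
            self.references.append("".join(self._buffer).strip())
            self._in_references = False


def check_ids_and_references(xml_text):
    """Return a list of problems; an empty list means both checks passed."""
    handler = IdRefCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)

    problems = []
    seen = set()
    # The comparisons run only after the whole document has been parsed.
    for value in handler.ids:
        if value in seen:
            problems.append(f"duplicate id: {value}")
        seen.add(value)
    for value in handler.references:
        if value not in seen:
            problems.append(f"reference to undefined id: {value}")
    return problems
```

Because only the id and reference strings are kept, memory use grows with the number of ids and references rather than with the size of the document.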

This request originated from a discussion by our data team and recorded in https://github.com/NCEAS/datamgmt/issues/133

cboettig commented 6 years ago

Thanks, we should definitely do this. Right, xml2 doesn't support SAX / event parsing at this time (see https://github.com/r-lib/xml2/issues/10), though the underlying libxml2 does. The R package probably has other bottlenecks anyway, but can you point to some good large EML files to benchmark against?

The trivial way (which we actually had in EML 1.0) would be to just upload to your online Java validator, but local validation would be nicer.

Trying to distill this list a bit to think about how to test:

After that, I get a little more fuzzy. I get the idea that we cannot repeat an object that has an id (we should use references instead), but that's covered by ensuring the id is unique. It sounds like it would also be technically correct to have, say:

<creator>
  <individualName>
    <givenName>M</givenName>
    <surName>Jones</surName>
  </individualName>
</creator>

appear twice in the same EML document (e.g. as both the creator of the metadata and in the author list of some paper cited in a methods section)? Is that correct? It seems like it would be basically impossible to detect and enforce that provision, though: without an id we cannot be sure these are the same object (i.e. the same person), right? Of course, if it is repeated with an id, then the test for unique ids catches it.

Other things on the above list:

One thing you didn't mention -- isn't it necessary to make sure that an object with an id appears before it is referenced?

I think this raises some very interesting larger questions too. For one, the R package is probably liable to create duplicates where references should be used instead (though this can be elegantly solved in eml2).

But there's also a deeper question to my mind about the use of <references>, particularly because the desired behavior falls outside the scope of typical XML operations. For instance, it's not obvious that it would be wise to de-duplicate every occurrence of an author in a long list of references (clearly not a problem for the BibTeX-based format, since it doesn't use XML elements). Do most XSLT-based approaches for rendering, say, a web page of EML data handle resolving references well?

Lastly, I just want to note that, possibly in contrast to some XML tooling, I think JSON-LD really excels at this use of references. Duplicating objects with the same id is permitted in JSON-LD, but compacting or framing can be used to replace all but one of these with simple references, consistent with the EML rule of no duplicates. (Alternately, you can ask it to embed all referenced objects explicitly, which can be nicer for software dev, since things like contact.address and creator.address can both resolve out of the box without programming around the reference block.) If two objects have the same id, JSON-LD will simply merge properties. For instance, if both creator and contact have the same property for surName but contact also has a property for electronicMailAddress, compacting will give you a single creator object that also has the email address, and refer to the contact by the id.
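
As a rough illustration of the merging behavior described above, here is a sketch in Python with plain dictionaries. It does not use a JSON-LD library and is not the JSON-LD flattening, compacting, or framing algorithm; it only mimics the outcome for objects that share an `@id`. The function name `merge_by_id` and the email address are hypothetical.

```python
def merge_by_id(nodes):
    """Merge dicts that share an "@id".

    Returns (merged, refs): merged maps each id to one combined object,
    and refs holds one {"@id": ...} stub per input node, in input order.
    """
    merged = {}
    refs = []
    for node in nodes:
        node_id = node["@id"]
        target = merged.setdefault(node_id, {"@id": node_id})
        for key, value in node.items():
            if key == "@id":
                continue
            if key in target and target[key] != value:
                # Conflicting values: keep both rather than dropping one.
                existing = target[key] if isinstance(target[key], list) else [target[key]]
                if value not in existing:
                    existing.append(value)
                target[key] = existing
            else:
                target[key] = value
        refs.append({"@id": node_id})
    return merged, refs


# Hypothetical creator and contact describing the same person.
creator = {"@id": "p1", "surName": "Jones"}
contact = {"@id": "p1", "surName": "Jones",
           "electronicMailAddress": "jones@example.org"}
merged, refs = merge_by_id([creator, contact])
```

Here `merged["p1"]` carries both the surName and the email address, while creator and contact can each be represented by a stub that points at the shared id.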

mbjones commented 6 years ago

I think your analysis is right on, @cboettig. Last week we discussed clarifying these rules in the spec, so I filed https://github.com/NCEAS/eml/issues/306 to cover that. However, where you say:

One thing you didn't mention -- isn't it necessary to make sure that an object with an id appears before it is referenced?

That is not a requirement, and elements can and do get referenced before they are defined. This is one thing that makes validating them impossible within the XSD world, which has a similar feature in key/keyref pairs. The eml-dev archives have a series of threads on why we as a community decided not to use key/keyref. So, in our case, one must accumulate a list of all IDs and all references and then compare them. In the Java parser that accumulation is done via a DOM model, which is why it is so slow on large documents, and why we want to switch to using SAX and a much lighter-weight model. There is an example doc that is slow to process attached to https://github.com/NCEAS/eml/issues/1.
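
To show why order does not matter once everything is accumulated first, here is a small sketch in Python using xml.etree.ElementTree on a loaded document. It is an illustration only, not the Java parser's logic; it assumes ids are `id` attributes and references are `<references>` elements, and the function name `unresolved_references` is hypothetical.

```python
import xml.etree.ElementTree as ET


def unresolved_references(xml_text):
    """Return reference values that match no id anywhere in the document."""
    root = ET.fromstring(xml_text)
    # First accumulate every id in the whole tree, regardless of position.
    ids = {el.get("id") for el in root.iter() if el.get("id") is not None}
    # Then compare each <references> value against the complete set.
    refs = [(el.text or "").strip() for el in root.iter("references")]
    return [ref for ref in refs if ref not in ids]


# The contact references p1 before the creator that defines it appears.
forward = (
    "<eml><contact><references>p1</references></contact>"
    '<creator id="p1"><surName>Jones</surName></creator></eml>'
)
```

Calling `unresolved_references(forward)` returns an empty list, because the comparison runs only after all ids have been collected.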

The XSLT we created for EML and that we use in various repositories supports resolution of references, but it is not straightforward, and I've seen other sites that just ignore references. It's a useful feature, but complex enough that some implementations don't deal well with it.

cboettig commented 5 years ago

Migrating this over to emld where the eml_validate code actually lives now (though it's re-exported here).

mbjones commented 5 years ago

Note validation rules have been written up here: https://github.com/NCEAS/eml/blob/BRANCH_EML_2_2/docs/eml-validation-refs.md

And that will be included in the next EML release (2.2.0).