Closed mbjones closed 5 years ago
Thanks, we should definitely do this. Right, `xml2` doesn't support SAX / event parsing at this time (see https://github.com/r-lib/xml2/issues/10), though the underlying libxml2 does. The R package probably has other bottlenecks anyway, but can you point to some good large EML files to benchmark against?
The trivial way (which we actually had in EML 1.0) would be to just upload to your online Java validator, but local validation would be nicer.
Trying to distill this list a bit to think about how to test:

- `id` + `system` attributes on any element should be unique.
- An element should not have both a `<references>` child and an attribute called `id`.

After that, I get a little more fuzzy. I get the idea that we cannot repeat an object that has an `id` (we should use `references` instead), but that's covered by ensuring the id is unique. It sounds like it would also be technically correct to have, say:
<creator>
<individualName>
<givenName>M</givenName>
<surName>Jones</surName>
</individualName>
</creator>
appear twice in the same EML document (e.g. as both creator of the metadata and in the author list of some paper cited in a methods section)? Is that correct? It seems like it would be basically impossible to detect and enforce that provision though: without an `id` we cannot be sure these are the same object (i.e. the same person), right? Of course, if it is repeated with an `id`, then the test for unique ids catches it.
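The two checks distilled above can be sketched against an already-loaded document. This is only a rough illustration in Python using the standard library's `xml.etree.ElementTree`, not how `eml_validate` or the EMLParser works; the function name `check_ids` and the sample document are made up for the example, and treating a missing `system` attribute as `None` is an assumption.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def check_ids(xml_text):
    """Return a list of human-readable problems found in the document."""
    root = ET.fromstring(xml_text)
    problems = []

    # Count every (system, id) pair; system is None when the attribute is absent.
    pairs = Counter(
        (el.get("system"), el.get("id"))
        for el in root.iter()
        if el.get("id") is not None
    )
    for (system, id_), count in pairs.items():
        if count > 1:
            problems.append(f"id {id_!r} (system={system!r}) appears {count} times")

    # An element that points elsewhere via <references> should not carry its own id.
    for el in root.iter():
        if el.get("id") is not None and el.find("references") is not None:
            problems.append(f"<{el.tag}> has both an id and a <references> child")

    return problems


# Hypothetical sample: the id "jones" is reused, and <contact> mixes id with <references>.
doc = """<eml>
  <creator id="jones"><individualName><surName>Jones</surName></individualName></creator>
  <contact id="jones"><references>jones</references></contact>
</eml>"""

for problem in check_ids(doc):
    print(problem)
```

Note that two identical `<creator>` blocks without an `id` produce no problems here, which matches the point above: without an `id` there is nothing to compare.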
Other things on the above list:

- `packageId` is required on `eml`. I believe the schema enforces this? Saying attribute x is required for element y seems like a standard schema thing, right?

One thing you didn't mention: isn't it necessary to make sure that an object with an `id` appears before it is referenced?
I think this raises some very interesting larger questions too. For one, the R package is probably a liability for creating duplicates when references should be used instead (though this can be elegantly solved in `eml2`).
But there's also a deeper question to my mind about the use of `<references>`, particularly because the desired behavior falls outside the scope of typical XML operations. For instance, it's not obvious that it would be wise to de-duplicate every occurrence of an author in a long list of references (clearly not a problem for the BibTeX-based format, since those are not XML elements). Do most XSLT-based approaches for rendering, say, a web page from EML data handle resolving references well?
Lastly, I just want to note that, possibly in contrast to some XML tooling, I think JSON-LD really excels at this use of references. Duplicating objects with the same id is permitted in JSON-LD, but compacting or framing can be used to replace all but one of these with simple references, consistent with the EML rule of no duplicates. Alternatively, you can ask it to embed all referenced objects explicitly, which can be nicer for software development, since things like `contact.address` and `creator.address` can both resolve out of the box without programming around the reference block.
If two objects have the same id, JSON-LD will simply merge their properties. For instance, if both `creator` and `contact` have the same property for `surName` but `contact` also has a property for `electronicMailAddress`, compacting will give you a single `creator` object that also has the email address, and will refer to the contact by the id.
I think your analysis is right on @cboettig. Last week we discussed clarifying these rules in the spec, so I filed https://github.com/NCEAS/eml/issues/306 to cover that. However, where you say:
> One thing you didn't mention -- isn't it necessary to make sure that an object with an id appears before it is referenced?
That is not a requirement; elements can and do get referenced before they are defined. That is one thing that makes validating them impossible within the XSD world, which has a similar feature in key/keyref pairs. The eml-dev archives have a series of threads on why we as a community decided not to use key/keyref. So, in our case, one must accumulate a list of all IDs and all references and then compare them. In the Java parser that accumulation is done via a DOM model, which is why it is so slow on large documents, and why we want to switch to SAX and a much lighter-weight model. An example document that is slow to process is attached to https://github.com/NCEAS/eml/issues/1.
The XSLT we created for EML, which we use in various repositories, supports resolution of references, but it is not straightforward, and I've seen other sites that just ignore references. It's a useful feature, but complex enough that some implementations don't deal well with it.
Migrating this over to `emld`, where the `eml_validate` code actually lives now (though it's re-exported here).
Note validation rules have been written up here: https://github.com/NCEAS/eml/blob/BRANCH_EML_2_2/docs/eml-validation-refs.md
And that will be included in the next EML release (2.2.0).
The `eml_validate` function only checks schema validity. The EML specification mandates several other validity requirements which are not covered by the XSD files, but which are enforced by the EMLParser. As a result, there are EML documents created in R which are invalid and get rejected by repositories even after passing the `eml_validate` check. To fix this, add the other validity checks to `eml_validate` so it is compliant with the specification. Details follow. @maier-m may be willing to help with these changes.

The additional rules beyond schema validation are written in section 3.3, Reusable Content, in the EML spec: https://knb.ecoinformatics.org/#external//emlparser/docs/index.html#reusableContent

I list them here for convenience:
- Two identical ids with the same `system` attribute cannot exist in a single document.
- `document` scope is defined as identifiers unique only to a single instance document (if a document does not have a system attribute or if scope is set to 'document' then all IDs are defined as distinct content).
- `system` scope is defined as identifiers unique to an entire data management system (if two documents share a system string, then any IDs in those two documents that are identical refer to the same object).

What would be great is if we wrote a function that could check all of these issues as an XML document is being parsed, and then call that after the `xml2::xml_validate` call is made. Both checks must pass for the EML document to be considered valid.

The current EMLParser is slow in part because it tries to do these checks in memory by loading the XML document as a DOM and then querying the DOM for matches. A better algorithm is planned to fix the Java EMLParser (issue https://github.com/NCEAS/eml/issues/1). In this approach, we would 1) use a SAX parser to parse the EML document, 2) record all `id`, `reference`, and element details in a data structure as they are encountered, and 3) once the whole document is parsed, do the id/ref comparisons for uniqueness and for following the rules. The same approach could be implemented in R, but I'm not sure if xml2 supports SAX parsing. If not, the loaded XML document might be usable to directly query for rule checking. Let's discuss.

This request originated from a discussion by our data team, recorded in https://github.com/NCEAS/datamgmt/issues/133
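
The three-step SAX approach described above could look roughly like the following. This is a minimal sketch using Python's standard `xml.sax` module, chosen only because it ships an event parser; it is not the planned Java implementation, the names `IdRefCollector` and `check_references` are invented for the example, and it ignores `system`/scope handling entirely.

```python
import xml.sax


class IdRefCollector(xml.sax.ContentHandler):
    """Steps 1 and 2: stream through the document, recording ids and references."""

    def __init__(self):
        super().__init__()
        self.ids = []          # every id attribute, in document order
        self.references = []   # text content of every <references> element
        self._buffer = None    # collects character data while inside <references>

    def startElement(self, name, attrs):
        if "id" in attrs:
            self.ids.append(attrs["id"])
        if name == "references":
            self._buffer = []

    def characters(self, content):
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "references" and self._buffer is not None:
            self.references.append("".join(self._buffer).strip())
            self._buffer = None


def check_references(xml_text):
    """Step 3: after parsing, compare the collected ids and references."""
    handler = IdRefCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)

    seen, duplicates = set(), []
    for id_ in handler.ids:
        if id_ in seen and id_ not in duplicates:
            duplicates.append(id_)
        seen.add(id_)

    # A reference is dangling if no element anywhere in the document has that id.
    dangling = [ref for ref in handler.references if ref not in seen]
    return duplicates, dangling


# Hypothetical sample: "jones" is referenced before it is defined; "smith" is never defined.
doc = """<eml>
  <contact><references>jones</references></contact>
  <creator id="jones"><individualName><surName>Jones</surName></individualName></creator>
  <publisher><references>smith</references></publisher>
</eml>"""

print(check_references(doc))
```

Because the comparison only happens after the whole document has been read, the forward reference to `jones` is accepted and only `smith` is reported as dangling. The handler keeps two flat lists instead of a DOM, which is the lighter-weight model the approach is after.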