ropensci / emld

:package: JSON-LD representation of EML
https://docs.ropensci.org/emld
Other
13 stars 6 forks source link

Remove dependency of eml_validate on schemaLocation #53

Closed amoeba closed 4 years ago

amoeba commented 4 years ago

Stemming from discussion in https://github.com/ropensci/emld/issues/44, we wanted to change the behavior of eml_validate to no longer validate by guessing the schema to validate by parsing the xsi:schemaLocation value on the root. Now, eml_validate determines the schema file to validate with in a more XML-aware way using a combination of the full QName of the root element and any namespaces defined on the root.

See eml_validate for two new helpers: find_real_root_name and guess_root_schema. These are really workarounds for a limitation of xml2 that may be removed in the future. Basically, xml2 can't give us the full QName of the root element so we can't know which namespace the root is from. To work around thsi, we parse the document with regular expressions (hacky but not horrible) to find the QName and go from there.

Closes #52

amoeba commented 4 years ago

As a note, I ran R CMD CHECK on emld and all is good. I also ran the tests on EML to make sure we weren't breaking anything there.

cboettig commented 4 years ago

:clap: :clap: :tada:

Thanks for this @amoeba. Now I recall running into the QName problem when trying to get validation to work initially. Still surprised this is missing from the xml2 bindings but this solution looks reasonable to me!

Minor thing, but can you just maybe add a test file that doesn't specify a schemaLocation? I don't think we had one yet and I think such a file would have caused a test failure prior to this patch, so it would be good to include.

It's probably a subject for a separate PR and maybe not mission critical enough, but I think that with this fix some of the logic in test-validate.R round trip checks could be simplified. Most importantly, I think there are some test files that had been skipped for wrongly throwing validation errors, and it would be good to get them un-skipped. I think the first step there is probably extending the test-validator code to confirm that the validator is correctly scoring all valid and invalid test files (without doing any of the parsing/serializing round trip stuff of test-validate.R, which is really a test of those read/write methods and not intended to be a test of validation routines itself).

cboettig commented 4 years ago

@amoeba also can you maybe give a concise statement of the issue into NEWS.md? Of course can link the PR and related issues.

amoeba commented 4 years ago

Minor thing, but can you just maybe add a test file that doesn't specify a schemaLocation? I don't think we had one yet and I think such a file would have caused a test failure prior to this patch, so it would be good to include.

Sure thing. Good call on this: The validator still works but the roundtrip test breaks. Should be a quick fix.

@amoeba also can you maybe give a concise statement of the issue into NEWS.md? Of course can link the PR and related issues.

Yep! I'll try to wrap this today.

amoeba commented 4 years ago

All ready for a look. See NEWS.md in this PR for a description of the changes.

To add tests for docs without schemaLocation, I ended up also giving the roundtrip function the ability to test that invalid docs are indeed invalid. Also, do to the new behavior where emld automatically adds schemaLocation if absent in the source document, I had to remove schemaLocation from the before and after count comparison.

We'll have to have a look in the future about cleaning up the validation tests as we've been discussing here and on Slack.

cboettig commented 4 years ago

Awesome, this looks really good to me, and glad to have these issues addressed, even given the limitations with xml2 bindings.

I'm happy to merge this and prepare a CRAN update. @mbjones let me know if you want to give this a code review first?

amoeba commented 4 years ago

Last night I read the source for the libxml2 bindings and figured out that I can in fact get the namespace on the root so I'd like to re-factor this once more before merging.

amoeba commented 4 years ago

A lot simpler now: https://github.com/ropensci/emld/pull/53/commits/c746ad5c6bbc7d535ae77eb15caba188cbc64cc4.

mbjones commented 4 years ago

I just read through and this all sounds like an improvement. I don't have cycles to review this week, so I'd say go for it and we can always tweak it later. I think getting this out will be an improvement that everyone will benefit from. So, 👍 from me @cboettig

On a side note, I still don't see why we need to determine the QName of the root node at all to get validation to work. I'll need to look more carefully at the code, but it seems to me that we shouldn't have to tell libxml2 at all what the document type is -- it can figure it out on its own. That's what we do on the Java side. In particular, we really should just be providing an XML Catalog file listing all of the schema namespaces we know about, and then libxml2 will match namespaces using that. It seems to me the root of our problem is that xml2 only provides the method xml2::xml_validate(x, schema), and doesn't seem to support XML catalogs. Which means the application has to guess what the right schema is, when in fact that can and should be done by libxml2. And it also means that the validate function can only pass one schema document, when in reality multiple might be needed (e.g., EML and STMML schemas). The libxml2 docs say the following about XML Catalogs:

In a normal environment libxml2 will by default check the presence of a catalog in /etc/xml/catalog, and assuming it has been correctly populated, the processing is completely transparent to the document user.

The user can change the default catalog behaviour by redirecting queries to its own set of catalogs, this can be done by setting the XML_CATALOG_FILES environment variable to a list of catalogs, an empty one should deactivate loading the default /etc/xml/catalog default catalog.

So we should be able to add a catalog parameter to xml2::xml_validate that would get used to look up schema locations by redirecting via XML_CATALOG_FILES. The new signature might be xml2::xml_validate(x, schema=NA, catalog=NA) and when called with catalog='./catalog.xml' would set the appropriate environment variable before calling libxml2 routines (and would still honor schema docs that are directly passed in.

So, a bit more to discuss here, but maybe we should do that under a new issue?

amoeba commented 4 years ago

Thanks @mbjones. Re: /etc/xml/catalog/XML_CATALOG_FILES, I did see that when researching a fix here but couldn't get them to work with XML schemas. The docs indicate they should work with DTDs but don't mention XML schemas explicitly. The libxml docs indicate that schema support is incomplete and I haven't been able to find any worked examples online (yet?) so I wonder if catalogs may not work with XML schemas. I've followed an approach based on this excellent reference but didn't have any luck.

cboettig commented 4 years ago

@amoeba @mbjones Thanks both for the feedback! Merging this now but please do open another issue to track the above discussion. Sounds like it might be worth a bug report against xml2 R package, maybe linking back here for our reference.

Unrelated, but you can see devel is now failing due to ... being documented but not used, which last night's version of R-devel has now decided deserves a WARNING (and the resulting pls fix emails from CRAN for both emld and EML. I'll try and squash those tonight I guess and get this off.