Note that the schema does not completely specify all of the restrictions of the EML specification. The conditional restrictions on uniqueness of IDs depending on the value of the scope attribute were simply not expressible in XML Schema. So we wrote the EML Validator to accompany the spec, which fully validates the document. It is available as a service on the web (http://knb.ecoinformatics.org/emlparser/), and can be run via a Java API call after compiling the EMLParser validation class. Source code is in the EML SVN repo: https://code.ecoinformatics.org/code/eml/trunk/src/org/ecoinformatics/eml/EMLParser.java
Fantastic, was wondering about this.
Will have to see if I can figure out the Java API call; we should have the wrappers available in R, but I have virtually no Java experience. I suppose we could alternatively bundle the Java in the R package and validate locally.
Looks like this gives us three levels of validation: whether the EML parses, whether it validates against the schema, and whether it passes the EML Validator's id checks. Any advice on the right workflow / user interface for this would be good -- e.g. I'm not sure we would want to run the EML Validator every time a user reads in or writes out some EML; it might be sufficient to know that it parses. Still, we want to provide support for these tools...
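For concreteness, a rough sketch of those three levels in R (the file paths are illustrative; `xmlParse()` and `xmlSchemaValidate()` are real calls from the XML package, the third level is only sketched):

```r
library(XML)

# Level 1: does the EML parse at all?
doc <- tryCatch(xmlParse("my_eml_data.xml"), error = function(e) NULL)

# Level 2: does it validate against a local copy of the schema?
# A status of 0 indicates the document is schema-valid.
schema_ok <- xmlSchemaValidate("eml.xsd", "my_eml_data.xml")$status == 0

# Level 3: the EML Validator's id checks, e.g. via the online service
# at http://knb.ecoinformatics.org/emlparser/ (see below).
```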
Also, I'm not actually clear on how / when we should be going about generating element ids in the first place. The only place I currently have element ids is on `<attribute>` nodes (and one on `<additionalMetadata id = 'figshare'>` for specifying what metadata is exposed to figshare's database). Any advice on how to come up with `<attribute>` ids? (Currently I create a hash from the `<attributeDescription>` text, just as a placeholder -- obviously this is not what we actually want.)
Regarding the validator -- it doesn't check that much, and so I think it would be much better to just reimplement those checks in R. The Java code just iterates across a bunch of XML elements, checking that the id attribute on those elements is unique within the document, and then checks that any `<references>` elements in the document point at an `@id` in the document -- i.e., there are no dangling pointers. The list of elements that are checked is in a config file located here: https://code.ecoinformatics.org/code/eml/trunk/lib/config.xml
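A minimal sketch of those two checks in R using the XML package (the function name is hypothetical, and checking every element rather than the config file's list is a simplifying assumption):

```r
library(XML)

# Hypothetical helper: the EML Validator's two id checks, reimplemented.
eml_id_checks <- function(file) {
  doc <- xmlParse(file)
  # Check 1: id attributes must be unique within the document.
  ids  <- unlist(xpathApply(doc, "//*[@id]", xmlGetAttr, "id"))
  dups <- unique(ids[duplicated(ids)])
  # Check 2: every <references> element must point at an existing id.
  refs     <- unlist(xpathApply(doc, "//references", xmlValue))
  dangling <- setdiff(refs, ids)
  list(ids_unique   = length(dups) == 0,
       refs_resolve = length(dangling) == 0,
       duplicated_ids = dups,
       dangling_refs  = dangling)
}
```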
About when to check validity -- in Morpho we check validity when we try to upload to an external repository, as we want to ensure that everything is in good working order before distributing the document.
The only time you really need to generate ids is when you want to have a handle to reference. The main place for that is `<attribute>` elements. The other place they are commonly used is for unit definitions in STMML in `additionalMetadata`, and to provide a globally-scoped identifier for individuals (e.g., providing an ORCID ID for someone when listing them as a Creator).
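One hedged option for replacing the hash placeholder mentioned above: generate document-unique ids with the uuid package (an assumption on my part; reml may settle on a different scheme):

```r
library(uuid)

# Hypothetical: assign a random, unique id to each <attribute> node.
attribute_id <- UUIDgenerate()
# e.g. "f47ac10b-58cc-4372-a567-0e02b2c3d479"
```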
@duncantl added code to validate EML from R using the online Java tools from @mbjones in commit 1722a4be7297c8905179f945385092b0ebaedc4d
Philosophical point: validation is really a concern for us / developers, not for end users. Our programmatically generated EML should always be valid. Meanwhile, if we read in EML that is not valid, what are we going to do -- just throw an error? Better to make the most of it.
Note: we can validate in R using the XML package (via libxml2), using the `xmlSchemaValidate()` function, though if I understand correctly, the code added above should also perform the additional validation that @mbjones describes.
There are two examples in an `if(FALSE) {}` block at the top of the file containing the functions. These can be used for a unit test. Also, we should either raise an error or put a class on the result of `processValidateResponse` if either of the tests does not pass.
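Something like this sketch, perhaps (the return shape of `processValidateResponse` and the class name are assumptions):

```r
res <- processValidateResponse(response)
if (!all(unlist(res))) {
  # tag the result so callers can dispatch on failure...
  class(res) <- c("EMLValidationFailure", class(res))
  # ...or, alternatively, fail loudly:
  # stop("EML validation failed")
}
```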
(Not closed at all!)
The recursion problem in the XMLSchema package, where a schema imports another schema which in turn imports the first, is now fixed. Use `inline = FALSE` in the call to `readSchema`:
```r
library(XMLSchema)
x = readSchema("~/Downloads/eml-2.1.1/eml.xsd", inline = FALSE)
```
This cures the parsing and processing of the types. It remains to be seen if it breaks anything else. And of course there will still be issues with the actual type descriptions it has created.
@duncantl Great, `readSchema` works for me. Hitting an error when I then try `defineClasses`:

```
Note: method with signature 'RestrictedStringDefinition#list' chosen for function 'resolve',
 target signature 'RestrictedStringDefinition#SchemaCollection'.
 "SchemaType#SchemaCollection" would also be valid
Error in .getClassFromCache(Class, where) :
  attempt to use zero-length variable name
```
Also, `xmlSchemaValidate()` seems unhappy about my EML files, even though they validate fine with your new `eml_validate` function:
```r
> xmlSchemaValidate("inst/xsd/eml.xsd", "inst/doc/my_eml_data.xml")
$status
[1] 1845

$errors
[[1]]
$msg
[1] "Element '{eml://ecoinformatics.org/eml-2.1.0}eml': No matching global declaration available for the validation root.\n"

$code
XML_SCHEMAV_CVC_ELT_1
                 1845

$domain
XML_FROM_SCHEMASV
               17

$line
[1] 2

$col
[1] 0

$level
XML_ERR_ERROR
            2

$filename
[1] "inst/doc/my_eml_data.xml"

attr(,"class")
[1] "XMLError"

attr(,"class")
[1] "XMLStructuredErrorList"

attr(,"class")
[1] "XMLSchemaValidationResults"
```
I am looking into the `defineClasses()` and `defClass()` issues.
As for the validation error, I suspect the error message is correct, although I can't tell what schema you are using. I imagine you are using the eml-2.1.1 schema while my_eml_data.xml is using the namespace `eml://ecoinformatics.org/eml-2.1.0`.
Indeed! We've moved reml to writing 2.1.1 by default (#36), and now our example generated EML passes `xmlSchemaValidate()` against the 2.1.1 `eml.xsd`, as expected.
Sounds good -- keep us posted on `defineClasses()`.
@mbjones is there an external URL I can use to validate against? (Just for the schema files -- I know we can use http://knb.ecoinformatics.org/emlparser/, but I have to figure out what went wrong in the RHTMLForms function first.) Currently I have a local copy of the schema downloaded that I use, but a more portable solution would be better.
@cboettig Not quite sure what you are asking. The parsing service can be called at http://knb.ecoinformatics.org/emlparser/parse as long as you do an HTTP POST and provide the proper parameters. For example, with curl you could do:

```
curl -F action=textparse -F doctext=@/Users/jones/Desktop/eml-sample.xml http://knb.ecoinformatics.org/emlparser/parse
```
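A rough R equivalent of that curl call, using RCurl's `postForm()` (the file name here is illustrative):

```r
library(RCurl)

# POST the document to the parser service, mimicking curl's -F flags.
response <- postForm("http://knb.ecoinformatics.org/emlparser/parse",
                     action  = "textparse",
                     doctext = fileUpload("eml-sample.xml"))
```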
Is there a canonical URL for eml.xsd? I only see the option to download the xsd files as a tarball, not browse them as web files...
Ah. Now I see. No, we do not provide one, because it is a security hole for applications to use an external schema for validation (similar to a SQL injection attack, this is an XML injection, also called XML External Entity (XXE) processing). Although our copy may be secure now, if many applications point at it, then it becomes an attractive and central point of attack, and if our host is compromised, then all apps that point at our schema URL would potentially be compromised as well. Compromises can lead to reading sensitive data on your computer (such as files like /etc/passwd), injection of malicious content into your application, and other maladies. So we try to make it hard for people to be insecure. In general, trusting xsi:schemaLocation is allowing a third party to inject data into your process -- you are better served by downloading the schema, inspecting it, and, if it is trustworthy, pointing at your local copy for validation.
@mbjones I'm not sure I follow the logic. Wouldn't that mean, even more so by extension, that webpages shouldn't include .js from anywhere, shouldn't include CSS from anywhere, etc., not even from the originating website, because that would make it a target for being compromised? W3C schemas do give the schema location, and xs:imports even require it. For example, here's the PROV schema: http://www.w3.org/ns/prov-core.xsd Are you suggesting they are setting a bad example?
Yes, indeed -- you should only include JavaScript from a highly trusted source. That is even more of an issue than XML. Even when you don't intentionally include untrusted JS code, people develop clever XSRF and related attacks just to inject JS into your pages. It's a bad thing. I inspect any JS I include, lock it down as a local copy, and don't rely on external copies, as I would be trusting the security and goodwill of that host.
The W3C does know about these XML injection issues, and they influenced its web architecture documents. The W3C TAG discussed these issues, which was expressed in the web architecture principle that "reference does not imply dereference"; although the security implications are mentioned, they are glossed over in that document. What they are saying is that just because someone provides an xsi:schemaLocation in their document is not an indication that you should dereference it in your parsing and validation. If that were so, then every document author would have the ability to inject potentially harmful content into your process. Rather, xsi:schemaLocation is defined as a "hint" to help someone locate a schema for a namespace that is unknown, but blindly importing them is certainly an exploitable security hole. I first learned of these issues from a talk in 2000 by David Megginson from the W3C XML Working Group, but they persist today (and are actually easier to exploit, as there are now more injection vectors). The recommended practice to avoid XXE attacks is to download and inspect DTDs and schemas yourself, and then set up a catalog mapping namespaces to the vetted local schema copies for validation, thereby avoiding potential injection attacks. XML (and SGML before it) catalogs are a common technology, and every XML and XSLT engine I have seen supports them in its API. We use them in Metacat, and simply register each schema we wish to support with its associated xsd or dtd that we have inspected and stored locally. This is pretty simple and avoids user-driven content injection.
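In R, the XML package exposes libxml2's catalog support, so the same approach might look like this sketch (assuming the package's `catalogAdd()`/`catalogResolve()` functions; the namespace-to-path mapping is illustrative):

```r
library(XML)

# Map the EML namespace to a vetted local schema copy, so validation
# resolves to the local file rather than dereferencing a remote URL.
catalogAdd("eml://ecoinformatics.org/eml-2.1.1", "inst/xsd/eml.xsd", "uri")
catalogResolve("eml://ecoinformatics.org/eml-2.1.1")
# should now resolve to the local copy
```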
Our experience was that when we provided a resolvable copy of the schema, we started seeing many EML documents pointing at it (ok), and people automatically dereferencing it (bad). So we took it down. I think it's reasonable to argue that in principle we should have a resolvable copy of the schema, but in practice it led to bad dereferencing practices that we wanted to curtail in our community. For example, I think @cboettig was not aware that it's a potential security hole to directly dereference the EML xsd, and would not have found that out if we had placed eml.xsd at a resolvable location. So, with this in mind, should we put up a resolvable copy?
Related pieces:
- https://www.owasp.org/images/5/5d/XML_Exteral_Entity_Attack.pdf
- http://www.soatothecloud.com/2008/08/dont-follow-that-schemalocation.html
- http://www.slideshare.net/qqlan/bh-ready-v4
- http://www.securityfocus.com/archive/1/297846/30/0/threaded
@mbjones @hlapp Thanks to you both for the input and discussion here. A bit over my head, but I'm trying to follow along. @mbjones, stupid question: does the attack require that the attacker alter the schema file that lives at the given URL?
Yes, the attacker must manipulate one of the information sets that will be injected into the parsing process. Within the document itself, these include external entities that are defined, and any of the multiple external files that can be included by reference to a third-party URI. In the specific case of the namespace, the xsi:schemaLocation points at a schema document, and the attacker would need to modify that document, which would potentially compromise multitudes of computers if they all point at that single schema file.
@mbjones I know about the semantics of xsi:schemaLocation, and why it is only a hint for how to obtain the schema definition, not required to be dereferenceable. What I was asking is whether the W3C is setting a bad example by providing, in its own XSD documents, xsi:schemaLocation and xs:import URIs that actually do dereference to the correct schema document, even though, as you say, that's not required. I'm also sure that lots of people do dereference their schemaLocation URIs, and yet they haven't found this undesirable.
I also didn't suggest that one include JS, or XSD for that matter, from arbitrary and untrusted sources. What I did say is that, by your logic, JS included on an NCEAS-served website and loaded from the NCEAS server that serves it is bad and not to be trusted, because NCEAS servers could get compromised and hacked. While that is of course a possibility, and machines do get hacked every day, surely I shouldn't conclude from that not to visit NCEAS websites in my browser, but to download them first via cURL and then inspect them by hand for malicious content?
I don't find anything wrong or hazardous with trusting sites and URI locations provided by well-known institutions (such as, for example, the W3C), and I (and I don't think I'm alone with this) expect that these institutions will strive to apply best sysadmin practices to prevent such compromise. We certainly do so at NESCent, and we take security and compromise detection very seriously, and I would expect NCEAS to do no less. So I'm sorry, but I can't see the case for erecting barriers to developers who want to develop applications with your schemas. The W3C certainly doesn't.
The `eml_validate()` function is now working again. The problem seems to have been that the format of the HTML response changed from using `h2` to `h4` for the relevant headers.
So nothing to do with RHTMLForms, just how we process the HTML response.
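That processing now amounts to something like this sketch (`validator_html` stands in for the service's HTML response; `htmlParse()` and `xpathSApply()` are from the XML package, and the exact XPath is an assumption):

```r
library(XML)

# Pull the validation messages out of the service's HTML response,
# which now puts the relevant headers in <h4> rather than <h2> nodes.
doc <- htmlParse(validator_html, asText = TRUE)
headers <- xpathSApply(doc, "//h4", xmlValue)
```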
Ideally, the validator would allow us to request the response as XML or JSON and give it to us without the HTML formatting.
That would be a good change. We didn't originally design the validator that way, but it would be fairly easy to change it to output more structured data on request.
Since we have a working `eml_validate` function at this point, I think we can close this issue. See #46 on workflow for validation.