Open frp opened 7 years ago
The DTD grammar and tokens are defined in the XML spec.
Work for this will probably largely live in parse_document_type_declaration
:
For now, we aren't really worried about amazing performance or even collecting the data, all we need to make sure of is that the parser can accept the input data.
A small example to add to the tests:
<?xml version="1.0"?>
<!DOCTYPE cXML SYSTEM "http://xml.cxml.org/schemas/cXML/1.2.014/cXML.dtd">
<cXML />
Strangely, it kind of seems like this much of it should already be supported; I'm guessing that there's some small mistake in what the parser accepts.
OK, so I've been tinkering and learning. So far, I have found that the presence of a /
character is enough to cause breakage.
I traced the failure back to parse_one_quoted_value
I'm assuming that the 2nd try_parse!()
here is meant to consume all characters up until the token passed in (in this case either a '
or a "
), yet it's terminating prematurely at the /
.
I don't really see a reason for this, but at this superficial level I suspect the bug is actually in the peresil
crate which provides the try_parse!()
macro. I may pivot over to write some tests for peresil
and see if I can reproduce with a simple case.
It's actually because the grammar for parse_external_id
got mistranslated:
commit 7e6b99340b40a06ec938c9d12b96d103c4e4a5ac
Author: Jake Goulding <jake.goulding@gmail.com>
Date: Wed Mar 7 19:31:17 2018 -0500
DTD SystemLiterals can contain any character
diff --git a/src/parser.rs b/src/parser.rs
index 995ba0d..b645f9c 100644
--- a/src/parser.rs
+++ b/src/parser.rs
@@ -481,8 +481,10 @@ fn parse_external_id<'a>(pm: &mut XmlMaster<'a>, xml: StringPoint<'a>)
let (xml, _) = try_parse!(xml.expect_literal("SYSTEM"));
let (xml, _) = try_parse!(xml.expect_space());
let (xml, external_id) = try_parse!(
- parse_quoted_value(pm, xml, |_, xml, _| xml.consume_name().map_err(|_| SpecificError::ExpectedSystemLiteral))
- );
+ parse_quoted_value(pm, xml, |_, xml, quote| {
+ xml.consume_attribute_value(quote).map_err(|_| SpecificError::ExpectedSystemLiteral)
+ })
+ );
success(external_id, xml)
}
@@ -1311,11 +1313,11 @@ mod test {
#[test]
fn a_prolog_with_a_document_type_declaration() {
- let package = quick_parse("<?xml version='1.0'?><!DOCTYPE doc SYSTEM \"doc.dtd\"><hello/>");
+ let package = quick_parse(r#"<?xml version="1.0"?><!DOCTYPE cXML SYSTEM "http://xml.cxml.org/schemas/cXML/1.2.014/cXML.dtd"><cXML />"#);
let doc = package.as_document();
let top = top(&doc);
- assert_qname_eq!(top.name(), "hello");
+ assert_qname_eq!(top.name(), "cXML");
}
#[test]
Oh, mm. :blush: I guess I was about to go on a wild goose chase.
Taking almost verbatim what was shown here, the test is passing.
In order to not be a total plagiarist, I'm going to try and work up a solution for the OP's case. Worth a shot at least!
I've added the example from @frp as a test (which is now passing). Not sure what else I can do outside of addressing any nitpicks.
sxd-document does not work if the document has internal DTD.
Trying to parse the example document (taken from here):
I get the following error:
(37, [Expected("SYSTEM")])
. 37 is the position of the first[
. Adding full parsing of internal DTD might be quite a big deal, but, for my purposes (parsing JMdict) just ignoring the DTD content would be fine.