shepmaster / sxd-document

An XML library in Rust
MIT License
152 stars 36 forks source link

Add at least rudimentary DTD handling #50

Open frp opened 7 years ago

frp commented 7 years ago

sxd-document does not work if the document has internal DTD.

Trying to parse the example document (taken from here):

<?xml version="1.0"?>
<!DOCTYPE note [
<!ELEMENT note (to,from,heading,body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>
]>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend</body>
</note>

I get the following error: (37, [Expected("SYSTEM")]). 37 is the position of the first [. Adding full parsing of internal DTD might be quite a big deal, but, for my purposes (parsing JMdict) just ignoring the DTD content would be fine.

shepmaster commented 6 years ago

Mentoring Instructions

The DTD grammar and tokens are defined in the XML spec.

Work for this will probably largely live in parse_document_type_declaration:

https://github.com/shepmaster/sxd-document/blob/5e2253d9f8aa6ae7c4d4d7bcc83ebf0caada9e42/src/parser.rs#L492

For now, we aren't really worried about amazing performance or even collecting the data, all we need to make sure of is that the parser can accept the input data.

A small example to add to the tests:

<?xml version="1.0"?>
<!DOCTYPE cXML SYSTEM "http://xml.cxml.org/schemas/cXML/1.2.014/cXML.dtd">
<cXML />

Strangely, it kind of seems like this much of it should already be supported; I'm guessing that there's some small mistake in what the parser accepts.

onelson commented 6 years ago

OK, so I've been tinkering and learning. So far, I have found that the presence of a / character is enough to cause breakage.

I traced the failure back to parse_one_quoted_value

https://github.com/shepmaster/sxd-document/blob/5e2253d9f8aa6ae7c4d4d7bcc83ebf0caada9e42/src/parser.rs#L385-L387

I'm assuming that the 2nd try_parse!() here is meant to consume all characters up until the token passed in (in this case either a ' or a "), yet it's terminating prematurely at the /.

I don't really see a reason for this, but at this superficial level I suspect the bug is actually in the peresil crate which provides the try_parse!() macro. I may pivot over to write some tests for peresil and see if I can reproduce with a simple case.

shepmaster commented 6 years ago

It's actually because the grammar for parse_external_id got mistranslated:

commit 7e6b99340b40a06ec938c9d12b96d103c4e4a5ac
Author: Jake Goulding <jake.goulding@gmail.com>
Date:   Wed Mar 7 19:31:17 2018 -0500

    DTD SystemLiterals can contain any character

diff --git a/src/parser.rs b/src/parser.rs
index 995ba0d..b645f9c 100644
--- a/src/parser.rs
+++ b/src/parser.rs
@@ -481,8 +481,10 @@ fn parse_external_id<'a>(pm: &mut XmlMaster<'a>, xml: StringPoint<'a>)
     let (xml, _) = try_parse!(xml.expect_literal("SYSTEM"));
     let (xml, _) = try_parse!(xml.expect_space());
     let (xml, external_id) = try_parse!(
-        parse_quoted_value(pm, xml, |_, xml, _| xml.consume_name().map_err(|_| SpecificError::ExpectedSystemLiteral))
-        );
+        parse_quoted_value(pm, xml, |_, xml, quote| {
+            xml.consume_attribute_value(quote).map_err(|_| SpecificError::ExpectedSystemLiteral)
+        })
+    );

     success(external_id, xml)
 }
@@ -1311,11 +1313,11 @@ mod test {

     #[test]
     fn a_prolog_with_a_document_type_declaration() {
-        let package = quick_parse("<?xml version='1.0'?><!DOCTYPE doc SYSTEM \"doc.dtd\"><hello/>");
+        let package = quick_parse(r#"<?xml version="1.0"?><!DOCTYPE cXML SYSTEM "http://xml.cxml.org/schemas/cXML/1.2.014/cXML.dtd"><cXML />"#);
         let doc = package.as_document();
         let top = top(&doc);

-        assert_qname_eq!(top.name(), "hello");
+        assert_qname_eq!(top.name(), "cXML");
     }

     #[test]
onelson commented 6 years ago

Oh, mm. :blush: I guess I was about to go on a wild goose chase.

onelson commented 6 years ago

Taking almost verbatim what was shown here, the test is passing.

In order to not be a total plagiarist, I'm going to try and work up a solution for the OP's case. Worth a shot at least!

onelson commented 6 years ago

I've added the example from @frp as a test (which is now passing). Not sure what else I can do outside of addressing any nitpicks.