renggli / dart-xml

Lightweight library for parsing, traversing, and transforming XML in Dart.
http://pub.dartlang.org/packages/xml
MIT License

Edge case: Brackets within comments within DOCTYPE #144

Closed h7x4 closed 2 years ago

h7x4 commented 2 years ago

I'm currently working with a file that starts off like this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE kanjidic2 [
    <!-- Version 1.6 - April 2008
    This is the DTD of the XML-format kanji file combining information from
    the KANJIDIC and KANJD212 files. It is intended to be largely self-
    documenting, with each field being accompanied by an explanatory
    comment.

    The file covers the following kanji:
    (a) the 6,355 kanji from JIS X 0208;
    (b) the 5,801 kanji from JIS X 0212;
    (c) the 3,693 kanji from JIS X 0213 as follows:
        (i) the 2,741 kanji which are also in JIS X 0212 have
        JIS X 0213 code-points (kuten) added to the existing entry;
        (ii) the 952 "new" kanji have new entries.

    At the end of the explanation for a number of fields there is a tag
    with the format [N]. This indicates the leading letter(s) of the
    equivalent field in the KANJIDIC and KANJD212 files.

    The KANJIDIC documentation should also be read for additional 
    information about the information in the file.
    -->
<!ELEMENT kanjidic2 (header,character*)>
<!ELEMENT header (file_version,database_version,date_of_creation)>
...

I'm getting the following error:

Unhandled exception:
XmlParserException: ">" expected at 18:21

My guess is that the [N] inside the comment is getting in the way: the parser treats its ] as the end of the DOCTYPE internal subset and then expects a > to close the declaration.
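The suspected failure mode can be illustrated with a small, self-contained Dart sketch. It is a hypothetical illustration of the guess above, not dart-xml's actual code: `naiveSubsetEnd` is an invented helper, and the input is a shortened stand-in for the file with an assumed closing `]>`.

```dart
// Hypothetical illustration only; this is NOT dart-xml's implementation.
// A naive skipper treats the first ']' after the opening '[' as the end
// of the DOCTYPE internal subset, without looking at comments.
int naiveSubsetEnd(String input) {
  final open = input.indexOf('[');
  return open < 0 ? -1 : input.indexOf(']', open + 1);
}

void main() {
  // Shortened stand-in for the reported file (closing "]>" is assumed).
  const input = '<!DOCTYPE kanjidic2 [\n'
      '<!-- a tag with the format [N]. -->\n'
      '<!ELEMENT kanjidic2 (header,character*)>\n'
      ']>';
  final end = naiveSubsetEnd(input);
  // The first ']' found is the one in "[N]", so the next character is
  // '.' rather than the '>' such a skipper would expect.
  print(input[end + 1]); // prints "."
}
```

If the parser behaves roughly like this, the error would be reported at the position right after the comment's `[N]`, which is consistent with the message pointing into the comment.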

renggli commented 2 years ago

Thank you for reporting and analyzing the problem. The current implementation does not attempt to parse the DOCTYPE element and instead just tries to read over it. In this case that obviously fails; I'll look into it.

For reference, the doctype syntax is documented here: https://www.w3.org/TR/xml/#sec-prolog-dtd
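One possible way to read over a DOCTYPE without tripping on brackets in comments is sketched below. This is a minimal sketch under stated assumptions, not the fix that was applied to the library: `skipDoctype` is a hypothetical helper, and it only accounts for comments and quoted literals (processing instructions and parameter-entity references inside the subset are not handled).

```dart
/// Hypothetical helper, not part of dart-xml.
/// Returns the index just past the '>' that closes the DOCTYPE beginning at
/// [start], or -1 if the declaration is unterminated.
int skipDoctype(String input, int start) {
  var i = start;
  var inSubset = false; // true between '[' and the matching ']'
  while (i < input.length) {
    // Skip comments wholesale so brackets inside them are ignored.
    if (input.startsWith('<!--', i)) {
      final end = input.indexOf('-->', i + 4);
      if (end < 0) return -1;
      i = end + 3;
      continue;
    }
    final c = input[i];
    // Skip quoted literals (system IDs, entity values) the same way.
    if (c == '"' || c == "'") {
      final end = input.indexOf(c, i + 1);
      if (end < 0) return -1;
      i = end + 1;
      continue;
    }
    if (c == '[') {
      inSubset = true;
    } else if (c == ']') {
      inSubset = false;
    } else if (c == '>' && !inSubset) {
      return i + 1; // a '>' outside the subset closes the declaration
    }
    i++;
  }
  return -1;
}

void main() {
  const input = '<!DOCTYPE a [<!-- [N]. > -->\n'
      '<!ELEMENT a (#PCDATA)>\n'
      ']><a/>';
  print(input.substring(skipDoctype(input, 0))); // prints "<a/>"
}
```

The design choice here is to keep skipping rather than fully parsing the DTD: only the constructs that can legally contain `]` or `>` are recognized, which keeps the scanner small while covering the case reported in this issue.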