leethomason / tinyxml2

TinyXML2 is a simple, small, efficient, C++ XML parser that can be easily integrated into other programs.
zlib License
5.07k stars 1.83k forks source link

Parsing fails because of processing instruction #777

Open andiwand opened 5 years ago

andiwand commented 5 years ago

Test XML:

<?xml version="1.0" encoding="UTF-8"?>
<!--that was a xml declaration-->
<root>
    <?pi_target pi_content?>
    <!--that was a processing instruction-->
</root>

Can be validated. (https://codebeautify.org/xmlvalidator)

Test code:

#include <cassert>
#include "tinyxml2.h"

int main() {
    tinyxml2::XMLDocument doc;
    tinyxml2::XMLError error = doc.Parse("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
              "<!--that was a xml declaration-->"
              "<root>"
              "<?pi_target pi_content?>"
              "<!--that was a processing instruction-->"
              "</root>");
    // valid according to https://codebeautify.org/xmlvalidator

    assert(error == tinyxml2::XMLError::XML_SUCCESS);
    assert(doc.FirstChild()->ToDeclaration() != nullptr);
    assert(doc.FirstChild()->NextSibling()->ToComment() != nullptr);
    assert(doc.FirstChild()->NextSibling()->NextSibling()->ToElement() != nullptr);
    //assert(doc.FirstChild()->NextSibling()->NextSibling()->ToElement()->FirstChild()->ToUnknown() != nullptr);

    return 0;
}

Hotfix: https://github.com/leethomason/tinyxml2/compare/master...andiwand:hotfix/processing-instruction

References:

Alanscut commented 5 years ago

hi @andiwand

#include <cassert>
#include "tinyxml2.h"

int main() {
    XMLDocument doc;
    doc.Parse("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
              "<!--that was a xml declaration-->"
              "<root>"
    //          "<?pi_target pi_content?>"
              "<!--that was a processing instruction-->"
              "</root>");
printf("%d",doc.ErrorID()); ----success:doc.ErrorID()=0,fail:doc.ErrorID()=11

the situation can parse success When there is a declaration inside the node, the xml parsing fails. I personally think this may be a bug, because when I use libxml2, this situation can be successfully parsed.

andiwand commented 5 years ago

hi @Alanscut ! the problem is that tinyxml2 classifies the statement <?pi_target pi_content?> as an xml declaration but it is not, it is a processing instruction and valid xml. this is a bug because there is no distinction between those two things. in my case i cannot simply remove the processing instruction from the input because i read it from a file.

ReinholdH commented 4 years ago

Hi, Thx for the posts. I ran into the same issue. Here is a simple proposal for a fix in tinyxml2.cpp function XMLDocument::Identify May be somebody can validate it and include the changes into the next update.

` // These strings define the matching patterns: static const char xmlHeader = { "<?xml" };
static const char
ProcessingInstructionHeader = { "<?" };
static const char commentHeader = { "<!--" }; static const char cdataHeader = { "<![CDATA[" }; static const char dtdHeader = { "<!" }; static const char elementHeader = { "<" }; // and a header for everything else; check last.

static const int xmlHeaderLen       = 5;                        
static const int ProcessingInstructionHeaderLen = 2;        
static const int commentHeaderLen   = 4;
static const int cdataHeaderLen     = 9;
static const int dtdHeaderLen       = 2;
static const int elementHeaderLen   = 1;

TIXMLASSERT( sizeof( XMLComment ) == sizeof( XMLUnknown ) );        // use same memory pool
TIXMLASSERT( sizeof( XMLComment ) == sizeof( XMLDeclaration ) );    // use same memory pool
XMLNode* returnNode = 0;
if ( XMLUtil::StringEqual( p, xmlHeader, xmlHeaderLen ) ) {
    returnNode = CreateUnlinkedNode<XMLDeclaration>( _commentPool );
    returnNode->_parseLineNum = _parseCurLineNum;
    p += xmlHeaderLen;
}
else if ( XMLUtil::StringEqual( p, ProcessingInstructionHeader, ProcessingInstructionHeaderLen ) ) {
    returnNode = CreateUnlinkedNode<XMLUnknown>( _commentPool );    // it needs to be XMLUnknown        
    returnNode->_parseLineNum = _parseCurLineNum;                                                   
    p += ProcessingInstructionHeaderLen;                                                            
}

`

ReinholdH commented 4 years ago

Just want to add to my proposal that according to the spec for processing instructions https://www.w3.org/TR/REC-xml/#sec-pi static const char xmlHeader = { "<?xml" }; static const char xmlHeader = { "<?XML" }; XML in capital letters needs to be checked, too.

omgtehlion commented 4 years ago

Well, according to the spec, if I read it correctly, PI which is XML declaration must be only ?xml (lowercase), but in other Processing instructions names like xml and XML, and mixed case like xMl XmL etc are disallowed.

Hence it is enough to check only against <?xml

EDIT: link to xml decl: https://www.w3.org/TR/REC-xml/#sec-prolog-dtd

rpatters1 commented 3 weeks ago

This failure to handle processing instructions is causing me grief. Any chance of a fix? In the interim I'm going to have to preprocess the file and delete them.