Open andiwand opened 5 years ago
hi @andiwand
#include <cassert>
#include "tinyxml2.h"
int main() {
XMLDocument doc;
doc.Parse("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
"<!--that was a xml declaration-->"
"<root>"
// "<?pi_target pi_content?>"
"<!--that was a processing instruction-->"
"</root>");
printf("%d",doc.ErrorID()); ----success:doc.ErrorID()=0,fail:doc.ErrorID()=11
the situation can parse success When there is a declaration inside the node, the xml parsing fails. I personally think this may be a bug, because when I use libxml2, this situation can be successfully parsed.
hi @Alanscut ! the problem is that tinyxml2 classifies the statement <?pi_target pi_content?>
as an xml declaration but it is not, it is a processing instruction and valid xml. this is a bug because there is no distinction between those two things.
in my case i cannot simply remove the processing instruction from the input because i read it from a file.
`
// These strings define the matching patterns:
static const char xmlHeader = { "<?xml" };
static const char ProcessingInstructionHeader = { "<?" };
static const char commentHeader = { "<!--" };
static const char cdataHeader = { "<![CDATA[" };
static const char dtdHeader = { "<!" };
static const char elementHeader = { "<" }; // and a header for everything else; check last.
static const int xmlHeaderLen = 5;
static const int ProcessingInstructionHeaderLen = 2;
static const int commentHeaderLen = 4;
static const int cdataHeaderLen = 9;
static const int dtdHeaderLen = 2;
static const int elementHeaderLen = 1;
TIXMLASSERT( sizeof( XMLComment ) == sizeof( XMLUnknown ) ); // use same memory pool
TIXMLASSERT( sizeof( XMLComment ) == sizeof( XMLDeclaration ) ); // use same memory pool
XMLNode* returnNode = 0;
if ( XMLUtil::StringEqual( p, xmlHeader, xmlHeaderLen ) ) {
returnNode = CreateUnlinkedNode<XMLDeclaration>( _commentPool );
returnNode->_parseLineNum = _parseCurLineNum;
p += xmlHeaderLen;
}
else if ( XMLUtil::StringEqual( p, ProcessingInstructionHeader, ProcessingInstructionHeaderLen ) ) {
returnNode = CreateUnlinkedNode<XMLUnknown>( _commentPool ); // it needs to be XMLUnknown
returnNode->_parseLineNum = _parseCurLineNum;
p += ProcessingInstructionHeaderLen;
}
`
Just want to add to my proposal that according to the spec for processing instructions https://www.w3.org/TR/REC-xml/#sec-pi static const char xmlHeader = { "<?xml" }; static const char xmlHeader = { "<?XML" }; XML in capital letters needs to be checked, too.
Well, according to the spec, if I read it correctly, PI which is XML declaration must be only ?xml (lowercase), but in other Processing instructions names like xml and XML, and mixed case like xMl XmL etc are disallowed.
Hence it is enough to check only against <?xml
EDIT: link to xml decl: https://www.w3.org/TR/REC-xml/#sec-prolog-dtd
This failure to handle processing instructions is causing me grief. Any chance of a fix? In the interim I'm going to have to preprocess the file and delete them.
Test XML:
Can be validated. (https://codebeautify.org/xmlvalidator)
Test code:
Hotfix: https://github.com/leethomason/tinyxml2/compare/master...andiwand:hotfix/processing-instruction
References: