graetzer / GDataXML-HTML

HTML and XML parser for iOS and OSX, supports XPath
Apache License 2.0
289 stars 121 forks source link

Crash with invalid xml #14

Closed artemyarulin closed 8 years ago

artemyarulin commented 8 years ago

Following xml causing a crash:

<?xml version="1.0"?>
<!DOCTYPE EXAMPLE SYSTEM "example.dtd" [
<!ENTITY xml "<prefix:node>prefix is indeclared here</prefix:node>">
]>
<EXAMPLE xmlns:prefix="http://example.com">
  &xml;
</EXAMPLE>

Actually in debug version following assert https://github.com/graetzer/GDataXML-HTML/blob/master/Pod/Classes/GDataXMLNode.m#L375-L379 will fail:

- (NSString *)stringFromXMLString:(const xmlChar *)chars {

#if DEBUG
    NSCAssert(chars != NULL, @"GDataXMLNode sees an unexpected empty string");
#endif
    if (chars == NULL) return nil;

If we remove it (or use a release version) lib will crash later on in https://github.com/graetzer/GDataXML-HTML/blob/master/Pod/Classes/GDataXMLNode.m#L582-L586 in the if check:

  if (asprintf(&qname, "%s:%s", (const char *)xmlNode_->ns->prefix,
                         xmlNode_->name) != -1) {
                str = [self stringFromXMLString:(const xmlChar *)qname];
                free(qname);
            }
artemyarulin commented 8 years ago

Hm, looks like assert is not related - even if I disable it and return empty string instead of nil it will crash

UPD: My noob skills are too low to handle this by myself. As I understood it crashes inside asprintf in strlen(xmlNode_->ns->prefix) causing EXC_BAD_ACCESS, which probably means the string is not null terminated e.g. pointing to garbage. I guess.

Here backtrace:

* thread #6: tid = 0x2f1b8b, 0x000000010f5c6c32 libsystem_c.dylib`strlen + 18, queue = 'com.facebook.React.RNMXmlQueue', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
    frame #0: 0x000000010f5c6c32 libsystem_c.dylib`strlen + 18
    frame #1: 0x000000010f60c96c libsystem_c.dylib`__vfprintf + 5713
    frame #2: 0x000000010f63449d libsystem_c.dylib`__v2printf + 609
    frame #3: 0x000000010f60ae40 libsystem_c.dylib`_vasprintf + 569
    frame #4: 0x000000010f601eff libsystem_c.dylib`asprintf + 201
  * frame #5: 0x000000010bc463ff kohRN`-[GDataXMLNode qualifiedName](self=0x00007fa522e1b9e0, _cmd="qualifiedName") + 319 at GDataXMLNode.m:582
    frame #6: 0x000000010bc4652c kohRN`-[GDataXMLNode name](self=0x00007fa522e1b9e0, _cmd="name") + 92 at GDataXMLNode.m:602
    frame #7: 0x000000010bd0262f kohRN`+[RNMXml readNode:](self=RNMXml, _cmd="readNode:", node=0x00007fa522e1b9e0) + 1551 at RNMXml.m:122
    frame #8: 0x000000010bd02d17 kohRN`__19+[RNMXml readNode:]_block_invoke106(.block_descriptor=0x000070000029f0a0, child=0x00007fa522e1b9e0, idx=0, stop=NO) + 103 at RNMXml.m:119
    frame #9: 0x000000010cff0a42 CoreFoundation`__53-[__NSArrayM enumerateObjectsWithOptions:usingBlock:]_block_invoke + 114
    frame #10: 0x000000010cff0132 CoreFoundation`-[__NSArrayM enumerateObjectsWithOptions:usingBlock:] + 194
    frame #11: 0x000000010bd025ed kohRN`+[RNMXml readNode:](self=RNMXml, _cmd="readNode:", node=0x00007fa522ed9c70) + 1485 at RNMXml.m:118
    frame #12: 0x000000010bd02d17 kohRN`__19+[RNMXml readNode:]_block_invoke106(.block_descriptor=0x000070000029f3e0, child=0x00007fa522ed9c70, idx=1, stop=NO) + 103 at RNMXml.m:119
    frame #13: 0x000000010cff0a42 CoreFoundation`__53-[__NSArrayM enumerateObjectsWithOptions:usingBlock:]_block_invoke + 114
    frame #14: 0x000000010cff015f CoreFoundation`-[__NSArrayM enumerateObjectsWithOptions:usingBlock:] + 239
    frame #15: 0x000000010bd025ed kohRN`+[RNMXml readNode:](self=RNMXml, _cmd="readNode:", node=0x00007fa522ee4d80) + 1485 at RNMXml.m:118
    frame #16: 0x000000010bd01f8e kohRN`+[RNMXml parseString:isHtml:error:](self=RNMXml, _cmd="parseString:isHtml:error:", string=@"<?xml version=\"1.0\"?>\n<!DOCTYPE EXAMPLE SYSTEM \"example.dtd\" [\n<!ENTITY xml \"<prefix:node>prefix is indeclared here</prefix:node>\">\n]>\n<EXAMPLE xmlns:prefix=\"http://example.com\">\n  &xml;\n</EXAMPLE>\n", isHtml=NO, error=domain: class name = __NSMallocBlock__ - code: 0) + 734 at RNMXml.m:94
    frame #17: 0x000000010bd0107a kohRN`-[RNMXml parseString:isHtml:callback:](self=0x00007fa522d0b240, _cmd="parseString:isHtml:callback:", string=@"<?xml version=\"1.0\"?>\n<!DOCTYPE EXAMPLE SYSTEM \"example.dtd\" [\n<!ENTITY xml \"<prefix:node>prefix is indeclared here</prefix:node>\">\n]>\n<EXAMPLE xmlns:prefix=\"http://example.com\">\n  &xml;\n</EXAMPLE>\n", isHtml=NO, callback=(kohRN`__41-[RCTModuleMethod processMethodSignature]_block_invoke_3 at RCTModuleMethod.m:174)) + 170 at RNMXml.m:29
    frame #18: 0x000000010cfb07ac CoreFoundation`__invoking___ + 140
    frame #19: 0x000000010cfb05fe CoreFoundation`-[NSInvocation invoke] + 286
    frame #20: 0x000000010d040f16 CoreFoundation`-[NSInvocation invokeWithTarget:] + 54
    frame #21: 0x000000010bca6d21 kohRN`-[RCTModuleMethod invokeWithBridge:module:arguments:](self=0x00007fa522d3fe20, _cmd="invokeWithBridge:module:arguments:", bridge=0x00007fa522f0aaa0, module=0x00007fa522d0b240, arguments=@"3 objects") + 1921 at RCTModuleMethod.m:419
    frame #22: 0x000000010bc612f2 kohRN`-[RCTBatchedBridge _handleRequestNumber:moduleID:methodID:params:](self=0x00007fa522f0aaa0, _cmd="_handleRequestNumber:moduleID:methodID:params:", i=0, moduleID=25, methodID=2, params=@"3 objects") + 802 at RCTBatchedBridge.m:737
    frame #23: 0x000000010bc60b1a kohRN`__34-[RCTBatchedBridge _handleBuffer:]_block_invoke(.block_descriptor=0x00007fa522ee2d10) + 826 at RCTBatchedBridge.m:685
    frame #24: 0x000000010f531d59 libdispatch.dylib`_dispatch_call_block_and_release + 12
    frame #25: 0x000000010f54d4a7 libdispatch.dylib`_dispatch_client_callout + 8
    frame #26: 0x000000010f5370e8 libdispatch.dylib`_dispatch_queue_drain + 1048
    frame #27: 0x000000010f536aa0 libdispatch.dylib`_dispatch_queue_invoke + 595
    frame #28: 0x000000010f5383b8 libdispatch.dylib`_dispatch_root_queue_drain + 565
    frame #29: 0x000000010f53817c libdispatch.dylib`_dispatch_worker_thread3 + 98
    frame #30: 0x000000010f8794f2 libsystem_pthread.dylib`_pthread_wqthread + 1129
    frame #31: 0x000000010f877375 libsystem_pthread.dylib`start_wqthread + 13
graetzer commented 8 years ago

Can you print the content of xmlNode_->name or of xmlNode_->ns->prefix when you debug it in XCode?

I guess one of these values has to be garbage, during debugging you could try the command window to evaluate the expression strl(xmlNode_->ns->prefix)

artemyarulin commented 8 years ago

I guess you meant strlen?

Here the output:

...
namespace warning : Namespace prefix prefix was not found
<prefix:node>prefix is indeclared here</prefix:node>
            ^

(lldb) po xmlNode_->ns->prefix
""
(lldb) po xmlNode_->name
"xml"
(lldb) po strl(xmlNode_->ns->prefix)
error: use of undeclared identifier 'strl'
error: 1 errors parsing expression
(lldb) po strlen(xmlNode_->name)
0x0000000000000003
(lldb) po strlen(xmlNode_->ns->prefix)
error: Execution was interrupted, reason: EXC_BAD_ACCESS (code=EXC_I386_GPFLT).
The process has been returned to the state before expression evaluation.
error: Execution was interrupted, reason: EXC_BAD_ACCESS (code=EXC_I386_GPFLT).
The process has been returned to the state before expression evaluation.
(lldb) print xmlNode_->ns->prefix
(const xmlChar *) $4 = 0x6572616c6365646e ""
(lldb) print xmlNode_->name
(const xmlChar *) $5 = 0x00007fb153921c28 "xml"
graetzer commented 8 years ago

I don't get it, how can strlen crash, but the debugger displays an empty string ? Anyway it would be interesting to know where this value becomes corrupted

artemyarulin commented 8 years ago

I've created a branch ns-crash with test case which causes a crash, here the change: https://github.com/graetzer/GDataXML-HTML/commit/84fac869aa1341e3a5f120f3aed26df429a4942a

I'm recursively reading all the nodes in order to cause it

UPD: And regarding debugger - I have no idea :)

graetzer commented 8 years ago

So technically there is not much that happens inside the GDataXMLNode class right? The - (NSArray*)children method extracts all the child nodes from the xmlNodePtr structure and wraps them in a GDataXMLNode object.

This could a bug in the libxml2 library or we're not allowed to use this field's value in this case. What's the value of xmlNode_->type where it crashes?

artemyarulin commented 8 years ago
(lldb) po xmlNode_->name
"xml"

(lldb) po xmlNode_->type
XML_ENTITY_DECL

So yeah, I also wasn't able to find any code which creates this ns->prefix so I assume it is coming from libxml

graetzer commented 8 years ago

Maybe we should special handle it like the XML_NAMESPACE_DECL and just return the name? Not sure what the proper qualified name is for an xml entity

artemyarulin commented 8 years ago

Yep it fixes a problem - I've created a pull request #15, have a look whenever you have time