sparklemotion / nokogiri

Nokogiri (鋸) makes it easy and painless to work with XML and HTML from Ruby.
https://nokogiri.org/
MIT License

Segfault in parsing script or style tag #238

Closed ghost closed 14 years ago

ghost commented 14 years ago

This looks to me like a libxml2 bug, but I can't confirm that, and I'm hitting it through Nokogiri 1.4.1, so I'm hoping you can help. When linked against libxml2 2.7.3, Nokogiri successfully parses the page http://starlite.ru. When linked against 2.7.4, 2.7.5, or 2.7.6, it segfaults. It appears to run far past the end of the buffer, because it takes a few seconds before crashing. Here is the GDB backtrace from Nokogiri linked against 2.7.5:

#0  htmlCurrentChar (ctxt=0x760c50, len=0x7ffffffeefcc) at HTMLparser.c:385
#1  0x00007ffff495a7a2 in htmlParseScript (ctxt=0x760c50) at HTMLparser.c:2833
#2  0x00007ffff495e7d0 in htmlParseContent (ctxt=0x760c50) at HTMLparser.c:4000
#3  0x00007ffff495ecf5 in htmlParseElement__internal_alias (ctxt=0x760c50) at HTMLparser.c:4197
#4  0x00007ffff495e75f in htmlParseContent (ctxt=0x760c50) at HTMLparser.c:4035
#5  0x00007ffff495ecf5 in htmlParseElement__internal_alias (ctxt=0x760c50) at HTMLparser.c:4197
#6  0x00007ffff495e75f in htmlParseContent (ctxt=0x760c50) at HTMLparser.c:4035
#7  0x00007ffff495ecf5 in htmlParseElement__internal_alias (ctxt=0x760c50) at HTMLparser.c:4197
#8  0x00007ffff495e75f in htmlParseContent (ctxt=0x760c50) at HTMLparser.c:4035
#9  0x00007ffff495eec7 in htmlParseDocument__internal_alias (ctxt=0x760c50) at HTMLparser.c:4324
#10 0x00007ffff495f1f8 in htmlDoRead (ctxt=0x760c50, URL=0x0, encoding=0x0, options=, reuse=0) at HTMLparser.c:6233
#11 0x00007ffff5092581 in read_memory (klass=140737324138360, string=140737313867040, url=4, encoding=4, options=4291) at html_document.c:95
#12 0x00007ffff7b1a041 in ?? () from /usr/lib/libruby1.8.so.1.8
#13 0x00007ffff7b1a233 in ?? () from /usr/lib/libruby1.8.so.1.8
#14 0x00007ffff7b16d4f in ?? () from /usr/lib/libruby1.8.so.1.8
#15 0x00007ffff7b19ee3 in ?? () from /usr/lib/libruby1.8.so.1.8
#16 0x00007ffff7b1a233 in ?? () from /usr/lib/libruby1.8.so.1.8
#17 0x00007ffff7b17031 in ?? () from /usr/lib/libruby1.8.so.1.8
#18 0x00007ffff7b24c8c in ?? () from /usr/lib/libruby1.8.so.1.8
#19 0x00007ffff7b15f75 in ?? () from /usr/lib/libruby1.8.so.1.8
#20 0x00007ffff7b19ee3 in ?? () from /usr/lib/libruby1.8.so.1.8
#21 0x00007ffff7b1a233 in ?? () from /usr/lib/libruby1.8.so.1.8
#22 0x00007ffff7b17031 in ?? () from /usr/lib/libruby1.8.so.1.8
#23 0x00007ffff7b19ee3 in ?? () from /usr/lib/libruby1.8.so.1.8
#24 0x00007ffff7b1a233 in ?? () from /usr/lib/libruby1.8.so.1.8
#25 0x00007ffff7b16d4f in ?? () from /usr/lib/libruby1.8.so.1.8
#26 0x00007ffff7b14c06 in ?? () from /usr/lib/libruby1.8.so.1.8
#27 0x00007ffff7b19ee3 in ?? () from /usr/lib/libruby1.8.so.1.8
#28 0x00007ffff7b1a233 in ?? () from /usr/lib/libruby1.8.so.1.8
#29 0x00007ffff7b17031 in ?? () from /usr/lib/libruby1.8.so.1.8
#30 0x00007ffff7b14c6c in ?? () from /usr/lib/libruby1.8.so.1.8
#31 0x00007ffff7b18307 in ?? () from /usr/lib/libruby1.8.so.1.8
#32 0x00007ffff7b146fa in ?? () from /usr/lib/libruby1.8.so.1.8
#33 0x00007ffff7b17491 in ?? () from /usr/lib/libruby1.8.so.1.8
#34 0x00007ffff7b19ee3 in ?? () from /usr/lib/libruby1.8.so.1.8
#35 0x00007ffff7b1a233 in ?? () from /usr/lib/libruby1.8.so.1.8
#36 0x00007ffff7b16d4f in ?? () from /usr/lib/libruby1.8.so.1.8
#37 0x00007ffff7b17976 in ?? () from /usr/lib/libruby1.8.so.1.8
#38 0x00007ffff7b19ee3 in ?? () from /usr/lib/libruby1.8.so.1.8
#39 0x00007ffff7b1a233 in ?? () from /usr/lib/libruby1.8.so.1.8
#40 0x00007ffff7b16d4f in ?? () from /usr/lib/libruby1.8.so.1.8
#41 0x00007ffff7b18307 in ?? () from /usr/lib/libruby1.8.so.1.8
#42 0x00007ffff7b24484 in rb_yield_values () from /usr/lib/libruby1.8.so.1.8
#43 0x00007ffff7b0744d in ?? () from /usr/lib/libruby1.8.so.1.8
#44 0x00007ffff7b18685 in ?? () from /usr/lib/libruby1.8.so.1.8
#45 0x00007ffff7b055ef in ?? () from /usr/lib/libruby1.8.so.1.8
#46 0x00007ffff7b1a041 in ?? () from /usr/lib/libruby1.8.so.1.8
#47 0x00007ffff7b1a233 in ?? () from /usr/lib/libruby1.8.so.1.8
#48 0x00007ffff7b1af18 in ?? () from /usr/lib/libruby1.8.so.1.8
#49 0x00007ffff7b1b1c5 in rb_funcall () from /usr/lib/libruby1.8.so.1.8
#50 0x00007ffff7b105a8 in rb_iterate () from /usr/lib/libruby1.8.so.1.8
#51 0x00007ffff7b06945 in ?? () from /usr/lib/libruby1.8.so.1.8
#52 0x00007ffff7b1a041 in ?? () from /usr/lib/libruby1.8.so.1.8
#53 0x00007ffff7b1a233 in ?? () from /usr/lib/libruby1.8.so.1.8
#54 0x00007ffff7b17031 in ?? () from /usr/lib/libruby1.8.so.1.8
#55 0x00007ffff7b17976 in ?? () from /usr/lib/libruby1.8.so.1.8
#56 0x00007ffff7b2712b in ?? () from /usr/lib/libruby1.8.so.1.8
#57 0x00007ffff7b27175 in ruby_exec () from /usr/lib/libruby1.8.so.1.8
#58 0x00007ffff7b271a5 in ruby_run () from /usr/lib/libruby1.8.so.1.8
#59 0x0000000000400911 in main ()

Here is an inspection of the current parameters:

(gdb) print *ctxt
$1 = {sax = 0x761820, userData = 0x760c50, myDoc = 0x8012d0, wellFormed = 0, replaceEntities = 0, version = 0x0, encoding = 0x0, 
  standalone = -1, html = 10, input = 0x761ad0, inputNr = 1, inputMax = 5, inputTab = 0x761930, node = 0xa974a0, nodeNr = 3, nodeMax = 40, 
  nodeTab = 0x8879b0, record_info = 0, node_seq = {maximum = 0, length = 0, buffer = 0x0}, errNo = 9, hasExternalSubset = 0, hasPErefs = 0, 
  external = 0, valid = 1, validate = 0, vctxt = {userData = 0x760c50, error = 0, warning = 0, node = 0x0, nodeNr = 0, nodeMax = 0, 
    nodeTab = 0x0, finishDtd = 2882343476, doc = 0x0, valid = 1, vstate = 0x0, vstateNr = 0, vstateMax = 0, vstateTab = 0x0, am = 0x0, 
    state = 0x0}, instate = XML_PARSER_START, token = 0, directory = 0x0, name = 0x7dad96 "script", nameNr = 3, nameMax = 40, 
  nameTab = 0x887860, nbChars = 556926, checkIndex = 0, keepBlanks = 1, disableSAX = 0, inSubset = 0, intSubName = 0x0, extSubURI = 0x0, 
  extSubSystem = 0x0, space = 0x761a20, spaceNr = 1, spaceMax = 10, spaceTab = 0x761a20, depth = 0, entity = 0x0, charset = 1, nodelen = 2, 
  nodemem = -1, pedantic = 0, _private = 0x0, loadsubset = 0, linenumbers = 1, catalogs = 0x0, recovery = 1, progressive = 0, dict = 0x760f20, 
  atts = 0x7c2f00, maxatts = 44, docdict = 0, str_xml = 0x0, str_xmlns = 0x0, str_xml_ns = 0x0, sax2 = 0, nsNr = 0, nsMax = 0, nsTab = 0x0, 
  attallocs = 0x0, pushTab = 0x0, attsDefault = 0x0, attsSpecial = 0x0, nsWellFormed = 1, options = 96, dictNames = 0, freeElemsNr = 0, 
  freeElems = 0x0, freeAttrsNr = 0, freeAttrs = 0x0, lastError = {domain = 5, code = 9, message = 0x11a37890 "Char 0x0 out of allowed range\n", 
    level = XML_ERR_ERROR, file = 0x0, line = 173, str1 = 0x0, str2 = 0x0, str3 = 0x0, int1 = 0, int2 = 544759, ctxt = 0x760c50, node = 0x0}, 
  parseMode = XML_PARSE_UNKNOWN, nbentities = 0, sizeentities = 0}
(gdb) print *len
$2 = 1

It appears to me that an error has been registered (lastError reads "Char 0x0 out of allowed range") but the parser just keeps running.
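Since lastError mentions "Char 0x0", one cheap check is whether the raw page actually contains NUL bytes and on which lines. The following is only a sketch: `nul_byte_lines` is a hypothetical helper name, and it assumes (without confirmation from the dump) that the NUL byte is related to the crash.

```ruby
# Hypothetical helper: returns the 1-based line numbers of +html+ that
# contain at least one NUL byte. Works on a binary copy of the input so
# that invalid or mixed encodings cannot raise.
def nul_byte_lines(html)
  html.b.each_line.with_index(1)
      .select { |line, _number| line.include?("\x00") }
      .map { |_line, number| number }
end

# Made-up sample input, not the real page.
sample = "<html>\n<script>var a = 1;\x00</script>\n</html>\n"
p nul_byte_lines(sample) # => [2]
```

Running this over the downloaded page and comparing the result with `line = 173` from lastError would show whether the two line up.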

Cheers!

tenderlove commented 14 years ago

blech. looks terrible. Thanks for the bug report!

tenderlove commented 14 years ago

I can't reproduce this against libxml2 version 2.7.7. Can you try that version? Also, can you add a sample script to reproduce the problem?

ghost commented 14 years ago

Ah, it seems 2.7.7 was released on the 15th. I'll pull it down, give it a try and post results. Thanks!

ghost commented 14 years ago

I likewise am unable to replicate this issue with libxml2 2.7.7 installed. It seems it was an issue in the library, not Nokogiri. I really appreciate you looking at this. I consider this closed. Cheers!
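For anyone who cannot upgrade right away, a guard along these lines could flag the versions reported as crashing here. It is a minimal sketch: `AFFECTED_LIBXML2` and `affected_libxml2?` are hypothetical names, and the range covers only what was tested in this thread (2.7.3 and 2.7.7 worked, 2.7.4 through 2.7.6 crashed). Other versions were not checked.

```ruby
require "rubygems" # provides Gem::Version

# Versions reported in this thread as segfaulting on the page.
# Anything outside this range is untested here, not known-good.
AFFECTED_LIBXML2 = (Gem::Version.new("2.7.4")..Gem::Version.new("2.7.6"))

# Returns true when +version_string+ falls inside the reported range.
def affected_libxml2?(version_string)
  AFFECTED_LIBXML2.cover?(Gem::Version.new(version_string))
end

# With Nokogiri loaded, the version string could presumably come from
# Nokogiri::VERSION_INFO["libxml"]["loaded"]; that lookup is an assumption.
%w[2.7.3 2.7.5 2.7.7].each do |version|
  puts "libxml2 #{version}: affected=#{affected_libxml2?(version)}"
end
```

Comparing with `Gem::Version` rather than plain strings avoids misordering multi-digit components such as 2.7.10.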

tenderlove commented 14 years ago

Great! Thanks! http://ihighfive.com/