sparklemotion / nokogiri

Nokogiri (鋸) makes it easy and painless to work with XML and HTML from Ruby.
https://nokogiri.org/
MIT License
6.15k stars 899 forks source link

SIGSEGV while parsing XML using Nokogiri::XML::Reader #439

Closed packetmonkey closed 13 years ago

packetmonkey commented 13 years ago

When parsing a large ( 7.2 gig ) XMl file using Nokogiri::XML::Reader, I can reliably cause a segfault. Below is the rubinius crash report. I can generate a crash using matz ruby 1.8.7-p330, 1.9.2-p136 and rubinius 1.2.4dev. Below is the crash report generated by rubinius. I also can generate this error in both the latest stable version as well as version 1.5.0.beta.4 (which is what I used to generate this crash report).

Rubinius Crash Report #rbxcrashreport

Error: signal SIGSEGV

[[Backtrace]]
0   ruby                                0x0000000100021f20 _ZN8rubiniusL12segv_handlerEi + 160
1   libSystem.B.dylib                   0x00007fff879c367a _sigtramp + 26
2   ???                                 0x00007fff5fbf9890 0x0 + 140734799779984
3   ruby                                0x000000010009d50a _ZN8rubinius10Primitives11symbol_to_sEPNS_2VMEPNS_9CallFrameERNS_8DispatchERNS_9ArgumentsE + 218
4   ruby                                0x00000001000349cc _ZN8rubinius8VMMethod11interpreterEPNS_2VMEPS0_PNS_20InterpreterCallFrameE + 412
5   ruby                                0x00000001001152fb _ZN8rubinius8VMMethod19execute_specializedINS_16GenericArgumentsEEEPNS_6ObjectEPNS_2VMEPNS_9CallFrameERNS_8DispatchERNS_9ArgumentsE + 699
6   ruby                                0x0000000100037aa0 _ZN8rubinius8VMMethod11interpreterEPNS_2VMEPS0_PNS_20InterpreterCallFrameE + 12912
7   ruby                                0x000000010011b5bf _ZN8rubinius16BlockEnvironment19execute_interpreterEPNS_2VMEPNS_9CallFrameEPS0_RNS_9ArgumentsERNS_15BlockInvocationE + 495
8   ruby                                0x000000010011b8c4 _ZN8rubinius16BlockEnvironment4callEPNS_2VMEPNS_9CallFrameERNS_9ArgumentsEi + 68
9   ruby                                0x000000010003926f _ZN8rubinius8VMMethod11interpreterEPNS_2VMEPS0_PNS_20InterpreterCallFrameE + 19007
10  ruby                                0x000000010011b5bf _ZN8rubinius16BlockEnvironment19execute_interpreterEPNS_2VMEPNS_9CallFrameEPS0_RNS_9ArgumentsERNS_15BlockInvocationE + 495
11  ruby                                0x000000010011b8c4 _ZN8rubinius16BlockEnvironment4callEPNS_2VMEPNS_9CallFrameERNS_9ArgumentsEi + 68
12  ruby                                0x000000010021890b rbx_yield_stack + 235
13  ???                                 0x0000000102c48e00 0x0 + 4341403136
14  ruby                                0x00000001000379c1  _ZN8rubinius8VMMethod11interpreterEPNS_2VMEPS0_PNS_20InterpreterCallFrameE + 12689
15  ruby                                0x0000000100114e90 _ZN8rubinius8VMMethod19execute_specializedINS_11NoArgumentsEEEPNS_6ObjectEPNS_2VMEPNS_9CallFrameERNS_8DispatchERNS_9ArgumentsE + 480
16  ruby                                0x00000001000379c1 _ZN8rubinius8VMMethod11interpreterEPNS_2VMEPS0_PNS_20InterpreterCallFrameE + 12689
17  ruby                                0x0000000100114e90 _ZN8rubinius8VMMethod19execute_specializedINS_11NoArgumentsEEEPNS_6ObjectEPNS_2VMEPNS_9CallFrameERNS_8DispatchERNS_9ArgumentsE + 480
18  ruby                                0x0000000100037aa0 _ZN8rubinius8VMMethod11interpreterEPNS_2VMEPS0_PNS_20InterpreterCallFrameE + 12912
19  ruby                                0x000000010011631f _ZN8rubinius8VMMethod19execute_specializedINS_11OneArgumentEEEPNS_6ObjectEPNS_2VMEPNS_9CallFrameERNS_8DispatchERNS_9ArgumentsE + 495
20  ruby                                0x0000000100037aa0 _ZN8rubinius8VMMethod11interpreterEPNS_2VMEPS0_PNS_20InterpreterCallFrameE + 12912
21  ruby                                0x000000010011b5bf _ZN8rubinius16BlockEnvironment19execute_interpreterEPNS_2VMEPNS_9CallFrameEPS0_RNS_9ArgumentsERNS_15BlockInvocationE + 495
22  ruby                                0x000000010011b8c4 _ZN8rubinius16BlockEnvironment4callEPNS_2VMEPNS_9CallFrameERNS_9ArgumentsEi + 68
23  ruby                                0x000000010016e585 _ZN8rubinius4Proc9call_primEPNS_2VMEPNS_10ExecutableEPNS_9CallFrameERNS_8DispatchERNS_9ArgumentsE + 469
24  ruby                                0x00000001000ac493 _ZN8rubinius10Primitives9proc_callEPNS_2VMEPNS_9CallFrameERNS_8DispatchERNS_9ArgumentsE + 195
25  ruby                                0x0000000100037aa0 _ZN8rubinius8VMMethod11interpreterEPNS_2VMEPS0_PNS_20InterpreterCallFrameE + 12912
26  ruby                                0x0000000100114e90 _ZN8rubinius8VMMethod19execute_specializedINS_11NoArgumentsEEEPNS_6ObjectEPNS_2VMEPNS_9CallFrameERNS_8DispatchERNS_9ArgumentsE + 480
27  ruby                                0x00000001001211bb _ZN8rubinius14CompiledMethod16default_executorEPNS_2VMEPNS_9CallFrameERNS_8DispatchERNS_9ArgumentsE + 155
28  ruby                                0x00000001000379c1 _ZN8rubinius8VMMethod11interpreterEPNS_2VMEPS0_PNS_20InterpreterCallFrameE + 12689
29  ruby                                0x0000000100114e90 _ZN8rubinius8VMMethod19execute_specializedINS_11NoArgumentsEEEPNS_6ObjectEPNS_2VMEPNS_9CallFrameERNS_8DispatchERNS_9ArgumentsE + 480
30  ruby                                0x00000001001211bb _ZN8rubinius14CompiledMethod16default_executorEPNS_2VMEPNS_9CallFrameERNS_8DispatchERNS_9ArgumentsE + 155
31  ruby                                0x0000000100141591 _ZN8rubinius6Object9send_primEPNS_2VMEPNS_10ExecutableEPNS_9CallFrameERNS_8DispatchERNS_9ArgumentsE + 257
32  ruby                                0x00000001000b5dfb _ZN8rubinius10Primitives11object_sendEPNS_2VMEPNS_9CallFrameERNS_8DispatchERNS_9ArgumentsE + 123
33  ruby                                0x0000000100037aa0 _ZN8rubinius8VMMethod11interpreterEPNS_2VMEPS0_PNS_20InterpreterCallFrameE + 12912
34  ruby                                0x000000010011631f _ZN8rubinius8VMMethod19execute_specializedINS_11OneArgumentEEEPNS_6ObjectEPNS_2VMEPNS_9CallFrameERNS_8DispatchERNS_9ArgumentsE + 495
35  ruby                                0x00000001001211bb _ZN8rubinius14CompiledMethod16default_executorEPNS_2VMEPNS_9CallFrameERNS_8DispatchERNS_9ArgumentsE + 155
36  ruby                                0x0000000100037aa0 _ZN8rubinius8VMMethod11interpreterEPNS_2VMEPS0_PNS_20InterpreterCallFrameE + 12912
37  ruby                                0x00000001001152fb _ZN8rubinius8VMMethod19execute_specializedINS_16GenericArgumentsEEEPNS_6ObjectEPNS_2VMEPNS_9CallFrameERNS_8DispatchERNS_9ArgumentsE + 699
38  ruby                                0x00000001001211bb _ZN8rubinius14CompiledMethod16default_executorEPNS_2VMEPNS_9CallFrameERNS_8DispatchERNS_9ArgumentsE + 155
39  ruby                                0x0000000100037aa0 _ZN8rubinius8VMMethod11interpreterEPNS_2VMEPS0_PNS_20InterpreterCallFrameE + 12912
40  ruby                                0x0000000100114e90 _ZN8rubinius8VMMethod19execute_specializedINS_11NoArgumentsEEEPNS_6ObjectEPNS_2VMEPNS_9CallFrameERNS_8DispatchERNS_9ArgumentsE + 480
41  ruby                                0x00000001001211bb _ZN8rubinius14CompiledMethod16default_executorEPNS_2VMEPNS_9CallFrameERNS_8DispatchERNS_9ArgumentsE + 155
42  ruby                                0x00000001000364eb _ZN8rubinius8VMMethod11interpreterEPNS_2VMEPS0_PNS_20InterpreterCallFrameE + 7355
43  ruby                                0x0000000100114e90 _ZN8rubinius8VMMethod19execute_specializedINS_11NoArgumentsEEEPNS_6ObjectEPNS_2VMEPNS_9CallFrameERNS_8DispatchERNS_9ArgumentsE + 480
44  ruby                                0x00000001001211bb _ZN8rubinius14CompiledMethod16default_executorEPNS_2VMEPNS_9CallFrameERNS_8DispatchERNS_9ArgumentsE + 155
45  ruby                                0x0000000100037aa0 _ZN8rubinius8VMMethod11interpreterEPNS_2VMEPS0_PNS_20InterpreterCallFrameE + 12912
46  ruby                                0x0000000100114e90 _ZN8rubinius8VMMethod19execute_specializedINS_11NoArgumentsEEEPNS_6ObjectEPNS_2VMEPNS_9CallFrameERNS_8DispatchERNS_9ArgumentsE + 480
47  ruby                                0x00000001001211bb _ZN8rubinius14CompiledMethod16default_executorEPNS_2VMEPNS_9CallFrameERNS_8DispatchERNS_9ArgumentsE + 155
48  ruby                                0x0000000100037aa0 _ZN8rubinius8VMMethod11interpreterEPNS_2VMEPS0_PNS_20InterpreterCallFrameE + 12912
49  ruby                                0x0000000100114e90 _ZN8rubinius8VMMethod19execute_specializedINS_11NoArgumentsEEEPNS_6ObjectEPNS_2VMEPNS_9CallFrameERNS_8DispatchERNS_9ArgumentsE + 480
50  ruby                                0x00000001001211bb _ZN8rubinius14CompiledMethod16default_executorEPNS_2VMEPNS_9CallFrameERNS_8DispatchERNS_9ArgumentsE + 155
51  ruby                                0x000000010001f20f _ZN8rubinius12CompiledFile7executeEPNS_2VME + 351
52  ruby                                0x000000010002261c _ZN8rubinius11Environment8run_fileESs + 396
53  ruby                                0x000000010002535c _ZN8rubinius11Environment19run_from_filesystemESs + 252
54  ruby                                0x00000001001bdff6 main + 646
55  ruby                                0x0000000100013ca8 start + 52

[[System Info]]
sysname: Darwin
nodename: Evans-iMac.local
release: 10.6.0
version: Darwin Kernel Version 10.6.0: Wed Nov 10 18:13:17 PST 2010; root:xnu-1504.9.26~3/RELEASE_I386
machine: i386
packetmonkey commented 13 years ago

Would it be helpful if I posted the backtraces from the other ruby implementations?

flavorjones commented 13 years ago

Greetings!

Thanks very much for reporting this issue. Here's what we need to make it actionable:

  1. the output from nokogiri -v (which will tell us what version of libxml you're running, among other things)
  2. a script to reproduce this error, along with any data that you're loading

If we can't reproduce it, we can't fix it! Thanks.

packetmonkey commented 13 years ago

Here is the nokogiri -v information. Unfortunately the data set I'm currently using is a 7.2 gig xml export of a production database that I can't readily share (not to mention it's huge). I'll try to narrow down what i'm doing in the script to trigger the action and possibly try to hunt down the data in the xml that's causing it from that side.

--- 
warnings: []

nokogiri: 1.5.0.beta.4
libxml: 
  loaded: 2.7.3
  binding: extension
  compiled: 2.7.3
ruby: 
  platform: x86_64-apple-darwin10.6.0
  version: 1.8.7
  engine: rbx
flavorjones commented 13 years ago

Thanks. If reproducing with less data proves difficult, it would probably be useful to run your code under valgrind, and send us the output. I like to do:

valgrind --partial-loads-ok=yes --undef-value-errors=no ruby your-script.rb 2>&1 | tee valgrind.log
flavorjones commented 13 years ago

Also, feel free to contact me off-the-record if you need help. mike.dalessio@gmail.com

packetmonkey commented 13 years ago

Here is the valgrind output. I used matz 1.9.2 because it had less noise than rubinius.

==11129== Memcheck, a memory error detector
==11129== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==11129== Using Valgrind-3.6.0 and LibVEX; rerun with -h for copyright info
==11129== Command: ruby measuring_cup_load.rb /Users/evansharp/Desktop/Storefront_20110307_DB.xml
==11129== 
--11129-- /Users/evansharp/.rvm/rubies/ruby-1.9.2-p136/bin/ruby:
--11129-- dSYM directory is missing; consider using --dsymutil=yes
--11129-- /Users/evansharp/.rvm/rubies/ruby-1.9.2-p136/lib/libruby.1.9.1.dylib:
--11129-- dSYM directory is missing; consider using --dsymutil=yes
==11129== Warning: ignored attempt to set SIGUSR2 handler in sigaction();
==11129==          the SIGUSR2 signal is used internally by Valgrind
UNKNOWN __pthread_sigmask is unsupported. This warning will not be repeated.
--11129-- /Users/evansharp/.rvm/rubies/ruby-1.9.2-p136/lib/ruby/1.9.1/x86_64-darwin10.6.0/enc/encdb.bundle:
--11129-- dSYM directory is missing; consider using --dsymutil=yes
--11129-- /Users/evansharp/.rvm/rubies/ruby-1.9.2-p136/lib/ruby/1.9.1/x86_64-darwin10.6.0/enc/trans/transdb.bundle:
--11129-- dSYM directory is missing; consider using --dsymutil=yes
--11129-- /Users/evansharp/.rvm/rubies/ruby-1.9.2-p136/lib/ruby/1.9.1/x86_64-darwin10.6.0/etc.bundle:
--11129-- dSYM directory is missing; consider using --dsymutil=yes
--11129-- /Users/evansharp/.rvm/gems/ruby-1.9.2-p136/gems/nokogiri-1.4.4/lib/nokogiri/nokogiri.bundle:
--11129-- dSYM directory is missing; consider using --dsymutil=yes
--11129-- /Users/evansharp/.rvm/rubies/ruby-1.9.2-p136/lib/ruby/1.9.1/x86_64-darwin10.6.0/stringio.bundle:
--11129-- dSYM directory is missing; consider using --dsymutil=yes
--11129-- /Users/evansharp/.rvm/rubies/ruby-1.9.2-p136/lib/ruby/1.9.1/x86_64-darwin10.6.0/racc/cparse.bundle:
--11129-- dSYM directory is missing; consider using --dsymutil=yes
--11129-- /Users/evansharp/.rvm/rubies/ruby-1.9.2-p136/lib/ruby/1.9.1/x86_64-darwin10.6.0/strscan.bundle:
--11129-- dSYM directory is missing; consider using --dsymutil=yes

**Notice: C extension not loaded. This is required for optimum MongoDB Ruby driver performance.
  You can install the extension as follows:
  gem install bson_ext

  If you continue to receive this message after installing, make sure that the
  bson_ext gem is in your load path and that the bson_ext and mongo gems are of the same version.

--11129-- /Users/evansharp/.rvm/rubies/ruby-1.9.2-p136/lib/ruby/1.9.1/x86_64-darwin10.6.0/socket.bundle:
--11129-- dSYM directory is missing; consider using --dsymutil=yes
--11129-- /Users/evansharp/.rvm/rubies/ruby-1.9.2-p136/lib/ruby/1.9.1/x86_64-darwin10.6.0/digest/md5.bundle:
--11129-- dSYM directory is missing; consider using --dsymutil=yes
--11129-- /Users/evansharp/.rvm/rubies/ruby-1.9.2-p136/lib/ruby/1.9.1/x86_64-darwin10.6.0/digest.bundle:
--11129-- dSYM directory is missing; consider using --dsymutil=yes
==11129== Invalid read of size 1
==11129==    at 0x10012869A: st_lookup (in /Users/evansharp/.rvm/rubies/ruby-1.9.2-p136/lib/libruby.1.9.1.dylib)
==11129==    by 0x10018DF60: vm_exec_core (in /Users/evansharp/.rvm/rubies/ruby-1.9.2-p136/lib/libruby.1.9.1.dylib)
==11129==    by 0x100190AE2: vm_exec (in /Users/evansharp/.rvm/rubies/ruby-1.9.2-p136/lib/libruby.1.9.1.dylib)
==11129==    by 0x10019DF8F: rb_yield (in /Users/evansharp/.rvm/rubies/ruby-1.9.2-p136/lib/libruby.1.9.1.dylib)
==11129==    by 0x10002ADC1: rb_ary_each (in /Users/evansharp/.rvm/rubies/ruby-1.9.2-p136/lib/libruby.1.9.1.dylib)
==11129==    by 0x10019CC92: vm_call_method (in /Users/evansharp/.rvm/rubies/ruby-1.9.2-p136/lib/libruby.1.9.1.dylib)
==11129==    by 0x100188E43: vm_exec_core (in /Users/evansharp/.rvm/rubies/ruby-1.9.2-p136/lib/libruby.1.9.1.dylib)
==11129==    by 0x100190AE2: vm_exec (in /Users/evansharp/.rvm/rubies/ruby-1.9.2-p136/lib/libruby.1.9.1.dylib)
==11129==    by 0x100191E5C: rb_vm_invoke_proc (in /Users/evansharp/.rvm/rubies/ruby-1.9.2-p136/lib/libruby.1.9.1.dylib)
==11129==    by 0x10019CC92: vm_call_method (in /Users/evansharp/.rvm/rubies/ruby-1.9.2-p136/lib/libruby.1.9.1.dylib)
==11129==    by 0x100188E43: vm_exec_core (in /Users/evansharp/.rvm/rubies/ruby-1.9.2-p136/lib/libruby.1.9.1.dylib)
==11129==    by 0x100190AE2: vm_exec (in /Users/evansharp/.rvm/rubies/ruby-1.9.2-p136/lib/libruby.1.9.1.dylib)
==11129==  Address 0x10 is not stack'd, malloc'd or (recently) free'd
==11129== 
==11129== 
==11129== Process terminating with default action of signal 11 (SIGSEGV)
==11129==  General Protection Fault
==11129==    at 0x1003E4299: dyld_stub_binder (in /usr/lib/libSystem.B.dylib)
==11129==    by 0x10023300F: ??? (in /Users/evansharp/.rvm/rubies/ruby-1.9.2-p136/lib/libruby.1.9.1.dylib)
==11129==    by 0x13804E323: ???
==11129==    by 0x10018DF60: vm_exec_core (in /Users/evansharp/.rvm/rubies/ruby-1.9.2-p136/lib/libruby.1.9.1.dylib)
==11129==    by 0x100190AE2: vm_exec (in /Users/evansharp/.rvm/rubies/ruby-1.9.2-p136/lib/libruby.1.9.1.dylib)
==11129==    by 0x10019DF8F: rb_yield (in /Users/evansharp/.rvm/rubies/ruby-1.9.2-p136/lib/libruby.1.9.1.dylib)
==11129==    by 0x10002ADC1: rb_ary_each (in /Users/evansharp/.rvm/rubies/ruby-1.9.2-p136/lib/libruby.1.9.1.dylib)
==11129==    by 0x10019CC92: vm_call_method (in /Users/evansharp/.rvm/rubies/ruby-1.9.2-p136/lib/libruby.1.9.1.dylib)
==11129==    by 0x100188E43: vm_exec_core (in /Users/evansharp/.rvm/rubies/ruby-1.9.2-p136/lib/libruby.1.9.1.dylib)
==11129==    by 0x100190AE2: vm_exec (in /Users/evansharp/.rvm/rubies/ruby-1.9.2-p136/lib/libruby.1.9.1.dylib)
==11129==    by 0x100191E5C: rb_vm_invoke_proc (in /Users/evansharp/.rvm/rubies/ruby-1.9.2-p136/lib/libruby.1.9.1.dylib)
==11129==    by 0x10019CC92: vm_call_method (in /Users/evansharp/.rvm/rubies/ruby-1.9.2-p136/lib/libruby.1.9.1.dylib)
--11129:0:schedule VG_(sema_down): read returned -4
==11129== 
==11129== HEAP SUMMARY:
==11129==     in use at exit: 9,947,158 bytes in 53,831 blocks
==11129==   total heap usage: 241,683 allocs, 187,852 frees, 33,966,073 bytes allocated
==11129== 
==11129== LEAK SUMMARY:
==11129==    definitely lost: 20 bytes in 2 blocks
==11129==    indirectly lost: 0 bytes in 0 blocks
==11129==      possibly lost: 0 bytes in 0 blocks
==11129==    still reachable: 9,947,138 bytes in 53,829 blocks
==11129==         suppressed: 0 bytes in 0 blocks
==11129== Rerun with --leak-check=full to see details of leaked memory
==11129== 
==11129== For counts of detected and suppressed errors, rerun with: -v
==11129== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
packetmonkey commented 13 years ago

After a bit of work I have managed to narrow down where in the XML file slightly. The first round of tests showed the segfault occurring at the 56909th user record which is a little over 57k lines into this 20794220 line xml file. However a later test on an almost functionally identical script had the crash occurring after user 56121.

Changes like the ones I show below also seem to change where the script crashes. In this case, allowing the script to parse the full XML file (although in all fairness I don't do any work with what is parsed later in the file in this test script).

reader = Nokogiri::XML::Reader File.new('../Storefront_20110307_DB.xml', 'r')
reader.each do |node|
  import_identity(node) and next if node.name == 'PFWeb::Identities__Row'
end

Changes to

reader = Nokogiri::XML::Reader File.new('../Storefront_20110307_DB.xml', 'r')
reader.each_with_index do |node,i|
    puts "Line: #{i+1}"
    import_identity(node) and next if node.name == 'PFWeb::Identities__Row'
end

Where import_identity is the code I use to parse and import that node, which hasn't changed during this revision. Commenting out the `puts "Line: #{i+1}" line causes the script to once again crash after user 56121.

Interesting isn't it? :)

I am however at a bit of a loss on how to continue to narrow down the problem as it doesn't appear to be an error in the actual XML parser so much as whatever internal memory management is done inside the compiled extension. How can I continue to help get this resolved?

flavorjones commented 13 years ago

Can you please provide valgrind output from using 1.8.7 on a Linux machine? Preferably using a debug build of ruby, but that's not absolutely necessary.

Again, useful valgrind options to reduce the "noisiness" of the output are:

valgrind --partial-loads-ok=yes --undef-value-errors=no ruby your-script.rb
packetmonkey commented 13 years ago

This was run using ruby 1.8.7 (2010-01-10 patchlevel 249) [x86_64-linux] on Ubuntu 10.04.1

==6859== Memcheck, a memory error detector
==6859== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.
==6859== Using Valgrind-3.6.0.SVN-Debian and LibVEX; rerun with -h for copyright info
==6859== Command: ruby nokoseg.rb
==6859== 
==6859== Invalid read of size 8
==6859==    at 0x4EC8E23: st_lookup (st.c:250)
==6859==    by 0x4E5FEDE: search_method (eval.c:486)
==6859==    by 0x4E5FF66: rb_get_method_body (eval.c:507)
==6859==    by 0x4E6B789: rb_call (eval.c:6150)
==6859==    by 0x4E68690: rb_eval (eval.c:3506)
==6859==    by 0x4E69DE8: rb_yield_0 (eval.c:5095)
==6859==    by 0x4E483F0: rb_ary_each (array.c:1261)
==6859==    by 0x4E6B663: rb_call0 (eval.c:5928)
==6859==    by 0x4E6B852: rb_call (eval.c:6176)
==6859==    by 0x4E68690: rb_eval (eval.c:3506)
==6859==    by 0x4E68FD5: rb_eval (eval.c:3236)
==6859==    by 0x4E6B512: rb_call0 (eval.c:6079)
==6859==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==6859== 
/var/lib/gems/1.8/gems/nokogiri-1.4.4/lib/nokogiri/xml/reader.rb:99: [BUG] Segmentation fault
ruby 1.8.7 (2010-01-10 patchlevel 249) [x86_64-linux]

==6859== 
==6859== HEAP SUMMARY:
==6859==     in use at exit: 5,964,486 bytes in 28,813 blocks
==6859==   total heap usage: 132,707 allocs, 103,894 frees, 20,165,464 bytes allocated
==6859== 
==6859== LEAK SUMMARY:
==6859==    definitely lost: 60 bytes in 1 blocks
==6859==    indirectly lost: 240 bytes in 10 blocks
==6859==      possibly lost: 4,232 bytes in 52 blocks
==6859==    still reachable: 5,959,954 bytes in 28,750 blocks
==6859==         suppressed: 0 bytes in 0 blocks
==6859== Rerun with --leak-check=full to see details of leaked memory
==6859== 
==6859== For counts of detected and suppressed errors, rerun with: -v
==6859== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
packetmonkey commented 13 years ago

On a whim I also toyed a bit with the ruby garbage collector and memory allocation settings. I don't think I was able to have much of an effect on the problem but then agan I was just throwing numbers at various settings to see what effect it had.

packetmonkey commented 13 years ago

An additional note, parsing the same document using a Nokogiri::XML::SAX::Document with Nokogiri::XML::SAX::Parser does not seem to cause a segfault. This leads me to think the problem isn't with the handling of the XML itself, but internal memory management inside the Reader class.

flavorjones commented 13 years ago

If you want us to be able to fix this issue, we need to be able to reproduce it. Can you help us reproduce it?

packetmonkey commented 13 years ago

Anything I can do to help. It looks like the issue happens when the system runs out of free memory, so it could be an edge case when ruby can't allocate more memory. To test this I am going to create a VM with lower ram and try using a smaller XML file. I am fairly certain it has nothing to do with actually parsing the XML using nokogiri as mentioned in my previous comment.

flavorjones commented 13 years ago

Yes! You can help us by providing your code and helping us reproduce this issue. :)

If Reader is leaking memory, then I'd like to try to fix that. Or, if you're doing something in your code that uses the Reader node that's causing this, then I can probably tell you a workaround or fix the underlying issue.

Your code in previous comments calls a method import_identity. Can you show us that code? And can you share your XML, or a representative sample, or an obfuscated sample with the same structure?

If your code or data are sensitive, then there are other options, including contacting me off-list.

I'd love to help you. Please help me help you. :)

packetmonkey commented 13 years ago

I have managed to reduce the content and script to something reproducible. The segmentation fault does seem to trigger on machines with lower ram faster. I would run this script in a vm with a low amount of ram (256 megs), and it would crash. I would then run the exact same VM with more ram (512 megs or 1 gig) and it would still crash but at a much later time. Below are the steps to generate the XML file and the script which will induce the crash.

execute the following script as

$ ruby xmlgen.rb > test.xml

This script will generate a 322 meg XML file which mimics the schema used by the original database dump which first revealed the issue.

puts "<?xml version=\"1.0\"?>"
puts "<PFWeb:Database xmlns:PFWeb=\"http://www.pageflex.com/schemas/2004/Storefront/Database/20071115\" build-number=\"6.1.2.919\">"
puts "  <PFWeb:Identities__Table>"
500000.times do |i|
puts "    <PFWeb:Identities__Row PFWeb:IdentityID__ID=\"Identities_#{i}\" PFWeb:b_IsGroup=\"False\" PFWeb:b_GroupDisplaysAsList=\"False\" PFWeb:b_HasAccessToUserSite=\"True\" PFWeb:b_HasAccessToAdminSite=\"False\" PFWeb:b_IsActive=\"True\" PFWeb:b_IsArchived=\"True\" PFWeb:b_IsAnonymous=\"False\" PFWeb:b_PendingSelfRegistration=\"False\" PFWeb:LogonName__STR=\"dummy@example                                                                \" PFWeb:PasswordHashed__STR=\"111111111                                                                       \" PFWeb:DirectoryName__STR=\"dummy@example\" PFWeb:WhenCreated__ISO8601=\"2009-03-12T13:51:23\" PFWeb:WhenLastLoggedIn__ISO8601=\"2009-03-16T16:56:18\" />"
end
puts "  </PFWeb:Identities__Table>"
puts "</PFWeb:Database>"

Once you have the test.xml file generated, execute this script to parse and output some data from that XML file. This managed to even cause a segfault on my Macbook Pro with 4 gigs of ram after roughly 8 minutes, so I would assume it will cause a crash on just about any machine you run it on, however as I said it will crash faster inside a VM with very limited RAM.

require 'rubygems'
require 'nokogiri'
require 'ap'
reader = Nokogiri::XML::Reader File.new('test.xml')
reader.each do |node|
  tmp = Hash.new
  tmp['name'] = node.name
  tmp['id'] = node.attributes['IdentityID_ID']
  tmp['username'] = node.attributes['LogonName__STR']
  ap tmp
end

Please let me know if this managed to crash on your system so I know you can reproduce it as well. If not I will continue to try and provide less fragile test case but this version appears to be consistant (for me).

Hope this helps!

flavorjones commented 13 years ago

I've reproduced what appears to be your crash! Thanks for your help. I'll let you know what I dig up.

flavorjones commented 13 years ago

This appears to be related to the issue being worked around in https://github.com/tenderlove/nokogiri/issues/95

packetmonkey commented 13 years ago

There is nothing quite as fun as rediscovering a 1.5 year old bug :) Is there anything else I can do to help you narrow down what is going on?

flavorjones commented 13 years ago

Ha! Well, this isn't a regression, it's just related. The core issue is that libxml2 frees the underlying node struct out from under the ruby object (which retains a pointer to the C struct). There's nothing we can do to avoid the dangling pointer, so we're left with trying to work around dereferencing it.

I've got an idea I should be able to throw up on a branch tonight.

On Wed, Apr 20, 2011 at 12:31 PM, packetmonkey < reply@reply.github.com>wrote:

There is nothing quite as fun as rediscovering a 1.5 year old bug :) Is there anything else I can do to help you narrow down what is going on?

Reply to this email directly or view it on GitHub: https://github.com/tenderlove/nokogiri/issues/439#comment_1033730

flavorjones commented 13 years ago

The previous commit (on branch 439-fix-reader-node-marking) partly fixes your issue, in that the GC phase will no longer try to follow the dangling node pointer.

However, there's a bigger issue here around string references. Your example is saving a reference to node.name, which is a ruby string object wrapped around the libxml2 c-string pointer for the node. As the reader proceeds through the document, this c-string gets freed and the ruby string object is no longer valid, and will explode when it is GCed.

Need to talk to @tenderlove about a strategy here. Option 1 is for Nokogiri to strdup every string that gets returned from a reader node (and think very hard about related objects, such as attributes). Option 2 is to tell people (like yourself :)) to dup anything you're keeping a reference to, lest you segfault.

Option 1 is really what we should do, but it's going to be a bit of work.

Temporary workaround: dup the strings you're keeping around from the reader pass. I'm able to get your example to run cleanly by duping only node.name.

packetmonkey commented 13 years ago

I will give that a shot and confirm it works for me.

flavorjones commented 13 years ago

@packetmonkey - did you confirm that calling #dup on your Reader-returned strings addressed this issue?

packetmonkey commented 13 years ago

Whoa, sorry. I thought I had updated this ticket, my bad.

Yes duping the strings out of reader allowed my scripts to finish executing, even on the original parser with the 7 GB export as well as my contrived scripts that seemed to reproduce the issue.

nkriege commented 13 years ago

I am running into this same issue, and I can reproduce the segfault even if I dup the strings returned from the reader. @packetmonkey, I trimmed down your example a bit and turned it into the following unit test that I dropped into the test directory. Notice that I have added String#dup calls on all three of the strings coming out of the reader.

# -*- coding: utf-8 -*-
require "helper"

class TestReaderSegfault < Nokogiri::TestCase
  def test_crash
    xml = "<?xml version=\"1.0\"?>\n"
    xml += "<Database>\n"
    xml += "  <Identities>\n"
    100.times do |i|
      xml += "    <Row IdentityID=\"Identities_#{i}\" IsGroup=\"False\" GroupDisplaysAsList=\"False\" HasAccessToUserSite=\"True\" HasAccessToAdminSite=\"False\" IsActive=\"True\" IsArchived=\"True\" IsAnonymous=\"False\" PendingSelfRegistration=\"False\" LogonName=\"dummy@example\" PasswordHashed=\"111111111\" DirectoryName=\"dummy@example\" WhenCreated=\"2009-03-12T13:51:23\" WhenLastLoggedIn=\"2009-03-16T16:56:18\" />\n"
    end
    xml += "  </Identities>\n"
    xml += "</Database>\n"

    100.times do |ii|
      reader = Nokogiri::XML::Reader.from_memory(xml)

      while reader.read
        if reader.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
          tmp = Hash.new
          tmp['name'] = reader.name.dup
          tmp['id'] = reader.attributes['IdentityID'] ? reader.attributes['IdentityID'].dup : nil
          tmp['username'] = reader.attributes['LogonName'] ? reader.attributes['LogonName'].dup : nil
        end
      end
    end

  end
end

This consistently segfaults for me in master, the 439-fix-reader-node-marking branch, and the v1.4.4 tag. Here is an example segfault from when I was running in the 439-fix-reader-node-marking branch.

/home/nkriege/src/nokogiri/lib/nokogiri/xml/reader.rb:99: [BUG] Segmentation fault
ruby 1.9.2p180 (2011-02-18 revision 30909) [x86_64-linux]

-- control frame ----------
c:0021 p:0014 s:0080 b:0077 l:000068 d:000076 BLOCK  /home/nkriege/src/nokogiri/lib/nokogiri/xml/reader.rb:99
c:0020 p:---- s:0074 b:0074 l:000073 d:000073 FINISH
c:0019 p:---- s:0072 b:0072 l:000071 d:000071 CFUNC  :each
c:0018 p:0023 s:0069 b:0069 l:000068 d:000068 METHOD /home/nkriege/src/nokogiri/lib/nokogiri/xml/reader.rb:99
c:0017 p:0018 s:0065 b:0064 l:000063 d:000063 METHOD /home/nkriege/src/nokogiri/lib/nokogiri/xml/reader.rb:90
c:0016 p:0139 s:0061 b:0059 l:000048 d:000058 BLOCK  /home/nkriege/src/nokogiri/test/test_reader_segfault.rb:22
c:0015 p:---- s:0054 b:0054 l:000053 d:000053 FINISH
c:0014 p:---- s:0052 b:0052 l:000051 d:000051 CFUNC  :times
c:0013 p:0069 s:0049 b:0049 l:000048 d:000048 METHOD /home/nkriege/src/nokogiri/test/test_reader_segfault.rb:15
c:0012 p:0063 s:0045 b:0045 l:000044 d:000044 METHOD /opt/ruby-1.9.2/lib/ruby/1.9.1/minitest/unit.rb:695
c:0011 p:0091 s:0039 b:0039 l:000020 d:000038 BLOCK  /opt/ruby-1.9.2/lib/ruby/1.9.1/minitest/unit.rb:656
c:0010 p:---- s:0034 b:0034 l:000033 d:000033 FINISH
c:0009 p:---- s:0032 b:0032 l:000031 d:000031 CFUNC  :each
c:0008 p:0026 s:0029 b:0029 l:000020 d:000028 BLOCK  /opt/ruby-1.9.2/lib/ruby/1.9.1/minitest/unit.rb:650
c:0007 p:---- s:0026 b:0026 l:000025 d:000025 FINISH
c:0006 p:---- s:0024 b:0024 l:000023 d:000023 CFUNC  :each
c:0005 p:0082 s:0021 b:0021 l:000020 d:000020 METHOD /opt/ruby-1.9.2/lib/ruby/1.9.1/minitest/unit.rb:649
c:0004 p:0188 s:0016 b:0016 l:000015 d:000015 METHOD /opt/ruby-1.9.2/lib/ruby/1.9.1/minitest/unit.rb:609
c:0003 p:0057 s:0007 b:0007 l:000f88 d:000860 BLOCK  /opt/ruby-1.9.2/lib/ruby/1.9.1/minitest/unit.rb:508
c:0002 p:---- s:0004 b:0004 l:000003 d:000003 FINISH
c:0001 p:0000 s:0002 b:0002 l:0010f8 d:0010f8 TOP
---------------------------
-- Ruby level backtrace information ----------------------------------------
/opt/ruby-1.9.2/lib/ruby/1.9.1/minitest/unit.rb:508:in `block in autorun'
/opt/ruby-1.9.2/lib/ruby/1.9.1/minitest/unit.rb:609:in `run'
/opt/ruby-1.9.2/lib/ruby/1.9.1/minitest/unit.rb:649:in `run_test_suites'
/opt/ruby-1.9.2/lib/ruby/1.9.1/minitest/unit.rb:649:in `each'
/opt/ruby-1.9.2/lib/ruby/1.9.1/minitest/unit.rb:650:in `block in run_test_suites'
/opt/ruby-1.9.2/lib/ruby/1.9.1/minitest/unit.rb:650:in `each'
/opt/ruby-1.9.2/lib/ruby/1.9.1/minitest/unit.rb:656:in `block (2 levels) in run_test_suites'
/opt/ruby-1.9.2/lib/ruby/1.9.1/minitest/unit.rb:695:in `run'
/home/nkriege/src/nokogiri/test/test_reader_segfault.rb:15:in `test_crash'
/home/nkriege/src/nokogiri/test/test_reader_segfault.rb:15:in `times'
/home/nkriege/src/nokogiri/test/test_reader_segfault.rb:22:in `block in test_crash'
/home/nkriege/src/nokogiri/lib/nokogiri/xml/reader.rb:90:in `attributes'
/home/nkriege/src/nokogiri/lib/nokogiri/xml/reader.rb:99:in `attribute_nodes'
/home/nkriege/src/nokogiri/lib/nokogiri/xml/reader.rb:99:in `each'
/home/nkriege/src/nokogiri/lib/nokogiri/xml/reader.rb:99:in `block in attribute_nodes'

-- C level backtrace information -------------------------------------------
/opt/ruby-1.9.2/bin/ruby(rb_vm_bugreport+0x9e) [0x52381e]
/opt/ruby-1.9.2/bin/ruby() [0x564f58]
/opt/ruby-1.9.2/bin/ruby(rb_bug+0xb1) [0x5650f1]
/opt/ruby-1.9.2/bin/ruby() [0x4b2878]
/lib/libpthread.so.0(+0xf8f0) [0x7f4d187b68f0]
/opt/ruby-1.9.2/bin/ruby(st_lookup+0xe) [0x4ba11e]
/opt/ruby-1.9.2/bin/ruby(rb_method_entry+0x6f) [0x50f4cf]
/opt/ruby-1.9.2/bin/ruby() [0x518521]
/opt/ruby-1.9.2/bin/ruby() [0x519149]
/opt/ruby-1.9.2/bin/ruby(rb_yield+0x66) [0x521a06]
/opt/ruby-1.9.2/bin/ruby(rb_ary_each+0x45) [0x533225]
/opt/ruby-1.9.2/bin/ruby() [0x512cff]
/opt/ruby-1.9.2/bin/ruby() [0x514436]
/opt/ruby-1.9.2/bin/ruby() [0x519149]
/opt/ruby-1.9.2/bin/ruby(rb_yield+0x66) [0x521a06]
/opt/ruby-1.9.2/bin/ruby() [0x44aad1]
/opt/ruby-1.9.2/bin/ruby() [0x512cff]
/opt/ruby-1.9.2/bin/ruby() [0x514436]
/opt/ruby-1.9.2/bin/ruby() [0x519149]
/opt/ruby-1.9.2/bin/ruby(rb_yield+0x66) [0x521a06]
/opt/ruby-1.9.2/bin/ruby(rb_ary_each+0x45) [0x533225]
/opt/ruby-1.9.2/bin/ruby() [0x512cff]
/opt/ruby-1.9.2/bin/ruby() [0x514436]
/opt/ruby-1.9.2/bin/ruby() [0x519149]
/opt/ruby-1.9.2/bin/ruby(rb_yield+0x66) [0x521a06]
/opt/ruby-1.9.2/bin/ruby(rb_ary_each+0x45) [0x533225]
/opt/ruby-1.9.2/bin/ruby() [0x512cff]
/opt/ruby-1.9.2/bin/ruby() [0x514436]
/opt/ruby-1.9.2/bin/ruby() [0x519149]
/opt/ruby-1.9.2/bin/ruby(rb_vm_invoke_proc+0x9f) [0x51c03f]
/opt/ruby-1.9.2/bin/ruby(rb_exec_end_proc+0x238) [0x41c188]
/opt/ruby-1.9.2/bin/ruby() [0x41c254]
/opt/ruby-1.9.2/bin/ruby(ruby_cleanup+0x12d) [0x41c3cd]
/opt/ruby-1.9.2/bin/ruby(ruby_run_node+0x3d) [0x41c6dd]
/opt/ruby-1.9.2/bin/ruby(main+0x49) [0x419c19]
/lib/libc.so.6(__libc_start_main+0xfd) [0x7f4d17b7ac4d]
/opt/ruby-1.9.2/bin/ruby() [0x419b09]

I am running ruby 1.9.2p180 on 64bit Ubuntu Lucid. Here is my version info as output by the test helper.

{
  "warnings"=>[], 
  "nokogiri"=>"1.4.4.2", 
  "ruby"=>{"version"=>"1.9.2", "platform"=>"x86_64-linux", "description"=>"ruby 1.9.2p180 (2011-02-18 revision 30909) [x86_64-linux]", "engine"=>"ruby"}, 
  "libxml"=>{"binding"=>"extension", "compiled"=>"2.7.6", "loaded"=>"2.7.6"}
}

Are you duping your strings in some other way? Does my version of the test segfault for you?

flavorjones commented 13 years ago

Hey all. We actually know what the problems are. One is fixed on master and will be in 1.5.0.

The other problem with Reader is that it's returning Ruby Strings that are having the underlying C string swept out from under them by libxml2. The change to fix this is a little invasive, and has performance implications, so we're holding off on fixing until after 1.5.0 drops (this weekend hopefully?).

flavorjones commented 13 years ago

Reopening per my above comment -- commit e95a344 fixes one aspect of the Reader problem; but there is more work that needs to be done here.