ohler55 / ox

Ruby Optimized XML Parser
http://www.ohler.com/ox
MIT License

Segfault when parsing deeply nested HTML #334

Closed pwoolcoc closed 1 year ago

pwoolcoc commented 1 year ago

I'm seeing a segfault under Ruby 3.1.3 when SAX parsing deeply nested HTML:

$ ruby -v
ruby 3.1.3p185 (2022-11-24 revision 1a6b16756e) [x86_64-darwin22]
$ gem list | grep "ox"
ox (2.14.14)

#!/usr/bin/env ruby

require 'stringio'
require 'ox'

class Sample < ::Ox::Sax
  def start_element(name); end
  def end_element(name); end
  def attr(name, value); end
  def text(value); end
end

html = File.read('nested.html')

handler = Sample.new
Ox.sax_parse(handler, StringIO.new(html))

It seems like nesting elements 32 layers deep is what triggers the segfault, and from what I can tell it doesn't really matter what the elements are. Here's nested.html:

<!DOCTYPE html>
<html>
    <body>
        <div>
            <div>
                <div>
                    <div>
                        <div>
                            <div>
                                <div>
                                    <div>
                                        <div>
                                            <div>
                                                <div>
                                                    <div>
                                                        <div>
                                                            <div>
                                                                <div>
                                                                    <div>
                                                                        <div>
                                                                            <div>
                                                                                <div>
                                                                                    <div>
                                                                                        <div>
                                                                                            <div>
                                                                                                <div>
                                                                                                    <div>
                                                                                                        <div>
                                                                                                            <div>
                                                                                                                <div>
                                                                                                                    <div>
                                                                                                                        <div>
                                                                                                                            <div>
                                                                                                                                <div></div>
                                                                                                                            </div>
                                                                                                                        </div>
                                                                                                                    </div>
                                                                                                                </div>
                                                                                                            </div>
                                                                                                        </div>
                                                                                                    </div>
                                                                                                </div>
                                                                                            </div>
                                                                                        </div>
                                                                                    </div>
                                                                                </div>
                                                                            </div>
                                                                        </div>
                                                                    </div>
                                                                </div>
                                                            </div>
                                                        </div>
                                                    </div>
                                                </div>
                                            </div>
                                        </div>
                                    </div>
                                </div>
                            </div>
                        </div>
                    </div>
                </div>
            </div>
        </div>
    </body>
</html>
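
For anyone who wants to vary the depth while bisecting, the file above can also be generated rather than typed by hand. This is only a sketch: `nested_html` is a hypothetical helper name, not part of Ox, and it assumes the same four-space indentation as the hand-written file, with 31 `<div>` levels under `<html>` and `<body>`.

```ruby
# Hypothetical helper (not part of Ox): builds an HTML document with
# `div_depth` nested <div> elements inside <html><body>, indented four
# spaces per level. Expects div_depth >= 1.
def nested_html(div_depth)
  lines = ['<!DOCTYPE html>', '<html>', '    <body>']

  # Opening tags for every level except the innermost one.
  (0...div_depth - 1).each { |i| lines << "#{' ' * (4 * (i + 2))}<div>" }

  # The innermost element is opened and closed on a single line.
  lines << "#{' ' * (4 * (div_depth + 1))}<div></div>"

  # Closing tags, from the deepest remaining level back out.
  (div_depth - 2).downto(0) { |i| lines << "#{' ' * (4 * (i + 2))}</div>" }

  lines << '    </body>' << '</html>'
  lines.join("\n") + "\n"
end
```

Calling `File.write('nested.html', nested_html(31))` should produce a file with the same nesting as the one above. Passing a smaller or larger number makes it possible to check at which depth the crash starts.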

And finally, here's the crash report:

Process:               ruby [79718]
Path:                  /Users/USER/*/ruby
Identifier:            ruby
Version:               ???
Code Type:             X86-64 (Native)
Parent Process:        zsh [55372]
Responsible:           iTerm2 [1993]
User ID:               1821780739

Date/Time:             2023-04-11 15:16:01.2887 -0400
OS Version:            macOS 13.3.1 (22E261)
Report Version:        12
Bridge OS Version:     7.4 (20P4252)
Anonymous UUID:        F8718707-179C-1224-0330-9038F1D97B66

Sleep/Wake UUID:       32B1C2A2-CE22-40C5-8D17-812DBDDA0A27

Time Awake Since Boot: 53000 seconds
Time Since Wake:       23544 seconds

System Integrity Protection: enabled

Crashed Thread:        0  Dispatch queue: com.apple.main-thread

Exception Type:        EXC_CRASH (SIGABRT)
Exception Codes:       0x0000000000000000, 0x0000000000000000

Application Specific Information:
abort() called

Thread 0 Crashed::  Dispatch queue: com.apple.main-thread
0   libsystem_kernel.dylib              0x7ff808de21f2 __pthread_kill + 10
1   libsystem_pthread.dylib             0x7ff808e19ee6 pthread_kill + 263
2   libsystem_c.dylib                   0x7ff808d40b45 abort + 123
3   libsystem_malloc.dylib              0x7ff808c57752 malloc_vreport + 888
4   libsystem_malloc.dylib              0x7ff808c5ab31 malloc_report + 151
5   libruby.3.1.dylib                      0x109fb49e3 objspace_xfree + 16 (gc.c:11675) [inlined]
6   libruby.3.1.dylib                      0x109fb49e3 ruby_sized_xfree + 47 (gc.c:11767) [inlined]
7   libruby.3.1.dylib                      0x109fb49e3 ruby_xfree + 51 (gc.c:11774)
8   ox.bundle                              0x109e09e19 stack_pop + 24 (sax_stack.h:92) [inlined]
9   ox.bundle                              0x109e09e19 read_element_end + 274 (sax.c:950) [inlined]
10  ox.bundle                              0x109e09e19 parse + 6441 (sax.c:440)
11  ox.bundle                              0x109e07e79 protect_parse + 9 (sax.c:66)
12  libruby.3.1.dylib                      0x109fa5301 rb_protect + 337 (eval.c:967)
13  ox.bundle                              0x109e07dee ox_sax_parse + 1150 (sax.c:91)
14  ox.bundle                              0x109e033d2 sax_parse + 498 (ox.c:1108)
15  libruby.3.1.dylib                      0x10a15c65d vm_call_cfunc_with_frame + 349 (vm_insnhelper.c:3037)
16  libruby.3.1.dylib                      0x10a15e9a4 vm_sendish + 244 (vm_insnhelper.c:4751)
17  libruby.3.1.dylib                      0x10a13eb92 vm_exec_core + 10466 (insns.def:778)
18  libruby.3.1.dylib                      0x10a152b91 rb_vm_exec + 2561
19  libruby.3.1.dylib                      0x109fa451b rb_ec_exec_node + 283 (eval.c:280)
20  libruby.3.1.dylib                      0x109fa43b3 ruby_run_node + 83 (eval.c:321)
21  ruby                                   0x109a50f6d main + 93 (main.c:47)
22  dyld                                0x7ff808ac041f start + 1903

Thread 1:
0   libsystem_kernel.dylib              0x7ff808de229e poll + 10
1   libruby.3.1.dylib                      0x10a11467c timer_pthread_fn + 140 (thread_pthread.c:2263)
2   libsystem_pthread.dylib             0x7ff808e1a1d3 _pthread_start + 125
3   libsystem_pthread.dylib             0x7ff808e15bd3 thread_start + 15

Thread 0 crashed with X86 Thread State (64-bit):
  rax: 0x0000000000000000  rbx: 0x00007ff84c4ed340  rcx: 0x00007ff7b64adf78  rdx: 0x0000000000000000
  rdi: 0x0000000000000103  rsi: 0x0000000000000006  rbp: 0x00007ff7b64adfa0  rsp: 0x00007ff7b64adf78
   r8: 0x000000000000002e   r9: 0x0000000000000000  r10: 0x0000000000000000  r11: 0x0000000000000246
  r12: 0x0000000000000103  r13: 0x0000000109ad5028  r14: 0x0000000000000006  r15: 0x0000000000000016
  rip: 0x00007ff808de21f2  rfl: 0x0000000000000246  cr2: 0x00007ff84a776008

Logical CPU:     0
Error Code:      0x02000148 
Trap Number:     133

pwoolcoc commented 1 year ago

I reproduced this on macOS, but we are also seeing it happen on x86_64 Linux.

ohler55 commented 1 year ago

I'll dig into it and get it fixed.

pwoolcoc commented 1 year ago

Thanks!

ohler55 commented 1 year ago

I tried with the latest version and it did not crash. I wonder if the fixes in the last two versions took care of the issue. Can you try with v2.14.16?

pwoolcoc commented 1 year ago

Thanks, apparently the last time I checked for a new version was before 2.14.15 went out. Upgrading to 2.14.16 did fix the issue, thanks!
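
For anyone else landing here, a version check along these lines may help confirm whether an installed Ox is at or past the release reported as working in this thread. It is a sketch only: `ox_version_ok?` and `FIXED_OX_VERSION` are hypothetical names, and the 2.14.16 threshold is simply the version confirmed above (the fix may already be in 2.14.15, which was not tested here).

```ruby
require 'rubygems'

# Version confirmed as no longer crashing in this thread (assumption:
# treated as the minimum safe version; 2.14.15 was not tested).
FIXED_OX_VERSION = Gem::Version.new('2.14.16')

# Hypothetical helper: true when the given version string is at or
# above the confirmed-good release.
def ox_version_ok?(version_string)
  Gem::Version.new(version_string) >= FIXED_OX_VERSION
end
```

With Ox loaded, this could be called as `ox_version_ok?(Ox::VERSION)`. The equivalent Gemfile constraint would be `gem 'ox', '>= 2.14.16'`.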