sparklemotion / nokogiri

Nokogiri (鋸) makes it easy and painless to work with XML and HTML from Ruby.
https://nokogiri.org/
MIT License

[bug] stack level too deep (SystemStackError) in Fiber #3034

Closed bendillinger closed 11 months ago

bendillinger commented 11 months ago

Please describe the bug

I'm seeing stack level too deep (SystemStackError) with Nokogiri, but only inside a Fiber and only beyond a certain nesting depth.

Help us reproduce what you're seeing

require 'nokogiri'

# Outside a Fiber: prints 1 through 500 without raising
(1..500).each do |n|
  puts n
  html = "<html>#{"<strong>" * n }asdf#{"</strong>" * n}</html>"
  document = Nokogiri::HTML5(html, max_tree_depth: 1000)
  document.traverse { |node| node }
end

# Inside a Fiber: raises `SystemStackError` from `traverse` at n=261
(1..500).each do |n|
  puts n
  Fiber.new do
    html = "<html>#{"<strong>" * n }asdf#{"</strong>" * n}</html>"
    document = Nokogiri::HTML5(html, max_tree_depth: 1000)
    document.traverse { |node| node }
  end.resume
end

Expected behavior

I expected the traversal to behave the same inside and outside of a Fiber.

Environment

# Nokogiri (1.15.5)
    ---
    warnings: []
    nokogiri:
      version: 1.15.5
      cppflags:
      - "-I/Users/benjamindillinger/.rbenv/versions/3.2.2/lib/ruby/gems/3.2.0/gems/nokogiri-1.15.5-arm64-darwin/ext/nokogiri"
      - "-I/Users/benjamindillinger/.rbenv/versions/3.2.2/lib/ruby/gems/3.2.0/gems/nokogiri-1.15.5-arm64-darwin/ext/nokogiri/include"
      - "-I/Users/benjamindillinger/.rbenv/versions/3.2.2/lib/ruby/gems/3.2.0/gems/nokogiri-1.15.5-arm64-darwin/ext/nokogiri/include/libxml2"
      ldflags: []
    ruby:
      version: 3.2.2
      platform: arm64-darwin21
      gem_platform: arm64-darwin-21
      description: ruby 3.2.2 (2023-03-30 revision e51014f9c0) +YJIT [arm64-darwin21]
      engine: ruby
    libxml:
      source: packaged
      precompiled: true
      patches:
      - 0001-Remove-script-macro-support.patch
      - 0002-Update-entities-to-remove-handling-of-ssi.patch
      - 0003-libxml2.la-is-in-top_builddir.patch
      - '0009-allow-wildcard-namespaces.patch'
      - 0010-update-config.guess-and-config.sub-for-libxml2.patch
      - 0011-rip-out-libxml2-s-libc_single_threaded-support.patch
      libxml2_path: "/Users/benjamindillinger/.rbenv/versions/3.2.2/lib/ruby/gems/3.2.0/gems/nokogiri-1.15.5-arm64-darwin/ext/nokogiri"
      memory_management: ruby
      iconv_enabled: true
      compiled: 2.11.6
      loaded: 2.11.6
    libxslt:
      source: packaged
      precompiled: true
      patches:
      - 0001-update-config.guess-and-config.sub-for-libxslt.patch
      datetime_enabled: true
      compiled: 1.1.39
      loaded: 1.1.39
    other_libraries:
      zlib: 1.2.13
      libiconv: '1.17'
      libgumbo: 1.0.0-nokogiri

flavorjones commented 11 months ago

Hi @bendillinger, thanks for asking this question, I'll try to help.

> Expecting that the traversal would perform the same in or outside of a fiber.

Well, unfortunately, each Ruby fiber runs on its own stack, separate from the main thread's, and that stack is limited in size.
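
To see the difference without involving Nokogiri at all, here is a minimal sketch that counts how many nested pure-Ruby method calls succeed before SystemStackError is raised, once on the main thread and once inside a Fiber. The helper names `dive` and `max_call_depth` are made up for this illustration. The exact numbers depend on the Ruby version, platform, and any stack-size settings, so treat the output as indicative only; on a default CRuby build the fiber figure is usually much smaller.

```ruby
# Unbounded recursion that records each call in a shared one-element array.
def dive(counter)
  counter[0] += 1
  dive(counter)
end

# Recurses until Ruby raises SystemStackError, then returns the depth reached
# on whichever stack (main thread or fiber) the caller is running on.
def max_call_depth
  counter = [0]
  begin
    dive(counter)
  rescue SystemStackError
    # Expected: the current stack is exhausted.
  end
  counter[0]
end

main_depth  = max_call_depth
fiber_depth = Fiber.new { max_call_depth }.resume

puts "main thread: #{main_depth} nested calls"
puts "fiber:       #{fiber_depth} nested calls"
```

This only exercises plain Ruby method calls. A Nokogiri traversal also passes through C code, so the depth at which it fails will not match these counts.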

From https://docs.ruby-lang.org/en/master/Fiber.html:

As opposed to other stackless light weight concurrency models, each fiber comes with a stack. This enables the fiber to be paused from deeply nested function calls within the fiber block. See the ruby(1) manpage to configure the size of the fiber stack(s).

From the ruby man page:

STACK SIZE ENVIRONMENT

Stack size environment variables are implementation-dependent and subject to change with different versions of Ruby. The VM stack is used for pure-Ruby code and managed by the virtual machine. Machine stack is used by the operating system and its usage is dependent on C extensions as well as C compiler options. Using lower values for these may allow applications to keep more Fibers or Threads running; but increases the chance of SystemStackError exceptions and segmentation faults (SIGSEGV). These environment variables are available since Ruby 2.0.0. All values are specified in bytes.

RUBY_THREAD_VM_STACK_SIZE

  • VM stack size used at thread creation. default: 131072 (32-bit CPU) or 262144 (64-bit)

RUBY_THREAD_MACHINE_STACK_SIZE

  • Machine stack size used at thread creation. default: 524288 or 1048575

RUBY_FIBER_VM_STACK_SIZE

  • VM stack size used at fiber creation. default: 65536 or 131072

RUBY_FIBER_MACHINE_STACK_SIZE

  • Machine stack size used at fiber creation. default: 262144 or 524288

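You should be able to raise RUBY_FIBER_VM_STACK_SIZE and RUBY_FIBER_MACHINE_STACK_SIZE above their defaults to improve this behavior. These are environment variables, so they need to be set before the Ruby process starts.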
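
As a rough way to check the effect of these variables, the sketch below launches a child Ruby process twice, once with the default environment and once with larger fiber stack sizes, and reports how deep a plain recursive method gets inside a Fiber in each case. The byte values (1 MiB and 4 MiB) are arbitrary choices for the illustration rather than recommendations, `PROBE` and `fiber_depth_with` are hypothetical names, and I have not verified which values are needed for the Nokogiri reproduction above.

```ruby
require "open3"
require "rbconfig"

# Script run in the child process: recurse inside a Fiber until the stack
# is exhausted, then print the depth that was reached.
PROBE = <<~'RUBY'
  def dive(counter)
    counter[0] += 1
    dive(counter)
  end

  counter = [0]
  Fiber.new do
    begin
      dive(counter)
    rescue SystemStackError
      # Expected once the fiber stack is exhausted.
    end
  end.resume
  print counter[0]
RUBY

# Runs PROBE in a fresh Ruby process with the given extra environment
# variables and returns the recursion depth it printed.
def fiber_depth_with(env)
  output, _status = Open3.capture2(env, RbConfig.ruby, "-e", PROBE)
  output.to_i
end

default_depth = fiber_depth_with({})
raised_depth  = fiber_depth_with(
  "RUBY_FIBER_VM_STACK_SIZE"      => (1024 * 1024).to_s,     # assumed value: 1 MiB
  "RUBY_FIBER_MACHINE_STACK_SIZE" => (4 * 1024 * 1024).to_s  # assumed value: 4 MiB
)

puts "default fiber stack: #{default_depth} nested calls"
puts "raised fiber stack:  #{raised_depth} nested calls"
```

The child process is needed because the interpreter reads these variables at startup, so changing `ENV` inside an already running script would not resize the fiber stacks of that process.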

Hope this helps!

bendillinger commented 11 months ago

@flavorjones Awesome, thanks! That's incredibly informative.