sparklemotion / nokogiri

Nokogiri (鋸) makes it easy and painless to work with XML and HTML from Ruby.
https://nokogiri.org/
MIT License
6.15k stars 899 forks

[bug] Segmentation fault related to concurrency #2390

Closed Overload119 closed 2 years ago

Overload119 commented 2 years ago

While running a long-running job that scrapes 20 pages from a website multiple times, I sporadically get the following error and the Ruby process dies. I'm still working on a bare-minimum reproducible case that I can share here.

Starting this early to see if there are any tips on providing more information to help debug this.

I have reproduced this both on my Mac OSX machine as well as my Docker container.

Mac OSX:

```
/usr/local/bundle/gems/nokogiri-1.12.5-x86_64-linux/lib/nokogiri/html4/document.rb:215: [BUG] Segmentation fault at 0x00002aabe98c1028
ruby 2.7.4p191 (2021-07-07 revision a21a3b7d23) [x86_64-linux]

-- Control frame information -----------------------------------------------
c:0015 p:---- s:0100 e:000099 CFUNC :read_memory
c:0014 p:0293 s:0092 e:000091 METHOD /usr/local/bundle/gems/nokogiri-1.12.5-x86_64-linux/lib/x:215
c:0013 p:0056 s:0083 e:000082 METHOD /usr/local/bundle/gems/nokogiri-1.12.5-x86_64-linux/lib/nokogiri/html4.rb:7
c:0012 p:0069 s:0074 e:000069 METHOD /sapco/app/typed_service_objects/crawl_website.rb:199 [FINISH]
c:0011 p:---- s:0063 e:000062 CFUNC :bind_call
c:0010 p:0065 s:0057 e:000056 BLOCK /usr/local/bundle/gems/sorbet-runtime-0.5.9155/lib/types/private/methods/call_validation_2_7.rb:106 [FINISH]
c:0009 p:0214 s:0050 e:000049 BLOCK /sapco/app/typed_service_objects/crawl_website.rb:111
c:0008 p:0015 s:0045 e:000044 METHOD /usr/local/bundle/gems/activesupport-6.1.4.1/lib/active_support/execution_wrapper.rb:88
c:0007 p:0017 s:0040 e:000039 BLOCK /sapco/app/typed_service_objects/crawl_website.rb:92
c:0006 p:0048 s:0036 e:000035 METHOD /usr/local/bundle/gems/parallel-1.20.1/lib/parallel.rb:509
c:0005 p:0015 s:0027 e:000026 BLOCK /usr/local/bundle/gems/parallel-1.20.1/lib/parallel.rb:367
c:0004 p:0033 s:0024 e:000023 METHOD /usr/local/bundle/gems/parallel-1.20.1/lib/parallel.rb:518
c:0003 p:0032 s:0014 e:000013 BLOCK /usr/local/bundle/gems/parallel-1.20.1/lib/parallel.rb:366
c:0002 p:0005 s:0006 e:000005 BLOCK /usr/local/bundle/gems/parallel-1.20.1/lib/parallel.rb:215 [FINISH]
c:0001 p:---- s:0003 e:000002 (none) [FINISH]

-- Ruby level backtrace information ----------------------------------------
/usr/local/bundle/gems/parallel-1.20.1/lib/parallel.rb:215:in `block (4 levels) in in_threads'
/usr/local/bundle/gems/parallel-1.20.1/lib/parallel.rb:366:in `block in work_in_threads'
/usr/local/bundle/gems/parallel-1.20.1/lib/parallel.rb:518:in `with_instrumentation'
/usr/local/bundle/gems/parallel-1.20.1/lib/parallel.rb:367:in `block (2 levels) in work_in_threads'
/usr/local/bundle/gems/parallel-1.20.1/lib/parallel.rb:509:in `call_with_index'
/sapco/app/typed_service_objects/crawl_website.rb:92:in `block (3 levels) in perform'
/usr/local/bundle/gems/activesupport-6.1.4.1/lib/active_support/execution_wrapper.rb:88:in `wrap'
/sapco/app/typed_service_objects/crawl_website.rb:111:in `block (4 levels) in perform'
/usr/local/bundle/gems/sorbet-runtime-0.5.9155/lib/types/private/methods/call_validation_2_7.rb:106:in `block in create_validator_method_fast1'
/usr/local/bundle/gems/sorbet-runtime-0.5.9155/lib/types/private/methods/call_validation_2_7.rb:106:in `bind_call'
/sapco/app/typed_service_objects/crawl_website.rb:199:in `crawl_page'
/usr/local/bundle/gems/nokogiri-1.12.5-x86_64-linux/lib/nokogiri/html4.rb:7:in `HTML4'
/usr/local/bundle/gems/nokogiri-1.12.5-x86_64-linux/lib/nokogiri/html4/document.rb:215:in `parse'
/usr/local/bundle/gems/nokogiri-1.12.5-x86_64-linux/lib/nokogiri/html4/document.rb:215:in `read_memory'

-- Machine register context ------------------------------------------------
RIP: 0x00002aab94422e3f RBP: 0x0000000000001000 RSP: 0x00002aabae4fec30
RAX: 0x0000000000000fe1 RBX: 0x00002aabe8000020 RCX: 0x00002aabe98c1020
RDX: 0x00002aabe98c0f60 RDI: 0x00002aab9455cc40 RSI: 0x00000000000000d4
R8: 0x0000000000000000 R9: 0x00002aabe8000000 R10: 0xfffffffffffff000
R11: 0x00000000018c1000 R12: 0x00000000000000d5 R13: 0x00002aabe98c0f50
R14: 0x00000000000000b0 R15: 0x00002aabe8000080 EFL: 0x0000000000010202

-- C level backtrace information -------------------------------------------
```

Help us reproduce what you're seeing

Environment

  * nokogiri (1.12.5)
    Summary: Nokogiri (鋸) makes it easy and painless to work with XML and HTML from Ruby.
    Homepage: https://nokogiri.org
    Path: /Users/amirsharif/.rvm/gems/ruby-2.7.4/gems/nokogiri-1.12.5-x86_64-darwin
```
# Nokogiri (1.12.5)
---
warnings: []
nokogiri:
  version: 1.12.5
  cppflags:
  - "-I/Users/amirsharif/.rvm/gems/ruby-2.7.4/gems/nokogiri-1.12.5-x86_64-darwin/ext/nokogiri"
  - "-I/Users/amirsharif/.rvm/gems/ruby-2.7.4/gems/nokogiri-1.12.5-x86_64-darwin/ext/nokogiri/include"
  - "-I/Users/amirsharif/.rvm/gems/ruby-2.7.4/gems/nokogiri-1.12.5-x86_64-darwin/ext/nokogiri/include/libxml2"
  ldflags: []
ruby:
  version: 2.7.4
  platform: x86_64-darwin20
  gem_platform: x86_64-darwin-20
  description: ruby 2.7.4p191 (2021-07-07 revision a21a3b7d23) [x86_64-darwin20]
  engine: ruby
libxml:
  source: packaged
  precompiled: true
  patches:
  - 0001-Remove-script-macro-support.patch
  - 0002-Update-entities-to-remove-handling-of-ssi.patch
  - 0003-libxml2.la-is-in-top_builddir.patch
  - 0004-use-glibc-strlen.patch
  - 0005-avoid-isnan-isinf.patch
  - 0006-update-automake-files-for-arm64.patch
  - 0007-Fix-XPath-recursion-limit.patch
  libxml2_path: "/Users/amirsharif/.rvm/gems/ruby-2.7.4/gems/nokogiri-1.12.5-x86_64-darwin/ext/nokogiri"
  memory_management: ruby
  iconv_enabled: true
  compiled: 2.9.12
  loaded: 2.9.12
libxslt:
  source: packaged
  precompiled: true
  patches:
  - 0001-update-automake-files-for-arm64.patch
  - 0002-Fix-xml2-config-check-in-configure-script.patch
  datetime_enabled: true
  compiled: 1.1.34
  loaded: 1.1.34
other_libraries:
  zlib: 1.2.11
  libiconv: '1.15'
  libgumbo: 1.0.0-nokogiri
```

Docker:

```
# Nokogiri (1.12.5)
---
warnings: []
nokogiri:
  version: 1.12.5
  cppflags:
  - "-I/usr/local/bundle/gems/nokogiri-1.12.5-x86_64-linux/ext/nokogiri"
  - "-I/usr/local/bundle/gems/nokogiri-1.12.5-x86_64-linux/ext/nokogiri/include"
  - "-I/usr/local/bundle/gems/nokogiri-1.12.5-x86_64-linux/ext/nokogiri/include/libxml2"
  ldflags: []
ruby:
  version: 2.7.4
  platform: x86_64-linux
  gem_platform: x86_64-linux
  description: ruby 2.7.4p191 (2021-07-07 revision a21a3b7d23) [x86_64-linux]
  engine: ruby
libxml:
  source: packaged
  precompiled: true
  patches:
  - 0001-Remove-script-macro-support.patch
  - 0002-Update-entities-to-remove-handling-of-ssi.patch
  - 0003-libxml2.la-is-in-top_builddir.patch
  - 0004-use-glibc-strlen.patch
  - 0005-avoid-isnan-isinf.patch
  - 0006-update-automake-files-for-arm64.patch
  - 0007-Fix-XPath-recursion-limit.patch
  libxml2_path: "/usr/local/bundle/gems/nokogiri-1.12.5-x86_64-linux/ext/nokogiri"
  memory_management: ruby
  iconv_enabled: true
  compiled: 2.9.12
  loaded: 2.9.12
libxslt:
  source: packaged
  precompiled: true
  patches:
  - 0001-update-automake-files-for-arm64.patch
  - 0002-Fix-xml2-config-check-in-configure-script.patch
  datetime_enabled: true
  compiled: 1.1.34
  loaded: 1.1.34
other_libraries:
  zlib: 1.2.11
  libgumbo: 1.0.0-nokogiri
```

flavorjones commented 2 years ago

@Overload119 Thanks for opening this issue, I'll try to help.

Please respond with the output from `nokogiri -v` on this system so I know more about your system and how nokogiri was built.

Have you modified this installation at all? The stack walkback contains a strange line that does not correspond to any files in the gem:

c:0014 p:0293 s:0092 e:000091 METHOD /usr/local/bundle/gems/nokogiri-1.12.5-x86_64-linux/lib/x:215

I would expect that stack frame to be labeled lib/nokogiri/html4/document.rb.

It's possible that you're triggering a concurrency-related issue by using the parallel gem. Can you tell me more about how you're using it (processes, or threads)?
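
The distinction matters because threads all run inside one Ruby process and share its memory, while forked processes each work on their own copy. As a rough stdlib-only illustration (not the reporter's code; `pids_from_threads` and `pids_from_processes` are hypothetical helpers, and with the parallel gem the corresponding options are `in_threads:` and `in_processes:`):

```ruby
# Hypothetical helper: run the work in threads inside the current process.
def pids_from_threads(count)
  count.times.map { Thread.new { Process.pid } }.map(&:value)
end

# Hypothetical helper: run the work in forked child processes (Unix-only),
# sending each child's pid back to the parent over a pipe.
def pids_from_processes(count)
  count.times.map do
    reader, writer = IO.pipe
    child = fork do
      reader.close
      writer.write(Process.pid.to_s)
      writer.close
      exit!(0) # skip any at_exit hooks inherited from the parent
    end
    writer.close
    pid = reader.read.to_i
    reader.close
    Process.wait(child)
    pid
  end
end

thread_pids = pids_from_threads(3)
process_pids = pids_from_processes(3)

puts "threads ran in pids:   #{thread_pids.uniq.inspect}"
puts "processes ran in pids: #{process_pids.uniq.inspect}"
```

Every thread reports the parent's pid, whereas each forked worker reports a different one, which is why a crash in threaded mode takes down the whole Ruby process.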

Overload119 commented 2 years ago

I'm using threads, and this is how I'm using Parallel:

https://gist.github.com/Overload119/dee7b63406d9929d1f0dae0cc6158656#file-crawl_website-rb-L87

I've attached a minimal repro above (I will try to make it more minimal). I have also updated the original issue with the output of `nokogiri -v`.

I have not modified the installation to my recollection on either system.

flavorjones commented 2 years ago

Please do keep trying to make this as minimal as possible. I'll spend some time seeing if I can trigger a concurrency bug in HTML4::Document.read_memory.

flavorjones commented 2 years ago

Just to be clear: are you running this in the linux docker container? Some of the info you've provided is from a darwin system so I wanted to confirm there's not something even more complex going on.

flavorjones commented 2 years ago

I'm unable to see any problems running Nokogiri with varying levels of load and concurrency with this script:

```ruby
#! /usr/bin/env ruby

require "nokogiri"
require "parallel"

html = File.read("test/files/tlm.html")

loop do
  n_doc = rand(1000) + 1
  n_thread = rand(100) + 1
  print "> #{n_doc}/#{n_thread}:"
  begin
    docs = []
    Parallel.each(1.upto(n_doc).to_a, in_threads: n_thread) do
      docs << Nokogiri::HTML4.parse(html)
      putc "."
    end
  end
  GC.start(full_mark: true)
  puts
end
```

C functions are called while holding the GVL (Ruby's global VM lock), so I'm not surprised that I can't reproduce a problem. Without a self-contained reproducible test case, I'm not sure how to help.

flavorjones commented 2 years ago

@Overload119 Can you provide any additional information about reproducing this?

Overload119 commented 2 years ago

After setting concurrency to 0, I no longer see the issue. I'm running this in a Rails environment, and I suspect that is related.

I'd like to come back to enabling my script with concurrency but don't have a timeline for that so feel free to close. Can I reopen the issue when I make time to revisit it?
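
For anyone revisiting this, a concurrency toggle makes it easier to confirm whether a crash only appears with threads. The following is a minimal stdlib-only sketch under stated assumptions: `map_with_concurrency` is a hypothetical helper (not part of the parallel gem or of the script in the gist), a thread count of 0 is taken to mean "run inline on the calling thread", and `stand_in_parse` is a placeholder for a real parser call.

```ruby
# Hypothetical helper: map over items using `threads` worker threads.
# threads == 0 disables concurrency and runs inline on the calling thread.
def map_with_concurrency(items, threads:, &block)
  return items.map(&block) if threads.zero?

  queue = Queue.new
  items.each_with_index { |item, index| queue << [item, index] }
  queue.close # once drained, pop returns nil and the workers stop

  results = Array.new(items.size)
  workers = Array.new(threads) do
    Thread.new do
      while (pair = queue.pop)
        item, index = pair
        results[index] = block.call(item) # each index is written exactly once
      end
    end
  end
  workers.each(&:join)
  results
end

# Placeholder for a real parsing call; it only measures the input.
stand_in_parse = ->(html) { html.length }
pages = ["<p>a</p>", "<p>bb</p>", "<p>ccc</p>"]

serial = map_with_concurrency(pages, threads: 0, &stand_in_parse)
threaded = map_with_concurrency(pages, threads: 4, &stand_in_parse)
puts "same results: #{serial == threaded}"
```

Running the same inputs at 0 and at N threads, and comparing the outcomes, is one way to narrow a sporadic failure down to the threaded path before trimming the repro further.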

flavorjones commented 2 years ago

@Overload119 Please do follow up if you can reproduce it, thank you. Happy to re-open this issue when you do.