Closed: Overload119 closed this issue 2 years ago
@Overload119 Thanks for opening this issue, I'll try to help.
Please respond with the output from nokogiri -v on this system so I know more about your system and how nokogiri was built.
Have you modified this installation at all? The stack walkback contains a strange line that does not correspond to any files in the gem:
c:0014 p:0293 s:0092 e:000091 METHOD /usr/local/bundle/gems/nokogiri-1.12.5-x86_64-linux/lib/x:215
I would expect that stack frame to be labeled lib/nokogiri/html4/document.rb
It's possible that you're triggering a concurrency-related issue by using the parallel gem. Can you tell me more about how you're using it (processes, or threads)?
I'm using threads, and this is how I'm using Parallel:
https://gist.github.com/Overload119/dee7b63406d9929d1f0dae0cc6158656#file-crawl_website-rb-L87
I've attached a minimal repro above (I will try to make it more minimal).
I have also updated the original issue with nokogiri -v
I have not modified the installation to my recollection on either system.
Please do keep trying to make this as minimal as possible. I'll spend some time seeing if I can trigger a concurrency bug in HTML4::Document.read_memory.
Just to be clear: are you running this in the linux docker container? Some of the info you've provided is from a darwin system so I wanted to confirm there's not something even more complex going on.
I'm unable to see any problems running Nokogiri with varying levels of load and concurrency with this script:
#! /usr/bin/env ruby
require "nokogiri"
require "parallel"

html = File.read("test/files/tlm.html")

loop do
  n_doc = rand(1000) + 1
  n_thread = rand(100) + 1
  print "> #{n_doc}/#{n_thread}:"
  begin
    docs = []
    Parallel.each(1.upto(n_doc).to_a, in_threads: n_thread) do
      docs << Nokogiri::HTML4.parse(html)
      putc "."
    end
  end
  GC.start(full_mark: true)
  puts
end
C functions are called with the GVL held, so I'm not surprised that I can't reproduce a problem. Without a self-contained reproducible test case, I'm not sure how to help.
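For anyone following along, here is a rough sketch of a stricter variant of the load test above. It guards the shared array with a Mutex instead of relying on the GVL, and it uses only the standard library so it runs without any gems. Everything in it is an assumption for illustration: parse_stub is a hypothetical stand-in for Nokogiri::HTML4.parse, and collect_in_threads is a made-up helper, not part of nokogiri or parallel.

```ruby
# Hypothetical stand-in for Nokogiri::HTML4.parse, so this runs without gems.
def parse_stub(html)
  html.dup
end

# Hypothetical helper: run n_doc "parses" across n_thread threads and
# collect the results in a shared array guarded by a Mutex.
def collect_in_threads(html, n_doc:, n_thread:)
  docs = []
  lock = Mutex.new
  jobs = Queue.new
  n_doc.times { |i| jobs << i }
  jobs.close # once drained, pop returns nil and the workers stop

  workers = Array.new(n_thread) do
    Thread.new do
      while jobs.pop
        doc = parse_stub(html)
        lock.synchronize { docs << doc }
      end
    end
  end
  workers.each(&:join)
  docs
end

docs = collect_in_threads("<html><body>hi</body></html>", n_doc: 50, n_thread: 8)
puts docs.size
```

If a crash only shows up without the Mutex, that would point at the shared state in the script rather than at the parser, which may help narrow down a repro.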
@Overload119 Can you provide any additional information about reproducing this?
After turning concurrency to 0 I no longer have the issue. I'm using a Rails environment and I think that's related to it.
I'd like to come back to enabling my script with concurrency but don't have a timeline for that so feel free to close. Can I reopen the issue when I make time to revisit it?
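A rough sketch of how the concurrency level could be kept as a switch, so the script can run inline for now and be flipped back to threads when revisiting this. It is stdlib-only and entirely hypothetical: each_maybe_threaded is not the parallel gem's API, and CRAWL_THREADS is an invented environment variable name.

```ruby
# Hypothetical helper (not part of the parallel gem): threads == 0 means
# "run everything inline on the calling thread".
def each_maybe_threaded(items, threads:)
  return items.each { |item| yield item } if threads.zero?

  jobs = Queue.new
  items.each { |item| jobs << [item] } # wrap so nil/false items survive pop
  jobs.close

  Array.new(threads) do
    Thread.new do
      while (job = jobs.pop)
        yield job.first
      end
    end
  end.each(&:join)
  items
end

threads = Integer(ENV.fetch("CRAWL_THREADS", "0")) # hypothetical env var
seen = Queue.new # thread-safe collector
each_maybe_threaded(%w[a b c], threads: threads) { |page| seen << page }
puts seen.size
```

Running the same job at 0 and at a higher thread count would make it easier to confirm whether the crash depends on concurrency.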
@Overload119 Please do follow up if you can reproduce it, thank you. Happy to re-open this issue when you do.
While running a long-running job that scrapes 20 pages from a website multiple times, I sporadically get the following error and the Ruby process dies. Because the process runs for a long time, I'm still working on a bare-minimum reproducible case I can share here.
Starting this early to see if there are any tips on providing more information to help debug this.
Help us reproduce what you're seeing
Environment