@geoffharcourt were you ever able to figure this out? I've started seeing the same thing. For me, it only happens on GitHub Actions CI; it runs fine locally every time 😕
@lserman we're not on GitHub Actions; we're running these test steps in a Docker container on Buildkite. We kept this debugging on for several weeks, and in every failure there was a huge amount of RAM left, so while this failure has been linked to low memory in the past, this looks like a new cause.
After a lot of further testing, we're fairly sure this is related to opening new browser tabs and the wrong window handles being referenced in the following spec, but we're not 100% sure how to fix it. Our temporary solution is to monkey-patch Cuprite like this:
module Capybara::Cuprite
  class Browser
    # Patched #visit: on a dead browser or timeout, log the current window
    # handles, reset the Capybara session, and re-raise the original error.
    def visit(*args)
      goto(*args)
    rescue Ferrum::DeadBrowserError, Ferrum::TimeoutError => exception
      puts "Window handles at #{exception.class} (#{Time.current}): #{window_handles}"
      puts "Restarting browser..."
      Capybara.reset!
      puts "Window handles just after restart (#{Time.current}): #{window_handles}"
      raise exception
    end
  end
end
Since we added this change (plus some extra debugging to show the window handles in earlier tests), we can see this happens after a spec where two window handles existed (from a popup or a second tab being opened). I suspect the second tab is either not getting closed or is getting closed late, but we're not sure.
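If the second tab is the culprit, one possible experiment (a sketch using Capybara's standard window API, not something validated in this thread) is to explicitly close any non-current windows after each system spec so a stale handle can't leak into the next example:

RSpec.configure do |config|
  config.after(:each, type: :system) do
    # Close every window except the one currently in use, so a leftover
    # popup or second tab can't confuse the next example's handle bookkeeping.
    Capybara.current_session.windows.reject(&:current?).each do |window|
      window.close if window.exists?
    end
  end
end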
@geoffharcourt do you have a debug log for this test when it's failing?
We were running into these issues intermittently as well. We installed minitest-retry in our system test suite helper and passed those two exceptions to its configuration. It seems to be working OK right now, but it's certainly a bit of a hack.
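For reference, a minimal sketch of that minitest-retry setup (assuming the two exceptions are the ones rescued in the monkey-patch above; option names per the minitest-retry README):

# test/test_helper.rb
require "minitest/retry"

# Retry examples that die with a dead browser or timeout instead of
# failing the whole CI run on the first occurrence.
Minitest::Retry.use!(
  retry_count: 3,
  exceptions_to_retry: [Ferrum::DeadBrowserError, Ferrum::TimeoutError]
)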
@geoffharcourt @lancecarlson a log of the issue is still appreciated.
@route what is your recommended way to log the issue? Sorry, I don't have much experience with configuring Ferrum, because I wasn't the one who set up the CI initially.
@lancecarlson You can pass in a file to write the debug log to when you create a browser: link
So for example, you could do Ferrum::Browser.new(logger: $stderr)
or
log = File.open('ferrum.log', 'w')
at_exit { log.close }
browser = Ferrum::Browser.new(logger: log)
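For Cuprite users, the logger can presumably be forwarded through the driver instead, since Cuprite passes its remaining options on to Ferrum (a sketch, not verified in this thread):

log = File.open("ferrum.log", "w")
at_exit { log.close }

Capybara.register_driver(:cuprite) do |app|
  # Cuprite forwards options such as :logger to the underlying Ferrum::Browser.
  Capybara::Cuprite::Driver.new(app, logger: log)
end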
Hi @route, apologies for the wait. We had a run with several dead browser errors this morning. Here's a link to the log output; it's really long, and I couldn't work out what was relevant or useful, so I'm putting it all here for now: https://gist.github.com/geoffharcourt/94f86f75d30b1e2ad90edc5e2f5cba30
No problem @geoffharcourt I'll take a look
FWIW, the workaround I posted above turned this from a pretty frustrating situation into one where a single retry nearly always gets past the failure. That seems like evidence of an issue that isn't the usual out-of-RAM problem.
I went through the log file and didn't spot anything that points to the problem, but what I find odd is this sequence:
7.329515521000076 {"method":"Runtime.callFunctionOn","params":{"returnByValue":true,"objectId":"-8742411280284681957.2.134","functionDeclaration":"function() { return _cuprite.isVisible(this) }","arguments":[]},"id":1366}
7.335774601999901 {"method":"Fetch.enable","params":{"handleAuthRequests":true,"patterns":[{"urlPattern":"*"}]},"id":1367}
7.3531246129998635 {"method":"Target.disposeBrowserContext","params":{"browserContextId":"6DDC3F570DD6804CE57039079ABFCC73"},"id":9}
So we work with a node in the DOM, and then Fetch.enable is suddenly called again to intercept requests; after that, the browser context is disposed.
@geoffharcourt Could you try experimenting with turning off the blacklist/whitelist options and network interception, and see if that makes the issue disappear?
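For example, if the original driver registration passes url_blacklist/url_whitelist, one experiment could be re-registering the driver without them (a sketch; the option names assume Cuprite's documented url_blacklist/url_whitelist driver options):

Capybara.register_driver(:cuprite_no_intercept) do |app|
  Capybara::Cuprite::Driver.new(
    app,
    window_size: [1200, 800]
    # url_blacklist: [...], url_whitelist: [...]  # removed for this experiment
  )
end

Capybara.javascript_driver = :cuprite_no_intercept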
I'll reopen when you provide some more info; without it I can't help.
We see similar issues where Ferrum::DeadBrowserError exceptions happen for some specs while running on CI. Those exceptions are very hard to reproduce locally, which makes them difficult to debug.
I found a workaround that works quite well for us, an improved version of @geoffharcourt's approach (https://github.com/rubycdp/ferrum/issues/272#issuecomment-1276905127).
The https://github.com/kamui/retriable gem is used to wrap the retries (but it's trivial to switch to plain Ruby or another retry gem).
# spec/support/capybara/cuprite/browser_with_auto_restart.rb
module Capybara
  module Cuprite
    module BrowserWithAutoRestart
      EXCEPTIONS_TO_CATCH = [
        Ferrum::DeadBrowserError,
        Ferrum::NoSuchTargetError,
        Ferrum::TimeoutError
      ].freeze

      RETRY_PROC = proc do |exception, _try, _elapsed_time, _next_interval|
        Rails.logger.warn("Browser crashed with error: #{exception.class} (#{Time.current})")
        Rails.logger.warn("Restarting browser...")
        Capybara.current_session.driver.browser.restart
        Capybara.reset!
        Rails.logger.warn("Browser restarted (#{Time.current})")
      end

      def visit(...)
        Retriable.retriable(on: EXCEPTIONS_TO_CATCH, on_retry: RETRY_PROC, tries: 3) do
          super
        end
      end
    end
  end
end

Capybara::Cuprite::Browser.prepend(Capybara::Cuprite::BrowserWithAutoRestart)
# spec/system_helper.rb
# Load general RSpec Rails configuration
require "rails_helper"
# Load configuration files and helpers
require "capybara/rails"
require "capybara/rspec"
require "support/capybara/cuprite/browser_with_auto_restart"
# ...
Hi,
We've been using Cuprite for a long time, and in the last year or so we've started to see an increasing rate of DeadBrowserError failures in our CI pipeline. At first we thought it was RAM, so we increased the amount of RAM on our CI containers, but the failures continued. We also started recording system RAM when we got a failure, and we typically have over 5GB of free RAM when the Ferrum::DeadBrowserError first appears. From that point forward, all future system tests on that container will fail. We tried resetting the browser with reset! and retrying the example with rspec-retry, but still no luck. We've finally gotten around to collecting logs, and I can't find anything conclusive in them. I've captured a log file (very long; I can truncate it, but thought the whole log might be necessary) here from one of our CI containers after it stopped being able to run specs and failed with:
https://gist.github.com/geoffharcourt/b6315e4b7a20bea9c11a0aee755196a8
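For context, the rspec-retry attempt mentioned above looked roughly like this (a sketch, not our exact configuration; option names per the rspec-retry README):

# spec/rails_helper.rb
require "rspec/retry"

RSpec.configure do |config|
  config.verbose_retry = true
  config.exceptions_to_retry = [Ferrum::DeadBrowserError]

  # Re-run each system spec up to twice when a listed exception is raised.
  config.around(:each, type: :system) do |example|
    example.run_with_retry retry: 2
  end
end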
A few notes that might be helpful:
The dead browser error never happens mid-example; it's always on the first page visited in an example.
We use percy-capybara to grab screenshots of pages mid-spec and upload the DOM and assets to Percy, and it uses the installed Chrome. I suspected it might somehow be to blame, but I've seen containers fail where Percy hadn't taken any snapshots (or hadn't taken them yet), and in failed runs the failure doesn't always follow an example where a snapshot was taken.
We use knapsack-pro to parallelize our test suite across multiple machines in CI. It runs RSpec several times during each machine's portion of the test suite, each time on a small subset of our code.
The examples that fail aren't the same files, nor do they consistently have the same preceding examples, from what I can tell. We're really struggling to understand what's going on here; any ideas at all would be a huge help. Thanks for your work on this project, it's really great.