@geoffharcourt were you ever able to figure this out? I've started seeing the same thing. For me, it only happens on GitHub Actions CI; it runs fine locally every time 😕
@lserman we're not on GitHub Actions; we're running these test steps in a Docker container on Buildkite. We kept this debugging on for several weeks, and in every failure there was a huge amount of RAM left, so while this failure has been linked to low memory in the past, this looks like a new cause.
After a lot of further testing, we're fairly sure this is related to opening new browser tabs and the wrong window handles being referenced in the following spec, but we're not 100% sure how to fix it. Our temporary solution is to monkey-patch Cuprite like this:
module Capybara::Cuprite
  class Browser
    # Patched #visit: on a dead browser or timeout, log the current window
    # handles, reset the Capybara session, and re-raise the original error.
    def visit(*args)
      goto(*args)
    rescue Ferrum::DeadBrowserError, Ferrum::TimeoutError => exception
      puts "Window handles at #{exception.class} (#{Time.current}): #{window_handles}"
      puts "Restarting browser..."
      Capybara.reset!
      puts "Window handles just after restart (#{Time.current}): #{window_handles}"
      raise exception
    end
  end
end
Since we added this change (plus some extra debugging to show the window handles in earlier tests), we can see this happens after a spec where two window handles existed (from a popup or a second tab being opened). I suspect the second tab is either not getting closed or is getting closed late, but we're not sure.
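If the second tab is the culprit, one possible experiment (a sketch using Capybara's standard window API, not something validated in this thread) is to explicitly close any non-current windows after each system spec so a stale handle can't leak into the next example:

RSpec.configure do |config|
  config.after(:each, type: :system) do
    # Close every window except the one currently in use, so a leftover
    # popup or second tab can't confuse the next example's handle bookkeeping.
    Capybara.current_session.windows.reject(&:current?).each do |window|
      window.close if window.exists?
    end
  end
end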
@geoffharcourt do you have a debug log for this test when it's failing?
We were running into these issues intermittently as well. We installed minitest-retry in our system test suite helper and passed those two exceptions to its configuration. It seems to be working OK right now, but it's certainly a bit of a hack.
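For reference, a minimal sketch of that minitest-retry setup (assuming the two exceptions are the ones rescued in the monkey-patch above; option names per the minitest-retry README):

# test/test_helper.rb
require "minitest/retry"

# Retry examples that die with a dead browser or timeout instead of
# failing the whole CI run on the first occurrence.
Minitest::Retry.use!(
  retry_count: 3,
  exceptions_to_retry: [Ferrum::DeadBrowserError, Ferrum::TimeoutError]
)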
@geoffharcourt @lancecarlson a log of the issue is still appreciated.
@route what is your recommended way to log the issue? Sorry, I don't have much experience with configuring Ferrum, because I wasn't the one who set up the CI initially.
@lancecarlson You can pass in a file to write the debug log to when you create a browser: link
So for example, you could do Ferrum::Browser.new(logger: $stderr)
or
log = File.open('ferrum.log', 'w')
at_exit { log.close }
browser = Ferrum::Browser.new(logger: log)
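For Cuprite users, the logger can presumably be forwarded through the driver instead, since Cuprite passes its remaining options on to Ferrum (a sketch, not verified in this thread):

log = File.open("ferrum.log", "w")
at_exit { log.close }

Capybara.register_driver(:cuprite) do |app|
  # Cuprite forwards options such as :logger to the underlying Ferrum::Browser.
  Capybara::Cuprite::Driver.new(app, logger: log)
end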
Hi @route, apologies for the wait. We had a run with several dead browser errors this morning. Here's a link to the log output; it's really long, and I couldn't work out what was relevant or useful, so I'm putting it all here for now: https://gist.github.com/geoffharcourt/94f86f75d30b1e2ad90edc5e2f5cba30
No problem @geoffharcourt I'll take a look
FWIW, the workaround I posted above turned this from a pretty frustrating situation into one where a single retry nearly always gets past the failure. That seems like evidence of an issue that isn't the usual out-of-RAM problem.
I went through the log file and didn't spot anything that points to the problem, but what I find odd is this sequence:
7.329515521000076 {"method":"Runtime.callFunctionOn","params":{"returnByValue":true,"objectId":"-8742411280284681957.2.134","functionDeclaration":"function() { return _cuprite.isVisible(this) }","arguments":[]},"id":1366}
7.335774601999901 {"method":"Fetch.enable","params":{"handleAuthRequests":true,"patterns":[{"urlPattern":"*"}]},"id":1367}
7.3531246129998635 {"method":"Target.disposeBrowserContext","params":{"browserContextId":"6DDC3F570DD6804CE57039079ABFCC73"},"id":9}
So we work with a node in the DOM, and then Fetch.enable is suddenly called again to intercept requests; after that, the browser context is disposed.
@geoffharcourt Could you try experimenting with turning off the blacklist/whitelist options and network interception, and see if that makes the issue disappear?
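For example, if the original driver registration passes url_blacklist/url_whitelist, one experiment could be re-registering the driver without them (a sketch; the option names assume Cuprite's documented url_blacklist/url_whitelist driver options):

Capybara.register_driver(:cuprite_no_intercept) do |app|
  Capybara::Cuprite::Driver.new(
    app,
    window_size: [1200, 800]
    # url_blacklist: [...], url_whitelist: [...]  # removed for this experiment
  )
end

Capybara.javascript_driver = :cuprite_no_intercept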
I'll reopen when you provide some more info; without it I can't help.
We see similar issues where Ferrum::DeadBrowserError exceptions happen for some specs while running on CI. Those exceptions are very hard to reproduce locally, which makes them difficult to debug.
I found a workaround that works quite well for us, an improved version of @geoffharcourt's approach (https://github.com/rubycdp/ferrum/issues/272#issuecomment-1276905127).
The https://github.com/kamui/retriable gem is used to wrap the retries (but it's trivial to switch to plain Ruby or another retry gem).
# spec/support/capybara/cuprite/browser_with_auto_restart.rb
module Capybara
  module Cuprite
    module BrowserWithAutoRestart
      EXCEPTIONS_TO_CATCH = [
        Ferrum::DeadBrowserError,
        Ferrum::NoSuchTargetError,
        Ferrum::TimeoutError
      ].freeze

      RETRY_PROC = proc do |exception, _try, _elapsed_time, _next_interval|
        Rails.logger.warn("Browser crashed with error: #{exception.class} (#{Time.current})")
        Rails.logger.warn("Restarting browser...")
        Capybara.current_session.driver.browser.restart
        Capybara.reset!
        Rails.logger.warn("Browser restarted (#{Time.current})")
      end

      def visit(...)
        Retriable.retriable(on: EXCEPTIONS_TO_CATCH, on_retry: RETRY_PROC, tries: 3) do
          super
        end
      end
    end
  end
end

Capybara::Cuprite::Browser.prepend(Capybara::Cuprite::BrowserWithAutoRestart)
# spec/system_helper.rb
# Load general RSpec Rails configuration
require "rails_helper"
# Load configuration files and helpers
require "capybara/rails"
require "capybara/rspec"
require "support/capybara/cuprite/browser_with_auto_restart"
# ...
Hi,
We've been using Cuprite for a long time, and in the last year or so we've started to see an increasing rate of DeadBrowserError failures in our CI pipeline. At first we thought it was RAM, so we increased the amount of RAM on our CI containers, but the failures continued. We also started recording system RAM when we got a failure, and we typically have over 5GB of free RAM when the Ferrum::DeadBrowserError first appears. From that point forward, all future system tests on that container will fail. We tried resetting the browser with reset! and retrying the example with rspec-retry, but still no luck. We've finally gotten around to collecting logs, and I can't find anything conclusive in them. I've captured a log file (very long; I can truncate it, but thought the whole log might be necessary) here from one of our CI containers after it stopped being able to run specs and failed with:
https://gist.github.com/geoffharcourt/b6315e4b7a20bea9c11a0aee755196a8
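For context, the rspec-retry attempt mentioned above looked roughly like this (a sketch, not our exact configuration; option names per the rspec-retry README):

# spec/rails_helper.rb
require "rspec/retry"

RSpec.configure do |config|
  config.verbose_retry = true
  config.exceptions_to_retry = [Ferrum::DeadBrowserError]

  # Re-run each system spec up to twice when a listed exception is raised.
  config.around(:each, type: :system) do |example|
    example.run_with_retry retry: 2
  end
end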
A few notes that might be helpful:
The dead browser error never happens mid-example; it's always on the first page visited in an example.
We use percy-capybara to grab screenshots of pages mid-spec and upload the DOM and assets to Percy, and it uses the installed Chrome. I suspected it might somehow be to blame, but I've seen containers fail where Percy hadn't taken any snapshots (or hadn't taken them yet), and in failed runs the failure doesn't always follow an example where a snapshot was taken.
We use knapsack-pro to parallelize our test suite across multiple machines in CI. It runs RSpec several times during each machine's portion of the test suite, each time on a small subset of our code.
The examples that fail aren't the same files, nor do they consistently have the same preceding examples, from what I can tell. We're really struggling to understand what's going on here; any ideas at all would be a huge help. Thanks for your work on this project, it's really great.