pond opened 4 days ago
As an additional note, should it be useful, we can fork the gem, revert any changes suspected of causing the problem, and run CI on a branch of our main code base that uses our forked Ferrum copy.
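As a sketch of how that pinning might look in Bundler (the org and branch names here are hypothetical placeholders, not a real fork):

```ruby
# Gemfile - point Bundler at a fork of Ferrum with the suspect
# changes reverted. "our-org" and the branch name are placeholders.
gem "ferrum", github: "our-org/ferrum", branch: "revert-suspect-changes"
```

followed by `bundle update ferrum` so the lockfile picks up the fork.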
We've had a very similar experience. We have a large test suite and use custom GitHub Actions runners on AWS EC2 instances to run headless tests. We see ~4/10 runs fail, typically with something like:
1.1) Failure/Error: page.driver.wait_for_network_idle
Ferrum::NoSuchTargetError:
Ferrum::NoSuchTargetError
# /usr/local/bundle/bundler/gems/ferrum-19767d0885af/lib/ferrum/context.rb:51:in `create_target'
# /usr/local/bundle/bundler/gems/ferrum-19767d0885af/lib/ferrum/context.rb:20:in `default_target'
# /usr/local/bundle/gems/cuprite-0.15.1/lib/capybara/cuprite/browser.rb:246:in `attach_page'
# /usr/local/bundle/gems/cuprite-0.15.1/lib/capybara/cuprite/browser.rb:33:in `page'
# /usr/local/bundle/gems/cuprite-0.15.1/lib/capybara/cuprite/driver.rb:262:in `wait_for_network_idle'
We suspect something in Ferrum is causing the issue, but we're finding it difficult to replicate locally. We have also tried pinning Ferrum to the latest master commit, but we see the same failures.
Can you try setting `flatten: false`? That's the recent major change in the release.
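If it helps, a minimal way to try that - assuming Cuprite forwards unrecognised driver options through to `Ferrum::Browser`, and using an arbitrary driver name - would be something like:

```ruby
# Capybara driver registration with Ferrum's new session flattening
# disabled. The driver name here is arbitrary; :flatten is the
# Ferrum 0.15 behaviour being tested.
Capybara.register_driver(:cuprite_no_flatten) do |app|
  Capybara::Cuprite::Driver.new(app, flatten: false)
end
Capybara.javascript_driver = :cuprite_no_flatten
```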
Describe the bug
I'm not sure how to supply a useful bug report here given the bizarre behaviour, but: we have a large RSpec test suite, including lots of headless Chrome tests, that runs on AWS CI (CodeBuild / CodePipeline) triggered off GitHub commits. Recently we updated our bundle, which took Cuprite from 0.15.0 to 0.15.1; this in turn requires Ferrum 0.15.0. Our test suite started failing spectacularly, but intermittently (maybe a 70% failure rate, at arbitrary points in the suite, apparently regardless of seed). I'll explain more elsewhere in the template since this section is meant to be "brief" - we see this:
Part of Ferrum then appears to have crashed - or the Chrome instance has - because all subsequent tests fail with:
To Reproduce
This is the problem; it replicates easily in AWS CI but we can't reproduce it locally. It seems that a `TimeoutError` comes from `client.rb` line 90, exactly as it does if I were to, say, deliberately set the `timeout` option to something very low. When I do this on my local machine, or in CI in an attempt to provoke replication, I just see things failing "as expected" with:

In case it is important - we then notice that after a few hundred failures like this, the message suddenly changes and subsequent tests say:
I do not know if this is important, or just an unrelated minor bug arising from setting `timeout` so low - perhaps Ferrum not closing down old Chromium instances if its comms time out too soon - since it occurred while we were trying, and failing, to replicate the nasty crash error.
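For reference, a sketch of how one might set the timeout very low via the driver options - assuming Cuprite passes `timeout:` through to Ferrum, and with an arbitrary driver name:

```ruby
# Deliberately tiny CDP timeout to try to provoke Ferrum::TimeoutError.
# The driver name is arbitrary; :timeout is forwarded to Ferrum::Browser.
Capybara.register_driver(:cuprite_tiny_timeout) do |app|
  Capybara::Cuprite::Driver.new(app, timeout: 1) # seconds
end
```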
We have yet to persuade a local run to show the unhandled exception that crashes something out badly, with the "canary" error seen in CI of:
Since this comes from the same piece of code, it's hard to see how it could arise in such different ways, unless perhaps the code paths used to reach this part of `client.rb` are very different in each case and one is missing an exception handler.

Expected behavior
I would not expect timeouts at all. The suite should run normally. It does with Ferrum 0.14.0 and has for many years with that and prior versions. 0.15.0 introduces the new behaviour. We've pinned to Cuprite 0.15.0 / Ferrum 0.14.0 for now, and CI is working as usual.
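For anyone wanting the same workaround, the lockstep pin is just:

```ruby
# Gemfile - hold the pair at the last known-good versions
gem "cuprite", "0.15.0"
gem "ferrum", "0.14.0"
```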
If a timeout really did happen, then I'd expect it to be handled in the usual way:
...rather than a thread termination.
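To illustrate the expected handling - this is a runnable sketch using a stand-in error class and a hypothetical helper, not the gem's real plumbing - a timeout should surface as a rescuable exception at the call site:

```ruby
# Stand-in so the sketch runs without the gem installed; the real
# class is Ferrum::TimeoutError from the ferrum gem.
module Ferrum
  TimeoutError = Class.new(StandardError)
end

# Hypothetical helper: the timeout is rescuable at the call site
# rather than terminating a thread.
def wait_for_idle_with_rescue
  raise Ferrum::TimeoutError, "Timed out waiting for response" # simulated
rescue Ferrum::TimeoutError => e
  warn "Ferrum timed out: #{e.message}" # log, retry, or fail the example
  :timed_out
end

wait_for_idle_with_rescue # => :timed_out
```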
Screenshots
It doesn't really help much, but to prove we're not making it up :joy: here's an AWS CI screenshot from the point where things break.
Desktop (please complete the following information):
Additional context
Note that we don't think this is Cuprite, but since Cuprite 0.15.0 only works with Ferrum 0.14.0 and not 0.15.0 and, conversely, Cuprite 0.15.1 only works with Ferrum 0.15.0 and not 0.14.0, we can only upgrade or downgrade those two gems in lockstep. I couldn't see anything in Cuprite's `CHANGELOG.md` that looked like it might be a cause, but quickly saw in Ferrum's (excellent, detailed) `CHANGELOG.md` some potential causes, yet these could be red herrings. All of them are in one big PR: #432 - a large change set related to comms, threading and exceptions.

There is a WebSocket constraint change, but WebSocket was and is still on 0.7 (latest) and works fine, so this is not involved.