openfoodfoundation / openfoodnetwork

Connect suppliers, distributors and consumers to trade local produce.
https://www.openfoodnetwork.org
GNU Affero General Public License v3.0

[Flaky] Ferrum::ProcessTimeoutError #10018

Open filipefurtad0 opened 1 year ago

filipefurtad0 commented 1 year ago

What we should change and why (this is tech debt)


Failures:

  1) Credit Cards as a logged in user passes the smoke test
     Failure/Error: visit "/account"

     Ferrum::ProcessTimeoutError:
       Browser did not produce websocket url within 20 seconds, try to increase `:process_timeout`. See https://github.com/rubycdp/ferrum#customization

     [Screenshot Image]: /home/runner/work/openfoodnetwork/openfoodnetwork/tmp/capybara/screenshots/failures_r_spec_example_groups_credit_cards_as_a_logged_in_user_passes_the_smoke_test_393.png

Context

https://github.com/openfoodfoundation/openfoodnetwork/actions/runs/3487271563/jobs/5834704950

Impact and timeline

filipefurtad0 commented 1 year ago

This is also coming up on this branch - not yet in master: https://github.com/openfoodfoundation/openfoodnetwork/actions/runs/3489852498/jobs/5840503897

I wonder if this somehow relates to GH Actions?

EDIT: or perhaps https://github.com/openfoodfoundation/openfoodnetwork/pull/9986?

filipefurtad0 commented 1 year ago

Re-occurred here: https://github.com/openfoodfoundation/openfoodnetwork/actions/runs/3515094395/jobs/5889927725

jibees commented 1 year ago

Each time a different spec.

sigmundpetersen commented 1 year ago

Yes, #9986 seems relevant, with this change re. timeouts https://github.com/rubycdp/cuprite/pull/215

dacook commented 1 year ago

Hmm, so it doesn't seem to be related to the new Knapsack setup. Thanks for digging up the recent changes Sigmund, that seems to be related. I guess the next step then is to try downgrading Cuprite to see if that resolves the issue. If so, we could submit a bug to Cuprite, which hopefully can be resolved. If not, then we can try increasing the timeout.
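
For reference, a minimal sketch of what increasing the timeout could look like. It assumes the driver is registered through `Capybara::Cuprite::Driver.new`, which accepts Ferrum's `:process_timeout` and `:timeout` options. The helper name `cuprite_options`, the `CUPRITE_*` environment variables and the values 60 and 30 are made up for illustration and are not taken from our `cuprite_setup.rb`.

```ruby
# Hypothetical helper: builds the options hash passed to the Cuprite driver.
# Accepts any Hash-like object so it can be tried without touching ENV.
def cuprite_options(env = ENV)
  {
    # Seconds Ferrum waits for Chrome to start and print its websocket URL.
    # The failure above hit a 20 second limit; 60 is an arbitrary, larger guess.
    process_timeout: Integer(env.fetch("CUPRITE_PROCESS_TIMEOUT", "60")),
    # Seconds Ferrum waits for a response from the browser once it is running.
    timeout: Integer(env.fetch("CUPRITE_TIMEOUT", "30")),
  }
end

# Register the driver only when Cuprite is loaded, so this file can also be
# evaluated on its own.
if defined?(Capybara::Cuprite::Driver)
  Capybara.register_driver(:cuprite) do |app|
    Capybara::Cuprite::Driver.new(app, **cuprite_options)
  end
end
```

Reading the values from the environment would let CI use a longer timeout than local runs without changing the spec support files. Whether a longer timeout hides the problem rather than fixing it is a separate question.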

filipefurtad0 commented 1 year ago

As I understand from this discussion, what the Cuprite bump introduces is extra output that provides more information when the timeout error occurs. So downgrading it should only remove that output, and probably not fix the error.

The good news is that it is not introduced by Knapsack; the bad news is that it still sometimes occurs even after lowering the number of nodes, like here: https://github.com/openfoodfoundation/openfoodnetwork/actions/runs/3647209594/jobs/6159168431

I've contacted Knapsack support for advice and will continue to investigate. I've also opened an issue on the Ferrum repo.

filipefurtad0 commented 1 year ago

I've reproduced a similar error locally:

  1) 
    As an admin
    I want to set a supplier and distributor(s) for a product
 as anonymous user is redirected to login page when attempting to access product listing
     Failure/Error: expect { visit spree.admin_products_path }.not_to raise_error

       expected no Exception, got #<Ferrum::TimeoutError: Ferrum::TimeoutError> with backtrace:
         # ./spec/system/admin/products_spec.rb:25:in `block (4 levels) in <main>'
         # ./spec/system/admin/products_spec.rb:25:in `block (3 levels) in <main>'
         # ./spec/system/support/cuprite_setup.rb:41:in `block (2 levels) in <main>'
         # -e:1:in `<main>'

This happened while running three terminal windows, and repeating the same example in parallel, using the ./script/rspec-slow-repeat script.

One idea could be to split the system tests across two runner machines on GitHub Actions, one for /admin and the other for /consumer tests. I'll make a PR and see if it still occurs.

Also, maybe relevant:

We use Ubuntu 20.04, but the macOS runners seem to have better specs (+1 CPU core, +7 GB RAM), as indicated here:

Hardware specification for Windows and Linux virtual machines:

    2-core CPU (x86_64)
    7 GB of RAM
    14 GB of SSD space

Hardware specification for macOS virtual machines:

    3-core CPU (x86_64)
    14 GB of RAM
    14 GB of SSD space

I wonder if migrating the build to macOS would improve this?

sigmundpetersen commented 1 year ago

Do I understand this table https://docs.github.com/en/actions/learn-github-actions/usage-limits-billing-and-administration#usage-limits correctly, in that we would get only 5 concurrent jobs if we migrated to macOS?

filipefurtad0 commented 1 year ago

Humm, seems to be that way indeed. In that case, I guess we're better off with the 60 concurrent jobs in Ubuntu :+1:

sigmundpetersen commented 1 year ago

Let's keep an eye on this after merging #10127 and close if it doesn't reoccur.

sigmundpetersen commented 1 year ago

I'm afraid this happened (a lot) again https://github.com/openfoodfoundation/openfoodnetwork/actions/runs/3702598137 😭

filipefurtad0 commented 1 year ago

Thanks for reporting @sigmundpetersen - let's move it back to In Dev, in that case :+1:

filipefurtad0 commented 1 year ago

Just to be sure @sigmundpetersen, when you say "a lot", you mean it happened 4 times in the same build run, like the example you've pointed out. Is this correct?

sigmundpetersen commented 1 year ago

> Just to be sure @sigmundpetersen, when you say "a lot", you mean it happened 4 times in the same build run, like the example you've pointed out. Is this correct?

Exactly

Haven't seen it much elsewhere lately on the master build though. So maybe just a one-off? Maybe the GitHub Actions servers/nodes were very busy during that specific build? We could just let the issue sit for a while and monitor the frequency. What do you think?

There's also the Ferrum::DeadBrowserError happening once in a while:

6) Product Import when dealing with uploaded files handles cases where files contain malformed data
     Got 0 failures and 3 other errors:

     6.1) Failure/Error: let!(:enterprise) { create(:supplier_enterprise, owner: user, name: "User Enterprise") }

          ActiveRecord::RecordInvalid:
            Validation failed: Name has already been taken. If this is your enterprise and you would like to claim ownership, or if you would like to trade with this enterprise please contact the current manager of this profile at sharee.heidenreich@flatley.co.uk.
          # <internal:kernel>:90:in `tap'
          # ./spec/system/admin/product_import_spec.rb:14:in `block (2 levels) in <main>'
          # ./spec/system/support/cuprite_setup.rb:41:in `block (2 levels) in <top (required)>'

     6.2) Failure/Error: return super unless Capybara.last_used_session

          Ferrum::DeadBrowserError:
            Browser is dead or given window is closed
          # <internal:kernel>:90:in `tap'
          # ./spec/system/support/cuprite_helpers.rb:25:in `take_screenshot'
          # ./spec/system/support/cuprite_setup.rb:41:in `block (2 levels) in <top (required)>'

     6.3) Failure/Error: example.run

          Ferrum::DeadBrowserError:
            Browser is dead or given window is closed
          # ./spec/system/support/cuprite_setup.rb:41:in `block (2 levels) in <top (required)>'

https://github.com/openfoodfoundation/openfoodnetwork/actions/runs/3687395479/jobs/6240930191

Should we file an issue on it?

filipefurtad0 commented 1 year ago

> Haven't seen it much elsewhere lately on the master build though. So maybe just a one-off? Maybe the GitHub Actions servers/nodes were very busy during that specific build?

Could be. I have not seen it happening much either.

> We could just let the issue sit for a while and monitor the frequency. What do you think?

Agree, let's do that :+1: I'll move to tech debt prioritized instead.

> Ferrum::DeadBrowserError: Should we file an issue on it?

This has been reported on at least these two occasions, here and here, and a related report is here. The consensus seems to be that it is a RAM issue, which I guess is external to us.

Although this was not introduced by Knapsack, I've reached out to them and received advice on what could potentially improve the situation - maybe this is good to keep in mind:

filipefurtad0 commented 1 year ago

I noticed this one yesterday and again today: https://github.com/openfoodfoundation/openfoodnetwork/actions/runs/3901037095/jobs/6662406732

sigmundpetersen commented 1 year ago

Could we try adding the no-sandbox browser option to the Cuprite config? Like here

It's also mentioned in the README for use with Docker, and I think GH Actions runs in Docker.
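
Something like the following is what I have in mind, as a rough sketch only. It assumes Cuprite forwards `:browser_options` to Chrome as command-line flags, with `nil` as the value for flags that take no argument, as shown in the Cuprite README. The helper name `cuprite_browser_options` is made up, and relying on the `CI` environment variable (which GitHub Actions sets) is my assumption about how we would want to scope it.

```ruby
# Hypothetical helper: extra Chrome flags for the Cuprite driver.
# The sandbox is only disabled on CI, so local runs keep Chrome's default.
def cuprite_browser_options(ci: ENV.key?("CI"))
  options = {}
  # Flags without a value are given as keys mapped to nil.
  options["no-sandbox"] = nil if ci
  options
end

# Register the driver only when Cuprite is loaded, so this file can also be
# evaluated on its own.
if defined?(Capybara::Cuprite::Driver)
  Capybara.register_driver(:cuprite) do |app|
    Capybara::Cuprite::Driver.new(app, browser_options: cuprite_browser_options)
  end
end
```

I'm not sure this addresses the websocket timeout, but it should be cheap to try on a branch and compare how often the error shows up.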