CI : default_cert test fails way too often / randomly on GitHub Actions

buchdag commented 3 years ago

The default_cert test fails so often and so randomly on GitHub Actions that I had to remove it. GHA does not really have an "allow failure" option like Travis or other CI systems, nor a mechanism to restart single tests, which means I sometimes end up restarting the whole test run 6 or 8 times in a row.

The test appears to pass locally, but I haven't run the tests locally very often lately, so I'm not 100% sure.

I tried doubling the timeout before the test fails (from 60 to 120 seconds) to no avail, and I doubt this was a timeout issue to begin with. It might very well be an issue with the feature itself.

If anyone is willing to investigate this, help would be appreciated.

polarathene commented 2 years ago

> GHA does not really have an "allow failure" like Travis or other CI systems

Yes it does, see how I've done so here. Use continue-on-error: true. Another approach, which I handle in a separate workflow, is to ensure a step always runs by using if: ${{ always() }}.
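For illustration, a minimal sketch of how those two keys could look in a workflow. This is not the workflow linked above: the workflow name, the step names, and the test command are all placeholders.

```yaml
# Hypothetical workflow; names and commands are placeholders.
name: Tests
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # If this step fails, the failure is recorded but the job carries on
      # and is not marked as failed because of it.
      - name: Run default_cert test (allowed to fail)
        continue-on-error: true
        run: ./run-tests.sh default_cert  # placeholder command

      # always() makes this step run even when an earlier step has failed.
      - name: Show containers for debugging
        if: ${{ always() }}
        run: docker ps --all
```

continue-on-error can also be set at the job level, which is probably the closest match to Travis' allow_failures when each test runs as its own job.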

> nor a mechanism to restart single tests, which means I end up restarting the whole test run sometimes 6 or 8 times in a row.

I believe that's possible, but it's not something I've tried myself. If you can return output about the failure and what test needs to be run again, I think that can be used to trigger a re-run with the returned failures as new input for the test to only cover.

It would probably be similar to how I've split one workflow into two (build and deploy, for PR doc previews): the 2nd part only runs when the 1st part has completed successfully. Note the job condition: if: ${{ github.event.workflow_run.event == 'pull_request' && github.event.workflow_run.conclusion == 'success' }}. The 1st part of the split workflow also ensures stale runs are canceled (for example, when a new commit is pushed to a PR while a docs preview build is still running).
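A rough sketch of that split, assuming two workflow files. The file names, workflow names, and step commands are hypothetical, and canceling stale runs via the concurrency key is just one way to do it, not necessarily how the workflow mentioned above does it.

```yaml
# --- hypothetical file: .github/workflows/build.yml (1st part) ---
name: Build
on: [pull_request]

# Cancel a still-running build when a newer commit arrives on the same ref.
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: echo "build step goes here"  # placeholder
---
# --- hypothetical file: .github/workflows/deploy.yml (2nd part) ---
name: Deploy
on:
  workflow_run:
    workflows: ["Build"]   # must match the name of the 1st workflow
    types: [completed]

jobs:
  deploy:
    runs-on: ubuntu-latest
    # Only proceed for pull request builds that finished successfully.
    if: ${{ github.event.workflow_run.event == 'pull_request' && github.event.workflow_run.conclusion == 'success' }}
    steps:
      - run: echo "deploy step goes here"  # placeholder
```

The same pattern could in principle gate a "retry failed tests" workflow on conclusion == 'failure' instead, though that is untested here.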

> If anyone is willing to investigate this, help would be appreciated.

I don't have time myself to contribute towards that, but I can say that I've found using bats to be pretty great for running shell script based tests. I'm slowly refactoring our test-suite, but a good example that I recently covered was our test for DH params.

We have a variety of helper functions that you're welcome to use :)

buchdag commented 2 years ago

Thanks for the tips @polarathene, I'll look into all of this 😃

polarathene commented 2 years ago

While working on a PR for nginx-proxy, I noticed they migrated away from a bats test suite to a Python one. Since you're maintaining both projects from what I understand, perhaps it would make sense to adopt that here (if you or any other maintainer ever finds the time to rewrite/port the tests).

buchdag commented 2 years ago

That's pretty much what I had in mind long term (migrating to pytest instead of the jury-rigged bash test suite I wrote), but yeah, that will be some heavy work.

nginx-proxy / acme-companion

CI : default_cert test fails way too often / randomly on GitHub Actions #786