heroku / heroku-buildpack-python

Heroku's buildpack for Python applications.
https://www.heroku.com/python
MIT License

Enable rspec-retry in CI #1594

Closed edmorley closed 1 month ago

edmorley commented 1 month ago

In the last week or two, the Heroku-24 jobs in CI have started failing a very high percentage of the time due to build log output assertion errors. These assertions fail because some expected log lines are intermittently missing from the build logs. Whilst the failure rate of any individual build is low, CI runs ~60 builds, so the chance of at least one of them failing is high, which results in a near-permanent failure rate for the CI run as a whole.
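To illustrate why a low per-build failure rate still makes the overall run fail most of the time, here is a rough sketch of the arithmetic. It assumes the builds fail independently, and the per-build rates are hypothetical (the real rate wasn't measured here); for example, a 5% per-build rate across 60 builds gives roughly a 95% chance that at least one build fails.

```ruby
# Probability that at least one of `builds` independent builds fails,
# given that each one fails with probability `per_build_failure_rate`.
# Independence between builds is an assumption made for illustration.
def ci_run_failure_probability(per_build_failure_rate, builds)
  1.0 - (1.0 - per_build_failure_rate)**builds
end

# Hypothetical per-build failure rates; 60 is the approximate build count in CI.
[0.01, 0.03, 0.05].each do |rate|
  probability = ci_run_failure_probability(rate, 60)
  puts format("per-build %.0f%% -> whole CI run %.0f%%", rate * 100, probability * 100)
end
```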

Attempts to reproduce the issue locally using a minimal logging-only inline buildpack testcase initially failed. I eventually found that the issue only occurs when there is a reasonably long delay (e.g. 10 seconds) between one log entry and the next, in which case the first message (and sometimes the first two) after the log output resumes can be dropped.
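As a rough sketch of the shape of such a reproduction (not the actual testcase used here), the following emits two bursts of numbered log lines separated by a quiet period. The method name, line format and parameters are all hypothetical. It returns the lines it wrote, so they can be compared against whatever the streamed log contained.

```ruby
# Hypothetical logging-only step: write two bursts of numbered lines to `io`,
# separated by a quiet period of `quiet_seconds`, and return the lines written
# so a caller can check which of them made it into the streamed output.
def emit_log_bursts(io, quiet_seconds:, lines_per_burst: 3)
  expected = []
  2.times do |burst|
    lines_per_burst.times do |index|
      line = "burst #{burst + 1} line #{index + 1}"
      io.puts(line)
      io.flush # flush each line so buffering doesn't hide the delay
      expected << line
    end
    # Only pause between the bursts, not after the final one.
    sleep(quiet_seconds) if burst.zero?
  end
  expected
end
```

Calling `emit_log_bursts($stdout, quiet_seconds: 10)` would mirror the 10 second gap described above; any lines missing from the streamed log can then be found with `expected - streamed_output.lines.map(&:chomp)`.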

This issue only affects the logs streamed via endosome, and does not affect the stored log on S3.

Whilst the intermittent failures in CI were only being seen on Heroku-24, the issue also reproduces on older stacks when using fixed sleep times in this testcase. As such, this appears to be an existing logging bug that has only been exposed on Heroku-24 because its builds run on different infrastructure, which presumably now makes the time between some build steps (and so between log output entries) longer.

The builds team are looking into both the existing logging bug and the performance variance of the new build infrastructure. However, to make CI green in the meantime we'll have to use rspec-retry: https://github.com/NoRedInk/rspec-retry
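For reference, a minimal sketch of what enabling rspec-retry in the spec helper might look like, using the configuration options documented in the rspec-retry README. The file location and the retry count of 2 are assumptions, not the settings actually adopted in this repository.

```ruby
# spec/spec_helper.rb (assumed location)
require "rspec/retry"

RSpec.configure do |config|
  # Log each retry so that flaky examples remain visible in the CI output.
  config.verbose_retry = true
  # Show the failure message from each failed attempt, not just the last one.
  config.display_try_failure_messages = true
  # Total attempts per example (the initial try plus one retry).
  # The value 2 is an assumption chosen for illustration.
  config.default_retry_count = 2
end
```

Note that `default_retry_count` counts tries rather than retries, so a value of 2 means each example runs at most twice.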

(rspec-retry differs from Hatchet's built-in retry mechanism in that it will also retry failures that occur outside of the deploy itself, and it doesn't suffer from the retried build potentially being a cached rebuild, which would throw off the test assertions.)

See also: https://salesforce-internal.slack.com/archives/C01068P24S3/p1717769443967999

GUS-W-15978877.