microsoft / azurelinux

Linux OS for Azure 1P services and edge appliances
MIT License

Fix bad interactions between timeouts and build retries #10480

Closed dmcilvaney closed 2 months ago

dmcilvaney commented 2 months ago
Merge Checklist

All boxes should be checked before merging the PR (just tick any boxes which don't apply to this PR)


Summary

When we queue a package to build (or test), we set a timeout (8h by default). If the build has not finished by then, we forcibly stop the build and mark it as failed.

We also support PACKAGE_BUILD_RETRIES and CHECK_BUILD_RETRIES, which will cause failed builds to re-run.

However, each time a retry was triggered, the timeout would reset. For example, in the buddy builds this means that a stuck package test could run for 4x8=32h, which would exceed pipeline time limits. We want to exit gracefully with an error state so that we can generate and publish logs correctly. If the pipeline itself forces the timeout, the failure can be difficult to debug.

Instead of resetting the timeout with each retry, have all attempts share a single timeout. If the timeout is exceeded, stop retrying (use RunWithLinearBackoff(), which takes a ctx configured with a timeout, so we can break out early).

As part of this fix, I also noticed that the timeout handling was not cleaning up the build chroot correctly. We should not be using anything related to panic() for error handling; instead, use logger.Log.Fatal*(), which gives the logging library a chance to run its registered cleanup functions (i.e. final chroot cleanup) before exiting "gracefully".
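
The cleanup-before-exit pattern can be sketched roughly as follows. `fatalLogger`, `RegisterExitHook`, and `Fatalf` here are hypothetical stand-ins written for this example; they are not the toolkit's logger package or any real logging library's API. The sketch only assumes that a fatal log call runs registered cleanup functions and then exits, which is the behavior described above.

```go
package main

import "fmt"

// fatalLogger is a hypothetical stand-in for a logging library that runs
// registered cleanup hooks before terminating the process.
type fatalLogger struct {
	hooks []func()
	exit  func(code int) // production code would set this to os.Exit
}

// RegisterExitHook records a cleanup function to run on a fatal error.
func (l *fatalLogger) RegisterExitHook(hook func()) {
	l.hooks = append(l.hooks, hook)
}

// Fatalf logs the message, runs every registered hook in order, then exits.
// A bare panic() would not run these hooks.
func (l *fatalLogger) Fatalf(format string, args ...any) {
	fmt.Printf("FATAL: "+format+"\n", args...)
	for _, hook := range l.hooks {
		hook()
	}
	l.exit(1)
}

func main() {
	// The exit function is replaced with a print so the sketch ends normally.
	log := &fatalLogger{exit: func(code int) { fmt.Println("exit code:", code) }}

	// Stand-in for the final chroot cleanup.
	log.RegisterExitHook(func() { fmt.Println("cleaning up build chroot") })

	log.Fatalf("build timed out after %s", "8h")
}
```

Making the exit function injectable is only a convenience for demonstrating and testing the ordering: the cleanup hook runs after the message is logged and before the process would terminate.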

Change Log
Does this affect the toolchain?

NO

Associated issues
Test Methodology

(Added a custom %check with sleep 9h to the words package)