Merge Checklist
All boxes should be checked before merging the PR (just tick any boxes which don't apply to this PR)
[x] The toolchain has been rebuilt successfully (or no changes were made to it)
[x] The toolchain/worker package manifests are up-to-date
[x] Any updated packages successfully build (or no packages were changed)
[x] Packages depending on static components modified in this PR (Golang, *-static subpackages, etc.) have had their Release tag incremented.
[x] Package tests (%check section) have been verified with RUN_CHECK=y for existing SPEC files, or added to new SPEC files
[x] All package sources are available
[x] cgmanifest files are up-to-date and sorted (./cgmanifest.json, ./toolkit/scripts/toolchain/cgmanifest.json, .github/workflows/cgmanifest.json)
[x] LICENSE-MAP files are up-to-date (./LICENSES-AND-NOTICES/SPECS/data/licenses.json, ./LICENSES-AND-NOTICES/SPECS/LICENSES-MAP.md, ./LICENSES-AND-NOTICES/SPECS/LICENSE-EXCEPTIONS.PHOTON)
[x] All source files have up-to-date hashes in the *.signatures.json files
[x] sudo make go-tidy-all and sudo make go-test-coverage pass
[x] Documentation has been updated to match any changes to the build system
[x] Ready to merge
Summary
When we queue a package to build (or test), we set a timeout (by default 8h). If the build has not finished by then we forcibly stop the build and mark it as failed.
We also support PACKAGE_BUILD_RETRIES and CHECK_BUILD_RETRIES, which will cause failed builds to re-run.
However, the timeout was reset each time a retry was triggered. In the buddy builds, for example, this means that a stuck package test could run for 4 x 8h = 32h, which would exceed pipeline time limits. We want to exit gracefully with an error state so that we can generate and publish logs correctly. If the pipeline enforces the timeout instead, the failure can be difficult to debug.
Instead of resetting the timeout with each retry, have all attempts share a single timeout. If the timeout is exceeded, stop retrying (use RunWithLinearBackoff(), which takes a ctx configured with a timeout, so we can break out early).
As part of this fix, I also noticed that the timeout handling was not cleaning up the build chroot correctly. We should not be using anything related to panic() for error handling. Instead, use logger.Log.Fatal*(), which gives the logging library a chance to run its registered cleanup functions (i.e. final chroot cleanup) before exiting "gracefully".
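As a rough illustration of why the fatal-logging path is preferable, here is a minimal model of a logger that runs registered cleanup hooks before exiting. The fatalLogger type and its methods are hypothetical and written only for this sketch; they are not the toolkit's logger API.

```go
package main

import (
	"fmt"
	"os"
)

// fatalLogger is a hypothetical, minimal model of a logger whose fatal path
// runs registered cleanup hooks before the process exits.
type fatalLogger struct {
	hooks []func()
	exit  func(code int) // os.Exit in a real program; replaceable for demos and tests
}

// RegisterCleanup adds a hook that Fatalf runs before exiting.
func (l *fatalLogger) RegisterCleanup(hook func()) {
	l.hooks = append(l.hooks, hook)
}

// Fatalf logs the message, runs every registered hook in registration order,
// and only then exits with a non-zero status.
func (l *fatalLogger) Fatalf(format string, args ...interface{}) {
	fmt.Fprintf(os.Stderr, "FATAL: "+format+"\n", args...)
	for _, hook := range l.hooks {
		hook()
	}
	l.exit(1)
}

func main() {
	// The demo swaps os.Exit for a print so the example terminates normally.
	log := &fatalLogger{exit: func(code int) { fmt.Println("would exit with code", code) }}
	log.RegisterCleanup(func() { fmt.Println("cleaning up build chroot") })
	log.Fatalf("package build timed out after %s", "8h")
}
```

The design point is that cleanup is tied to the exit path itself, so any code that reports a fatal error gets the chroot cleanup without having to remember it.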
Change Log
Package build timeout is now shared by all retry attempts. Each invocation of BuildAgent.BuildPackage() now takes a time.Duration instead of using the value from BuildAgentConfig.
Properly clean up the build chroot on timeout:
Handle the timeout logic inside chroot.Run so we correctly exit the chroot before leaving the function. Otherwise the chroot cleanup code runs from within the chroot itself and the paths will be wrong.
Add a new StopAllChildProcesses(), which is like PermanentlyStopAllChildProcesses() but does not set the disable flag (so we can still run the gpg-agent cleanup on exit).
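The difference between the two stop functions can be modelled as below. This is a sketch under assumptions: processRegistry and its fields are invented for illustration, the "disable flag" is assumed to block the creation of new child processes, and the real toolkit signals OS processes rather than calling stop callbacks.

```go
package main

import (
	"fmt"
	"sync"
)

// processRegistry is a hypothetical model of child-process tracking.
type processRegistry struct {
	mu       sync.Mutex
	disabled bool              // assumed meaning: no new children may be started
	running  map[string]func() // child name -> function that stops it
}

func newProcessRegistry() *processRegistry {
	return &processRegistry{running: make(map[string]func())}
}

// Start registers a child, or fails if the registry was permanently disabled.
func (r *processRegistry) Start(name string, stop func()) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.disabled {
		return fmt.Errorf("cannot start %q: child processes are disabled", name)
	}
	r.running[name] = stop
	return nil
}

// stopAllLocked stops every tracked child; the caller must hold r.mu.
func (r *processRegistry) stopAllLocked() int {
	stopped := len(r.running)
	for name, stop := range r.running {
		stop()
		delete(r.running, name)
	}
	return stopped
}

// StopAllChildProcesses stops current children but still allows new ones,
// e.g. a later cleanup command.
func (r *processRegistry) StopAllChildProcesses() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.stopAllLocked()
}

// PermanentlyStopAllChildProcesses stops current children and sets the
// disable flag so nothing new can be started afterwards.
func (r *processRegistry) PermanentlyStopAllChildProcesses() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.disabled = true
	return r.stopAllLocked()
}

func main() {
	reg := newProcessRegistry()
	_ = reg.Start("rpmbuild", func() {})
	fmt.Println(reg.StopAllChildProcesses())                      // prints: 1
	fmt.Println(reg.Start("gpg-agent cleanup", func() {}) == nil) // prints: true
	reg.PermanentlyStopAllChildProcesses()
	fmt.Println(reg.Start("rpmbuild", func() {}) == nil) // prints: false
}
```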
Does this affect the toolchain?
NO
Associated issues
Test Methodology
Added a custom %check containing sleep 9h to the words package.