Open GMNGeoffrey opened 1 year ago
Unassigning myself from issues that I'm not actively working on
With https://github.com/iree-org/iree/issues/18238, we're switching to a new cluster based on https://github.com/actions/actions-runner-controller. Temporary home for docs: https://github.com/saienduri/AKS-GitHubARC-Setup.
@saienduri how does the new setup handle updates to the runner image? Do we want to establish some new process for rolling out updates, or are we relying on ARC to do that for us and just trusting it + GitHub?
We pin the version of the GitHub Actions Runner binary that we use when starting up new runners. This is for stability and security reasons: if GitHub publishes a bad or malicious release, we don't want to automatically update to it. The idea is that, on the time scale of human action, GitHub will publish a patch or an alert about the malicious release. Generally we wait at least a week after a release before bumping our version. We don't really do any additional vetting before bumping these versions; we just rely on soak time. It's pretty common for GitHub to release a patch version within a week of a minor release, so we do get some benefit from waiting there. Pinning also means that we have the ability to canary the update before pushing it to all runners. However, it creates a large amount of toil (#11011, #10546, #10246) because we have to update the runners within 30 days of a new version being released.
Proposal: Dynamically select the runner version during setup such that (in order of precedence):
1 gives 3 days for GitHub to report a malicious release.
2 ensures that the runner will be accepted by GitHub.
3 avoids using a release with some known bug.
4 avoids surprising behavior.
5 gives GitHub 1 week to catch and push a fix for a bug.
These can conflict, in which case the earlier constraints take precedence. So we explicitly prioritize the security aspect by requiring first that the release be at least X days old, then that the runner actually work at all, then that the runner be "stable". This can result in not using the latest patch version of a minor release if GitHub pushes patches less than a week apart for the entire 30-day period.
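To make the precedence idea concrete, here is a rough sketch of "filter by each constraint in order, and skip any lower-priority constraint that would leave nothing to choose from". Everything in it is an assumption for illustration: the `Release` type, the `pick_release` name, and the three predicates are inferred from the rationale above, not taken from an existing script. The minimum-age check is treated as the only hard requirement, and "accepted by GitHub" is modeled very roughly as "no newer release has been out for 30 days or more".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

# Hypothetical thresholds, taken from the numbers discussed in this issue.
MIN_AGE = timedelta(days=3)         # time for a malicious release to be reported
SOAK_AGE = timedelta(days=7)        # time for GitHub to ship a bug-fix patch
ACCEPT_WINDOW = timedelta(days=30)  # grace period once a newer release exists


@dataclass(frozen=True)
class Release:
    version: Tuple[int, int, int]
    published: datetime


def pick_release(releases: List[Release], now: datetime) -> Optional[Release]:
    """Pick a runner release, giving earlier constraints precedence."""

    def age(r: Release) -> timedelta:
        return now - r.published

    def accepted(r: Release) -> bool:
        # Assumed model: a runner is rejected once any newer release
        # has been available for ACCEPT_WINDOW or longer.
        return all(age(n) < ACCEPT_WINDOW
                   for n in releases if n.version > r.version)

    def soaked(r: Release) -> bool:
        return age(r) >= SOAK_AGE

    # Hard security constraint: never use a release younger than MIN_AGE.
    candidates = [r for r in releases if age(r) >= MIN_AGE]

    # Softer constraints, in priority order. A constraint that would
    # eliminate every remaining candidate is skipped rather than enforced.
    for preference in (accepted, soaked):
        narrowed = [r for r in candidates if preference(r)]
        if narrowed:
            candidates = narrowed

    # Newest surviving version, or None if nothing passed the hard constraint.
    return max(candidates, key=lambda r: r.version, default=None)
```

With a list where the newest releases are 1 and 4 days old and an earlier one is 11 days old, this returns the 11-day-old release, which matches the "may not be the latest patch" caveat above.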
I think the best way to implement this logic is with a slightly slow-updating mirror. When a new release comes out we only pull it in if it satisfies our conditions. The runners just always pull the latest from the mirror. Maybe we don't even bother mirroring anything but the latest. We could do this logic entirely on the runners themselves, but they'd need something stateful to avoid rolling back revisions due to a patch release.
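As a sketch of the "something stateful" part, the mirror could record the version it currently serves and only ever move forward, so that a selection that later falls back to an older release never causes a rollback. The state file layout, the `advance_mirror` name, and the `parse_version` helper are hypothetical. The sketch also assumes plain `major.minor.patch` version strings with an optional leading `v`.

```python
import json
from pathlib import Path
from typing import Optional, Tuple

Version = Tuple[int, int, int]


def parse_version(text: str) -> Version:
    """Parse 'v2.311.0' or '2.311.0' into a comparable tuple."""
    major, minor, patch = text.lstrip("v").split(".")
    return int(major), int(minor), int(patch)


def advance_mirror(state_file: Path, candidate: Optional[str]) -> Optional[str]:
    """Record `candidate` as the mirrored version only if it is newer.

    Returns the version the mirror should serve after this call.
    """
    current = None
    if state_file.exists():
        current = json.loads(state_file.read_text())["version"]

    # Nothing selected this round: keep serving whatever we had.
    if candidate is None:
        return current

    # Never roll back, even if the selection logic picked an older release.
    if current is not None and parse_version(candidate) <= parse_version(current):
        return current

    state_file.write_text(json.dumps({"version": candidate}))
    return candidate
```

Runners would then read only the mirrored version and would not need any release-selection logic of their own.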