Edward-Knight closed this 9 months ago.
Thank you very much for doing this research and filing issues to track shoring up CI reliability!
The image caching has sped up the CI times on my fork massively, thanks for fixing those issues @indygreg!
Closing this out as CI is now nice and reliable due to the caches :)
Looking at the CI results from this week (2023-06-16 to 2023-06-23), I'm seeing some consistent failures. I've dug into some of them to try to diagnose the problem.
Failures
Linux Failures
The image (xcb.cross) job is failing in every workflow sampled, with other image build jobs failing sporadically. It looks like all jobs are failing for the same reasons, and image (xcb.cross) just happens to pull more packages. All 22 failures occur during apt update or install operations pulling from snapshot.debian.org. I've broken them down by error message (although some logs contain multiple errors):
From my experience with apt and reprepro, I know that:
Since these are "frozen" snapshots, I assume they aren't being updated, so I'm chalking all these failures up to the host being unreliable.
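To reproduce a per-message breakdown like the one above from saved job logs, something like the following Python sketch could work. It assumes apt's usual "E: " and "W: " prefixes for error and warning lines, and it replaces URLs with a placeholder so the same failure against different snapshot paths is grouped together. tally_apt_errors is a hypothetical helper, not part of any existing tooling.

```python
import re
from collections import Counter

# Assumption: apt reports problems on lines prefixed "E: " (error) or
# "W: " (warning); both are counted together here.
APT_MESSAGE = re.compile(r"^(?:E|W): (.*)$")
# URLs differ between runs (snapshot timestamps, package paths), so they are
# replaced with a placeholder before grouping.
URL = re.compile(r"https?://\S+")


def tally_apt_errors(logs):
    """Count distinct apt messages across job logs (hypothetical helper).

    A message is counted at most once per log, but one log can contribute
    to several categories, since some logs contain multiple errors.
    """
    counts = Counter()
    for log in logs:
        seen = set()
        for line in log.splitlines():
            match = APT_MESSAGE.match(line.strip())
            if match:
                seen.add(URL.sub("<url>", match.group(1)))
        counts.update(seen)
    return counts
```

Calling tally_apt_errors(logs).most_common() then lists the messages from most to least frequent.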
I found this on https://wiki.debian.org/BisectDebian:
And the current rules can be seen here: https://salsa.debian.org/dsa-team/mirror/dsa-puppet/-/blob/production/modules/roles/manifests/snapshot_web.pp.
To fix, we could:
- Add Acquire::Retries "5" to build.cross.Dockerfile and base.Dockerfile.
- Set Acquire::Queue-Mode "host" to avoid opening so many connections.
- Enable Debug::Acquire::http and Debug::pkgAcquire::Auth to dig further.
IMO caching the Docker images seems like the way to go; we can use GitHub's Container Registry for this.
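The apt options listed above could be rendered as a config fragment and copied into /etc/apt/apt.conf.d/ by each Dockerfile, or passed per invocation with apt-get's -o flag. Below is a minimal Python sketch of both forms. The helper names apt_conf_fragment and apt_cli_options are hypothetical, the option values are the ones proposed above, and the debug options are left off by default.

```python
def apt_conf_fragment(retries=5, queue_mode="host", debug=False):
    """Render the proposed apt settings in apt.conf syntax."""
    lines = [
        f'Acquire::Retries "{retries}";',
        f'Acquire::Queue-Mode "{queue_mode}";',
    ]
    if debug:
        # Verbose HTTP and authentication logging, for digging further.
        lines += [
            'Debug::Acquire::http "true";',
            'Debug::pkgAcquire::Auth "true";',
        ]
    return "\n".join(lines) + "\n"


def apt_cli_options(retries=5, queue_mode="host"):
    """The same settings as -o flags for a single apt-get call."""
    return [
        "-o", f"Acquire::Retries={retries}",
        "-o", f"Acquire::Queue-Mode={queue_mode}",
    ]


if __name__ == "__main__":
    print(apt_conf_fragment(), end="")
    print(" ".join(["apt-get", *apt_cli_options(), "update"]))
```

For example, the generated text could be saved as a file such as 80-ci-retries (name hypothetical) in the build context and copied to /etc/apt/apt.conf.d/ before the first apt-get update.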
Windows Failures
Both Windows jobs failed in the same way in the "Install Cygwin Environment" step:
We're not using the latest version of this Action, but this code looks unchanged. This looks like a sporadic network failure. It's unclear exactly where the problem lies, but the step could potentially be made reliable with a retry mechanism.
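A retry mechanism of that kind could look like the following generic Python sketch, which wraps a flaky operation in exponential backoff with jitter. It assumes network failures surface as OSError (which covers ConnectionError and TimeoutError). retry is a hypothetical helper; in the actual workflow the equivalent logic would have to wrap the Cygwin installation step itself.

```python
import random
import time


def retry(operation, attempts=5, base_delay=2.0, max_delay=60.0,
          retry_on=(OSError,), sleep=time.sleep):
    """Call operation() until it succeeds, retrying on retry_on exceptions.

    Between tries it waits base_delay * 2**n seconds (capped at max_delay)
    plus up to 50% random jitter. The last exception is re-raised once all
    attempts have failed.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
```

When wrapping a subprocess, for example an installer run with subprocess.run(..., check=True), passing retry_on=(subprocess.CalledProcessError,) would retry on non-zero exit codes as well. The sleep parameter exists so tests can skip the real delays.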
Proposed fixes
- #178
- #179
I believe that once the above issues are resolved, CI should be reliable.