indygreg / python-build-standalone

Produce redistributable builds of Python
BSD 3-Clause "New" or "Revised" License

GitHub Actions CI unreliability #177

Closed Edward-Knight closed 9 months ago

Edward-Knight commented 1 year ago

Looking at the CI results from this week (2023-06-16 – 2023-06-23), I'm seeing some consistent failures. I've dug into some of them to try to diagnose the problem.

Failures

Linux Failures

The image (xcb.cross) job is failing in every workflow sampled, with other image build jobs failing sporadically. It looks like all jobs are failing for the same reasons, and image (xcb.cross) just happens to pull more packages. All 22 failures occur during apt update or install operations that pull from snapshot.debian.org. I've broken them down by the error message (although some logs contain multiple errors):

From my experience with apt and reprepro, I know that:

Since these are "frozen" snapshots, I assume they aren't being updated, so I'm chalking all these failures up to the host being unreliable.

I found this on https://wiki.debian.org/BisectDebian:

At least as of 2019 snapshot.debian.org blocks access from AWS and has rate-limits for all addresses making installing from snapshots.debian.org unreliable

And the current rules can be seen here: https://salsa.debian.org/dsa-team/mirror/dsa-puppet/-/blob/production/modules/roles/manifests/snapshot_web.pp.

To fix, we could:

IMO caching the Docker images seems like the way to go; we can use GitHub's Container Registry for this.
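To make the caching idea concrete, here is a minimal sketch of one way to key a cached image: derive the tag from a hash of the Dockerfile contents, so an image only needs rebuilding when its Dockerfile changes. This is an assumption about how such a cache could work, not a description of what this repository does; `cache_tag` and the example image name are hypothetical.

```python
import hashlib


def cache_tag(image_name: str, dockerfile_contents: bytes, length: int = 16) -> str:
    """Return a registry tag derived from the Dockerfile contents.

    Hypothetical helper: identical contents always map to the same tag,
    so a CI job can try to pull that tag before falling back to a build.
    """
    digest = hashlib.sha256(dockerfile_contents).hexdigest()
    # Truncate the hex digest to keep the tag short but still distinctive.
    return f"{image_name}:{digest[:length]}"
```

A job could call something like `cache_tag("ghcr.io/OWNER/xcb.cross", contents)` (placeholder name), attempt to pull that tag, and only build and push when the pull fails. That keeps snapshot.debian.org out of the common path.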

Windows Failures

Both Windows jobs failed in the same way in the "Install Cygwin Environment" step:

Invoke-WebRequest https://cygwin.com/setup-$platform.exe -OutFile C:\setup.exe
  shell: C:\Windows\System32\WindowsPowerShell\v1.0\powershell.EXE -command ". '{0}'"
Invoke-WebRequest : The underlying connection was closed: Could not establish trust relationship for the SSL/TLS secure channel.

We're not using the latest version of this Action, but this code looks unchanged. Looks like a sporadic network failure. It's unclear where exactly the problem is, but the step could potentially be made reliable with a retry mechanism.
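As a rough illustration of the retry idea, here is a minimal sketch of retrying with exponential backoff. It is written in Python rather than the PowerShell the Action uses, and `retry`, its parameters, and the choice to catch `OSError` are all assumptions made for the example, not part of the Action.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(
    operation: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation until it succeeds or the attempts run out.

    After the n-th failure (counting from 0) it waits base_delay * 2**n
    seconds. The final failure is re-raised to the caller.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except OSError:
            # urllib.error.URLError and ssl.SSLError both derive from OSError,
            # so this covers typical transient network and TLS failures.
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

The download would be passed in as a zero-argument callable, for example a lambda wrapping `urllib.request.urlretrieve`. The `sleep` parameter is injectable only so the backoff can be tested without real waiting.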

Proposed fixes

I believe if the above issues are resolved then CI should be reliable.

indygreg commented 11 months ago

Thank you very much for doing this research and filing issues to track shoring up CI reliability!

Edward-Knight commented 11 months ago

The image caching has sped up the CI times on my fork massively, thanks for fixing those issues @indygreg!

Edward-Knight commented 9 months ago

Closing this out as CI is now nice and reliable due to the caches :)