ublue-os / main

OCI base images of Fedora with batteries included
https://universal-blue.org/images/main/
Apache License 2.0

Automated retries? #502

Closed dylanmtaylor closed 2 months ago

dylanmtaylor commented 7 months ago

I see that build actions sometimes fail. I think we should leverage the retry action on ublue builds with an attempt limit of 3. https://github.com/marketplace/actions/retry-action

That way, if it's a weird network issue or something, we won't have a day without a new image.

bsherman commented 7 months ago

I've added requested changes to the associated PR ( #503 ), but I'll add some thoughts here for good measure.

@dylanmtaylor is very much correct that we sometimes have spurious failures (various causes which likely include network issues), and those result in the team needing to manually retry some runs of the workflow. As an example, a spurious failure will usually result in success for most of the matrix options, but one or two will fail.

I do agree that automatically retrying certain steps of the workflow will be helpful.

I identified the two most useful in my requested changes:

  1. Get current version was recently discovered to have intermittent but SILENT failures, so I fixed it to halt workflow execution rather than produce images with incomplete metadata. This is a simple step and a great candidate for automatic retry. In every failure case I've seen, the cause was clearly an intermittent issue with the network or on the GHCR servers.
  2. Push to GHCR is also a great candidate: if we reach this step, all that remains is to publish the cleanly built image. It does fail on occasion, though, and the fault is always some issue with the network or on the GHCR server side. Our very large image sizes likely don't help us here.
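
To make the policy in the two items above concrete, here is a minimal Python sketch. It is not the actual workflow (which is GitHub Actions YAML), and the names `run_step`, `RETRYABLE_STEPS`, and `MAX_ATTEMPTS` are hypothetical. It assumes the attempt limit of 3 proposed at the top of this issue and retries only the steps listed above.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Hypothetical policy table: only the two steps named above are retried.
RETRYABLE_STEPS = {"Get current version", "Push to GHCR"}
MAX_ATTEMPTS = 3  # attempt limit suggested in the original proposal


def run_step(name: str, step: Callable[[], T], delay_seconds: float = 0.0) -> T:
    """Run a step, retrying it only if its name is in RETRYABLE_STEPS."""
    attempts = MAX_ATTEMPTS if name in RETRYABLE_STEPS else 1
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: fail loudly, never silently
            time.sleep(delay_seconds)  # brief pause before the next attempt
    raise AssertionError("unreachable")
```

Any step not in the table gets exactly one attempt, so a failure in that step surfaces immediately instead of being masked by retries.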

I've specifically requested that we do NOT auto-retry the most complex step, Build Image. The most common causes of failure here are legitimate, usually due to an upstream RPM dependency issue. The one spurious issue I do know of in Build Image is related to the github-release-install.sh shell script, which helps us install RPM packages directly from a project's GitHub release. Here I'd like to see the shell script itself improved to handle those failures and retry internally. I've already made one such attempt, with only partial success.
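
As a rough illustration of what "retry internally" could look like, here is a Python sketch of a download helper with exponential backoff. It does not reflect the contents of github-release-install.sh; `fetch_with_retry`, `TransientError`, and the injected `fetch` callable are all hypothetical names. The assumption is that only transient failures (for example, a network timeout) are retried, while any other error propagates right away.

```python
import time
from typing import Callable


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a network timeout)."""


def fetch_with_retry(
    fetch: Callable[[str], bytes],
    url: str,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call fetch(url), retrying transient failures with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ... by default
    raise ValueError("max_attempts must be at least 1")
```

Passing `fetch` and `sleep` in as parameters keeps the retry logic testable without touching the network or actually waiting.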

In addition to all this, I'd really like to see these improvements in ublue-os/main... but I hesitate to implement in the 6 other "foundational"/"hardware enablement" repos we maintain. We've already had some discussions on merging and cleaning them up as it's currently very messy to maintain them all as distinct repos.

Hope that provides some context to any reader regarding my views on this topic.

bsherman commented 2 months ago

Actually, I think we should close this as "done" since we merged the PR at the top and have continued to add appropriate retry logic in various places throughout the project.