oxidecomputer / buildomat

a software build labour-saving device
Mozilla Public License 2.0

Published artefacts sometimes do not match the actual artefact from the buildomat job #38

internet-diglett closed 10 months ago

internet-diglett commented 11 months ago

The dendrite-stub.tar.gz from the buildomat job has a sha256sum of 93e00c96bcf9415f2a04775326d243154eaf7f0590140cc78dad11d689ff372a.

The dendrite-stub.tar.gz published is showing a sha256sum of 8fc95653fc316f66a3468a216930823aa5827ca3583e1338b214358ac9a70993.

The second archive is a valid archive, just not the one from the job that ran from commit 9523cff22405c4ca5f4fb81e77bd4cb5dbcec111.

A subsequent commit had the correct artifact published. A commit after that was mismatched again.
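For anyone wanting to reproduce the comparison above, here is a minimal sketch of checking a downloaded archive against the digest reported by the job. The function names are hypothetical and not part of buildomat; it only assumes you have the file locally and the expected hex digest from the job output.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large archives are not loaded into memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_job_output(path, expected_hex):
    """Hypothetical helper: True if the file's digest equals the expected one."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Calling `matches_job_output("dendrite-stub.tar.gz", "93e00c96...")` with the full digest from the job would return `False` for the mismatched published archive described here.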

jclulow commented 10 months ago

I did some investigation into this late last week, and I believe the core problem here is that GitHub has started issuing more than one Check Suite for some subset of commits. This almost never used to happen, but according to the database it has been happening thousands of times a month since about September 2023, so obviously something is different now.

The upshot is that buildomat produces Check Runs for both Suites, and sometimes the Suite that GitHub chooses to display for a commit ends up being a different one to the set of jobs that won the race to publish artefacts.

I'll be working on a solution this week.

internet-diglett commented 10 months ago

Thank you for looking into this @jclulow, I know you have a lot on your plate!

jclulow commented 10 months ago

After a lot of debugging, staring into database records and web hook delivery records, and poking at the GitHub API, I believe I have figured out what is causing most of these issues. Due to some gross architectural misstep, GitHub will send at least some check_suite/requested web hook events to a GitHub application even though those events are for other GitHub applications.

It seems like we've added some new automation at Oxide in the last couple of months, so we're now routinely getting check suite requests for the Oxide cio-bot as well as our own (for buildomat). They seem to duplicate, if not every check suite we'd otherwise expect, then at least most of the suites for important repositories like oxidecomputer/omicron.

I have added, in c68598c379974579687209ea11d997044a4c69e9, a check to make sure that when we process these requests, we're ignoring requests that should have been routed to applications other than our own. This has already fired at least five times since I've been watching it, so I'm reasonably confident that it's what's been happening.
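The kind of check described above can be sketched roughly as follows. This is not the code from the referenced commit; it is an illustration that assumes the web hook payload carries the owning application under `check_suite.app.id`, and `OUR_APP_ID` and `should_process` are hypothetical names with a placeholder value.

```python
# Placeholder: the numeric ID of our own GitHub App (hypothetical value).
OUR_APP_ID = 12345


def should_process(event_name, payload, our_app_id=OUR_APP_ID):
    """Return True only for check suite requests addressed to our own app.

    Assumes the payload shape {"action": ..., "check_suite": {"app": {"id": ...}}}.
    """
    if event_name != "check_suite":
        return False
    if payload.get("action") not in ("requested", "rerequested"):
        return False
    # Missing or null fields are treated as "not ours" rather than raising.
    app = (payload.get("check_suite") or {}).get("app") or {}
    return app.get("id") == our_app_id
```

With a filter like this in front of job creation, a suite requested for another application (for example, one whose `app.id` belongs to cio-bot) would be dropped before any Check Runs are produced for it, so it could no longer race the real suite to publish artefacts.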