philips-labs / terraform-aws-github-runner

Terraform module for scalable GitHub action runners on AWS
https://philips-labs.github.io/terraform-aws-github-runner/

Ephemeral Runner FIFO Queue Handling Inconsistency #4011

Open rsavage-nozominetworks opened 1 month ago

rsavage-nozominetworks commented 1 month ago

Version: 5.12.2 Lambdas: v5.12.1

Problem Description

We've been dealing with a weird issue with our self-hosted runners, particularly on one of our bigger workflows. This workflow has roughly 20 jobs that run in parallel, and it gets triggered a lot. Imagine 5 different developers triggering this workflow in quick succession...

Here's the scenario: Dev A starts the workflow, it kicks off, and all is good. Then Dev B starts their workflow while Dev A's is still running and those 20 jobs are already occupying runners; Dev B's workflow just sits there pending, waiting for runners to become available. Then Devs C, D, and E start their workflows, adding 20 more pending jobs each, and still no new runners come online. Now, if I cancel Dev D's workflow and restart it, suddenly Dev C's jobs start finding runners and running. Restart it again, and Dev D's workflow begins. Restart it once more, and finally my jobs start. It's really confusing.

More information: these are ephemeral runners on FIFO queues, no JIT, running on spot instances (we're not seeing spot capacity problems, and we have fallback to on-demand configured). The problem also only seems to happen on the ephemeral runners, not on the idle runners.
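One SQS FIFO property that seems relevant (just a theory, based on general SQS behavior rather than this module's code): messages that share a MessageGroupId are delivered strictly in order, and while one message from a group is in flight (received but not yet deleted), SQS won't hand out any later messages from that group. So if a job message ever gets stuck in flight, everything queued behind it in the same group waits until that message is deleted or its visibility timeout expires, which would look a lot like what we're seeing. Here's a minimal sketch of that head-of-line blocking against a throwaway FIFO test queue (the queue URL and group ID are placeholders, nothing from the module):

```typescript
import {
  SQSClient,
  SendMessageCommand,
  ReceiveMessageCommand,
} from "@aws-sdk/client-sqs";

// Placeholder test queue; any FIFO queue you control will do.
const queueUrl =
  "https://sqs.us-east-1.amazonaws.com/123456789012/fifo-demo.fifo";
const client = new SQSClient({ region: "us-east-1" });

async function demoHeadOfLineBlocking(): Promise<void> {
  // Enqueue three "jobs" in the same message group.
  for (const job of ["job-1", "job-2", "job-3"]) {
    await client.send(
      new SendMessageCommand({
        QueueUrl: queueUrl,
        MessageBody: job,
        MessageGroupId: "workflow-demo", // same group => strict ordering
        MessageDeduplicationId: `${job}-${Date.now()}`,
      })
    );
  }

  // Receive job-1 but do NOT delete it, so it stays in flight.
  const first = await client.send(
    new ReceiveMessageCommand({ QueueUrl: queueUrl, MaxNumberOfMessages: 1 })
  );
  console.log("in flight:", first.Messages?.[0]?.Body);

  // While job-1 is in flight, this receive returns nothing from that group:
  // job-2 and job-3 are blocked until job-1 is deleted or its visibility
  // timeout expires.
  const blocked = await client.send(
    new ReceiveMessageCommand({ QueueUrl: queueUrl, MaxNumberOfMessages: 10 })
  );
  console.log("while blocked:", blocked.Messages ?? []);
}

demoHeadOfLineBlocking().catch(console.error);
```

If that's the mechanism, then looking at the scale-up lambda's logs for failed or retried invocations around the time things get stuck would be the next step, since a message that keeps failing processing would keep blocking the messages behind it.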

Also, this is a very intermittent issue; it can take weeks before it happens again, but it will happen again. I'm just trying to track down the problem, and it's a bit challenging.

Any ideas about what's going on, or how best to track down the source of the FIFO queue misalignment?
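In the meantime, one thing I've been doing to see where the jobs pile up is snapshotting the build queue's counters while workflows are stuck (a rough sketch; the queue URL is a placeholder for our build queue, and the attribute names are standard SQS queue attributes):

```typescript
import { SQSClient, GetQueueAttributesCommand } from "@aws-sdk/client-sqs";

// Placeholder: substitute the URL of the module's FIFO build queue.
const queueUrl =
  "https://sqs.us-east-1.amazonaws.com/123456789012/gh-runners-build-queue.fifo";
const client = new SQSClient({ region: "us-east-1" });

async function snapshotQueue(): Promise<void> {
  const { Attributes } = await client.send(
    new GetQueueAttributesCommand({
      QueueUrl: queueUrl,
      AttributeNames: [
        "ApproximateNumberOfMessages", // visible, waiting for a consumer
        "ApproximateNumberOfMessagesNotVisible", // in flight with a consumer
        "ApproximateNumberOfMessagesDelayed", // still waiting out a delay
      ],
    })
  );
  console.log(new Date().toISOString(), Attributes);
}

// Poll every 30 seconds while the workflows are stuck.
setInterval(() => snapshotQueue().catch(console.error), 30_000);
```

If the visible count stays high with nothing in flight, the scale-up lambda isn't picking messages up at all; if messages sit in flight without instances appearing, it's picking them up but something after that is failing.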

rutomo-humi commented 1 month ago

We experienced a similar issue intermittently and were able to replicate it by triggering 5 workflows almost at the same time. Workflows 1 to 3 were able to start without any issues, but Workflows 4 and 5 got stuck until I started another workflow (Workflow 6), after which Workflows 4 and 5 were able to partially start.

Version: 5.7.1 Lambdas: v5.7.1

These are ephemeral runners, FIFO queues, and on-demand instances. I would also like to know how to troubleshoot this issue, or whether there's any configuration I can tweak.

I also found this old issue which is possibly related.
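In case it helps with troubleshooting, another signal worth watching is SQS's ApproximateAgeOfOldestMessage metric for the build queue, which shows how long the oldest job message has been sitting there while workflows are stuck (rough sketch; the queue name is a placeholder):

```typescript
import {
  CloudWatchClient,
  GetMetricStatisticsCommand,
} from "@aws-sdk/client-cloudwatch";

// Placeholder: the name (not URL) of your FIFO build queue.
const queueName = "gh-runners-build-queue.fifo";
const client = new CloudWatchClient({ region: "us-east-1" });

async function oldestMessageAge(): Promise<void> {
  const now = new Date();
  const { Datapoints } = await client.send(
    new GetMetricStatisticsCommand({
      Namespace: "AWS/SQS",
      MetricName: "ApproximateAgeOfOldestMessage",
      Dimensions: [{ Name: "QueueName", Value: queueName }],
      StartTime: new Date(now.getTime() - 60 * 60 * 1000), // last hour
      EndTime: now,
      Period: 300, // 5-minute buckets
      Statistics: ["Maximum"], // worst-case age per bucket, in seconds
    })
  );
  const points = (Datapoints ?? []).sort(
    (a, b) => (a.Timestamp?.getTime() ?? 0) - (b.Timestamp?.getTime() ?? 0)
  );
  for (const dp of points) {
    console.log(dp.Timestamp?.toISOString(), `${dp.Maximum}s`);
  }
}

oldestMessageAge().catch(console.error);
```

A maximum that keeps climbing while no runners come online would suggest a message stuck at the head of the queue rather than the jobs never being queued in the first place.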