philips-labs / terraform-aws-github-runner

Terraform module for scalable GitHub action runners on AWS
https://philips-labs.github.io/terraform-aws-github-runner/

Ephemeral Runner FIFO Queue Handling Inconsistency #4011

Open rsavage-nozominetworks opened 1 month ago

rsavage-nozominetworks commented 1 month ago

Version: 5.12.2 Lambdas: v5.12.1

Problem Description

We've been dealing with a weird issue with our self-hosted runners, particularly on one of our bigger workflows. This workflow has roughly 20 jobs that run in parallel, and it gets triggered a lot. Imagine 5 different developers triggering this workflow in quick succession...

Here's the scenario: Dev A starts the workflow, it kicks off, and all is good. Then Dev B starts their workflow while Dev A's is still running and those 20 jobs are already occupying runners; Dev B's workflow just sits there pending, waiting for runners to become available. Then Devs C, D, and E start their workflows, adding 20 more pending jobs each, and still no new runners come online. Now, if I cancel Dev D's workflow and restart it, suddenly Dev C's jobs start finding runners and running. Restart it again, and Dev D's workflow begins. Restart it once more, and finally my jobs start. It's really confusing.

More information: these are ephemeral runners on FIFO queues, no JIT, running on spot instances (we're not seeing spot capacity problems, and we have fallback to on-demand configured). The problem also only seems to happen on the ephemeral runners, not on the idle runners.
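One SQS FIFO property that seems relevant (just a theory, based on general SQS behavior rather than this module's code): messages that share a MessageGroupId are delivered strictly in order, and while one message from a group is in flight (received but not yet deleted), SQS won't hand out any later messages from that group. So if a job message ever gets stuck in flight, everything queued behind it in the same group waits until that message is deleted or its visibility timeout expires, which would look a lot like what we're seeing. Here's a minimal sketch of that head-of-line blocking against a throwaway FIFO test queue (the queue URL and group ID are placeholders, nothing from the module):

```typescript
import {
  SQSClient,
  SendMessageCommand,
  ReceiveMessageCommand,
} from "@aws-sdk/client-sqs";

// Placeholder test queue; any FIFO queue you control will do.
const queueUrl =
  "https://sqs.us-east-1.amazonaws.com/123456789012/fifo-demo.fifo";
const client = new SQSClient({ region: "us-east-1" });

async function demoHeadOfLineBlocking(): Promise<void> {
  // Enqueue three "jobs" in the same message group.
  for (const job of ["job-1", "job-2", "job-3"]) {
    await client.send(
      new SendMessageCommand({
        QueueUrl: queueUrl,
        MessageBody: job,
        MessageGroupId: "workflow-demo", // same group => strict ordering
        MessageDeduplicationId: `${job}-${Date.now()}`,
      })
    );
  }

  // Receive job-1 but do NOT delete it, so it stays in flight.
  const first = await client.send(
    new ReceiveMessageCommand({ QueueUrl: queueUrl, MaxNumberOfMessages: 1 })
  );
  console.log("in flight:", first.Messages?.[0]?.Body);

  // While job-1 is in flight, this receive returns nothing from that group:
  // job-2 and job-3 are blocked until job-1 is deleted or its visibility
  // timeout expires.
  const blocked = await client.send(
    new ReceiveMessageCommand({ QueueUrl: queueUrl, MaxNumberOfMessages: 10 })
  );
  console.log("while blocked:", blocked.Messages ?? []);
}

demoHeadOfLineBlocking().catch(console.error);
```

If that's the mechanism, then looking at the scale-up lambda's logs for failed or retried invocations around the time things get stuck would be the next step, since a message that keeps failing processing would keep blocking the messages behind it.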

Also, this is a very intermittent issue; it can take weeks before it happens again, but it will happen again. I'm just trying to track down the problem, and it's a bit challenging.

Any ideas about what's going on, or how best to track down the source of the FIFO queue misalignment?
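In the meantime, one thing I've been doing to see where the jobs pile up is snapshotting the build queue's counters while workflows are stuck (a rough sketch; the queue URL is a placeholder for our build queue, and the attribute names are standard SQS queue attributes):

```typescript
import { SQSClient, GetQueueAttributesCommand } from "@aws-sdk/client-sqs";

// Placeholder: substitute the URL of the module's FIFO build queue.
const queueUrl =
  "https://sqs.us-east-1.amazonaws.com/123456789012/gh-runners-build-queue.fifo";
const client = new SQSClient({ region: "us-east-1" });

async function snapshotQueue(): Promise<void> {
  const { Attributes } = await client.send(
    new GetQueueAttributesCommand({
      QueueUrl: queueUrl,
      AttributeNames: [
        "ApproximateNumberOfMessages", // visible, waiting for a consumer
        "ApproximateNumberOfMessagesNotVisible", // in flight with a consumer
        "ApproximateNumberOfMessagesDelayed", // still waiting out a delay
      ],
    })
  );
  console.log(new Date().toISOString(), Attributes);
}

// Poll every 30 seconds while the workflows are stuck.
setInterval(() => snapshotQueue().catch(console.error), 30_000);
```

If the visible count stays high with nothing in flight, the scale-up lambda isn't picking messages up at all; if messages sit in flight without instances appearing, it's picking them up but something after that is failing.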

rutomo-humi commented 1 month ago

We experienced a similar issue intermittently and were able to replicate it by triggering 5 workflows almost at the same time. Workflows 1 to 3 were able to start without any issues, but Workflows 4 and 5 got stuck until I started another workflow (Workflow 6), after which Workflows 4 and 5 were able to partially start.

Version: 5.7.1 Lambdas: v5.7.1

These are ephemeral runners, FIFO queues, and on-demand instances. I would also like to know how to troubleshoot this issue, or whether there's any configuration I can tweak.

I also found this old issue which is possibly related.
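In case it helps with troubleshooting, another signal worth watching is SQS's ApproximateAgeOfOldestMessage metric for the build queue, which shows how long the oldest job message has been sitting there while workflows are stuck (rough sketch; the queue name is a placeholder):

```typescript
import {
  CloudWatchClient,
  GetMetricStatisticsCommand,
} from "@aws-sdk/client-cloudwatch";

// Placeholder: the name (not URL) of your FIFO build queue.
const queueName = "gh-runners-build-queue.fifo";
const client = new CloudWatchClient({ region: "us-east-1" });

async function oldestMessageAge(): Promise<void> {
  const now = new Date();
  const { Datapoints } = await client.send(
    new GetMetricStatisticsCommand({
      Namespace: "AWS/SQS",
      MetricName: "ApproximateAgeOfOldestMessage",
      Dimensions: [{ Name: "QueueName", Value: queueName }],
      StartTime: new Date(now.getTime() - 60 * 60 * 1000), // last hour
      EndTime: now,
      Period: 300, // 5-minute buckets
      Statistics: ["Maximum"], // worst-case age per bucket, in seconds
    })
  );
  const points = (Datapoints ?? []).sort(
    (a, b) => (a.Timestamp?.getTime() ?? 0) - (b.Timestamp?.getTime() ?? 0)
  );
  for (const dp of points) {
    console.log(dp.Timestamp?.toISOString(), `${dp.Maximum}s`);
  }
}

oldestMessageAge().catch(console.error);
```

A maximum that keeps climbing while no runners come online would suggest a message stuck at the head of the queue rather than the jobs never being queued in the first place.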