When starting lots of jobs in parallel with a GitHub Actions strategy matrix, 30-50% of them are not picked up

almostintuitive commented 1 year ago

Hi!

We're now using the library in production, and it has been extremely useful for us! (Our config is very simple: --max-runners 40, recycling on.) The only problem we're facing is that when we start, let's say, 5 or 10 jobs in parallel, 30-50% of them are marked as "failed job".

I looked through the logs, but the only error message I can find is this one:

06:24:20 scale_down ERROR ❌ APIException: cannot perform operation because server is locked

I'm now looking in more depth at whether it's some kind of instance creation timeout, and I'm keeping the page open to see what happens before GitHub declares a job a "failed job". Hopefully I'll have an update soon!

almostintuitive commented 1 year ago

One level up we see this message:

almostintuitive commented 1 year ago

Unfortunately, it looks like the problem is not that jobs aren't being picked up, but that Hetzner is killing our workflows during runtime, so it has nothing to do with this library! Sorry.

vzakaznikov commented 1 year ago

Ok, let me know if anything comes up. I would be happy to help.

almostintuitive commented 1 year ago

thanks!:) actually it resolved automagically...

testflows / TestFlows-GitHub-Hetzner-Runners, issue #9