Closed MoFtZ closed 1 year ago
Hi @MoFtZ. This of course should not happen. Self-hosted runners should pick up only jobs with "cuda" in the name. I don't see "cuda" in the name of the job(s) in the 'Update Copyright Year' workflow. How often does this happen? Did you find any kind of rule that would reproduce the issue?
Runner selection is based on the runs-on setting in the workflow. We are able to explicitly choose the Cuda runners by using "cuda" in the name, so that's not the issue.
The issue appears to be that we specify ubuntu-latest for the GitHub runners. My understanding is that, if a Cuda runner is active and available, it is allowed to be chosen, since it also satisfies the ubuntu-latest requirements.
This is probably the second or third time I have seen the failure to install the .NET SDK, over the last few days. It was not just Update Copyright Year workflow. I thought I saw it on the main CI workflow too. I originally thought it was an intermittent issue with GitHub, so I just re-ran the step. But since I noticed the Update Copyright Year using the Cuda runner, I suspect the previous failure is the same.
@MoFtZ All right, I'll try to reproduce the issue and see what the best way to fix it is. Thanks for reporting the issue!
Here is another instance where the self-hosted Cuda runner was used: https://github.com/m4rs-mt/ILGPU/actions/runs/4199363759/jobs/7284136034
Manually triggered action incorrectly picked self-hosted runner: https://github.com/m4rs-mt/ILGPU/actions/runs/4346797745
Another instance - Copyright Year workflow used self-hosted runner. https://github.com/m4rs-mt/ILGPU/actions/runs/4379977082/jobs/7666513727
Hi @MoFtZ. Thanks for the additional info. I'll see what I can do.
Hi @MoFtZ. I am still investigating this issue. Sadly, I can't reproduce it. Maybe you can give me a step-by-step guide on how to reproduce the issue?
When it comes to the runs-on labels, I've taken another look. When self-hosted runners are deployed, they are labeled with a set of labels such as self-hosted, Linux, X64, cuda-4574963232-3-ILGPU-net6.0, or similar. The last one will of course differ, since it is set dynamically from the ID of the job. We don't label the runners with ubuntu-latest. So I am still struggling to understand how self-hosted runners would steal jobs from other workflows.
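The expectation described above can be sketched as a simple subset check: a runner is eligible for a job only if it carries every label listed in runs-on. This is a simplified model, not GitHub's actual scheduler; the function name runner_matches is hypothetical, and the case-insensitive comparison is an assumption.

```python
def runner_matches(runs_on, runner_labels):
    """Return True if the runner carries every label the job requests.

    Simplified model of label matching (an assumption, not GitHub's code):
    runs_on may be a single label or a list of labels, and labels are
    compared case-insensitively.
    """
    requested = {runs_on} if isinstance(runs_on, str) else set(runs_on)
    requested = {label.lower() for label in requested}
    available = {label.lower() for label in runner_labels}
    return requested <= available


# Labels quoted in the comment above for one self-hosted Cuda runner.
cuda_runner = ["self-hosted", "Linux", "X64", "cuda-4574963232-3-ILGPU-net6.0"]

print(runner_matches("ubuntu-latest", cuda_runner))                  # False
print(runner_matches(["self-hosted", "linux", "x64"], cuda_runner))  # True
```

Under this model, a job with runs-on: ubuntu-latest should never land on the Cuda runner, which is why the observed behavior is surprising.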
Unfortunately, I do not know how to reproduce the issue. It appears to happen when there are self-hosted runners available, and a scheduled build occurs.
It failed again today: https://github.com/m4rs-mt/ILGPU/actions/runs/4570179246
My initial thought was that ubuntu-latest was acting as an alias for Linux and x64, but I have no proof of that.
All right, thanks. I did some testing, and it looks like ubuntu-latest is not an alias for Linux and x64. I've tested it, and the job didn't run. I'll continue
We might need to raise a defect with GitHub. We know it happens, but the rules for picking a runner are unknown to us.
Hi @MoFtZ @m4rs-mt. I think this is now fixed, correct?
I think the problem is now fixed.
The 'Update Copyright Year' workflow fails intermittently when installing the .NET SDK: https://github.com/m4rs-mt/ILGPU/actions/runs/4258018880/jobs/7408784738
From the logs, it looks like it was run on the self-hosted Cuda runner, rather than a GitHub-hosted runner.
@pavlovic-ivan @m4rs-mt Does anyone know how to prevent a self-hosted runner from being selected?
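As a rough aid for confirming this from data rather than by reading logs, the sketch below scans the jobs of a workflow run and flags any job that did not request the self-hosted label but still ran on a runner that looks self-hosted. It assumes the payload has the shape returned by the GitHub REST endpoint that lists jobs for a workflow run (a jobs list whose entries carry name, labels, and runner_name). The function name, the sample payload, and the "cuda" name check are hypothetical and would need adapting to how the runners are actually named.

```python
def jobs_on_unrequested_self_hosted(payload, is_self_hosted_name):
    """List jobs that ran on a self-hosted runner without asking for one.

    payload: parsed JSON in the shape of the "list jobs for a workflow run"
             REST response (assumed fields: jobs[].name, .labels, .runner_name).
    is_self_hosted_name: caller-supplied predicate on the runner name,
             since runner naming is specific to each deployment.
    """
    flagged = []
    for job in payload.get("jobs", []):
        requested = [label.lower() for label in (job.get("labels") or [])]
        runner = job.get("runner_name") or ""
        if "self-hosted" not in requested and is_self_hosted_name(runner):
            flagged.append((job.get("name"), runner, job.get("labels")))
    return flagged


# Hypothetical payload for illustration only; not taken from a real run.
sample = {
    "jobs": [
        {"name": "update-copyright", "labels": ["ubuntu-latest"],
         "runner_name": "cuda-runner-1"},
        {"name": "build", "labels": ["ubuntu-latest"],
         "runner_name": "hosted-runner-7"},
    ]
}

# Assumption: self-hosted runner names contain "cuda".
suspicious = jobs_on_unrequested_self_hosted(sample, lambda n: "cuda" in n.lower())
print(suspicious)
```

Running this against the failing runs linked in this thread would show whether every failure coincides with a job that requested ubuntu-latest but was executed by a Cuda runner.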