m4rs-mt / ILGPU

ILGPU JIT Compiler for high-performance .Net GPU programs
http://www.ilgpu.net

CI incorrectly using self-hosted runner #952

Closed MoFtZ closed 1 year ago

MoFtZ commented 1 year ago

The 'Update Copyright Year' workflow fails intermittently when installing the .NET SDK: https://github.com/m4rs-mt/ILGPU/actions/runs/4258018880/jobs/7408784738

From the logs, it looks like the job ran on the self-hosted Cuda runner rather than on a GitHub-hosted runner.

@pavlovic-ivan @m4rs-mt Does anyone know how to prevent a self-hosted runner from being selected?

pavlovic-ivan commented 1 year ago

Hi @MoFtZ. This should of course not happen. Self-hosted runners should pick up only jobs with "cuda" in the name, and I don't see "cuda" in the name of any job in the 'Update Copyright Year' workflow. How often does this happen? Have you found any pattern that reproduces the issue?

MoFtZ commented 1 year ago

Runner selection is based on the runs-on setting in the workflow. We are able to explicitly choose the Cuda runners by using "cuda" in the name, so that's not the issue.

The issue appears to be that we specify ubuntu-latest for the GitHub runners. My understanding is that, if a Cuda runner is active and available, it is allowed to be chosen, since it also satisfies the ubuntu-latest requirements.

This is probably the second or third time I have seen the failure to install the .NET SDK over the last few days. It was not just the Update Copyright Year workflow; I thought I saw it on the main CI workflow too. I originally thought it was an intermittent issue with GitHub, so I just re-ran the step. But since I noticed the Update Copyright Year workflow using the Cuda runner, I suspect the previous failures had the same cause.

pavlovic-ivan commented 1 year ago

@MoFtZ All right, I'll try to reproduce the issue and see what the best way to fix it is. Thanks for reporting the issue!

MoFtZ commented 1 year ago

Here is another instance where the self-hosted Cuda runner was used: https://github.com/m4rs-mt/ILGPU/actions/runs/4199363759/jobs/7284136034

MoFtZ commented 1 year ago

Manually triggered action incorrectly picked self-hosted runner: https://github.com/m4rs-mt/ILGPU/actions/runs/4346797745

MoFtZ commented 1 year ago

This PR looks like it would fix the issue.

MoFtZ commented 1 year ago

Another instance - Copyright Year workflow used self-hosted runner. https://github.com/m4rs-mt/ILGPU/actions/runs/4379977082/jobs/7666513727

pavlovic-ivan commented 1 year ago

Hi @MoFtZ. Thanks for the additional info. I'll see what I can do.

pavlovic-ivan commented 1 year ago

Hi @MoFtZ. I am still investigating this issue. Sadly, I can't reproduce it. Maybe you can give me a step-by-step guide on how to reproduce the issue.

As for the runs-on labels, I've taken another look. When self-hosted runners are deployed, they are labeled with a set of labels such as self-hosted, Linux, X64, cuda-4574963232-3-ILGPU-net6.0. The last one differs per runner because it is set dynamically from the job ID. We don't label the runners with ubuntu-latest, so I am still struggling to understand how self-hosted runners would steal jobs from other workflows.
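
As a rough illustration of the label reasoning above, here is a minimal Python sketch of how a job's runs-on labels could be checked against a runner's labels. It is a simplified model, not GitHub's actual scheduler: the function name runner_matches is hypothetical, and the sketch assumes that a runner qualifies only when it carries every requested label, that labels compare case-insensitively, and that no label acts as an alias for another.

```python
from typing import Iterable


def runner_matches(runs_on: Iterable[str], runner_labels: Iterable[str]) -> bool:
    """Return True if the runner carries every label the job asks for.

    Simplified, hypothetical model: labels are compared case-insensitively
    and no label is treated as an alias for any other label.
    """
    wanted = {label.lower() for label in runs_on}
    offered = {label.lower() for label in runner_labels}
    return wanted <= offered  # subset test: every wanted label is offered


# Labels quoted above for one self-hosted Cuda runner.
cuda_runner = ["self-hosted", "Linux", "X64", "cuda-4574963232-3-ILGPU-net6.0"]

# A job asking for ubuntu-latest should not match under this model.
print(runner_matches(["ubuntu-latest"], cuda_runner))

# Jobs asking for the generic self-hosted labels, or the cuda label, do match.
print(runner_matches(["self-hosted", "Linux", "X64"], cuda_runner))
print(runner_matches(["cuda-4574963232-3-ILGPU-net6.0"], cuda_runner))
```

Under these assumptions a job with runs-on: ubuntu-latest could never land on the Cuda runner, which is why the observed behavior is puzzling.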

MoFtZ commented 1 year ago

Unfortunately, I do not know how to reproduce the issue. It appears to happen when there are self-hosted runners available, and a scheduled build occurs.

It failed again today: https://github.com/m4rs-mt/ILGPU/actions/runs/4570179246

My initial thought was that ubuntu-latest was acting as an alias for Linux and x64, but I have no proof of that.

pavlovic-ivan commented 1 year ago

All right, thanks. I did some testing, and it looks like ubuntu-latest is not an alias for Linux and x64. I tested it, and the job didn't run. I'll continue investigating.

MoFtZ commented 1 year ago

We might need to raise a defect with GitHub. We know it happens, but the rules for picking a runner are unknown to us.

pavlovic-ivan commented 1 year ago

Hi @MoFtZ @m4rs-mt . I think this is now fixed, correct?

m4rs-mt commented 1 year ago

I think the problem is now fixed.