About GitHub runners

efaulhaber commented 3 years ago

I think we all agree that our GitHub Actions runs are taking a long time. One contributing factor is that we hit the GitHub runner concurrency limit when multiple actions run simultaneously. @ranocha mitigated this by cancelling previous actions of the same PR when a new commit triggers another action, but the problem still occurs when different PRs trigger actions at the same time. We will surely add more tests in the future, and to keep CI from running longer, we'd need to split them into even more parallel jobs than we already have, which will then block more GitHub runners at a time.
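To make the mitigation concrete, here is a minimal Python sketch of the selection logic only: for each PR, every run except the newest one is a candidate for cancellation. The `Run` type and its fields are hypothetical names for illustration, and this is not necessarily how the workflow implements the cancellation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    run_id: int      # hypothetical identifier of a CI run
    pr: int          # pull request the run belongs to
    created_at: int  # any monotonically increasing timestamp


def superseded_runs(runs):
    """Return the runs that have a newer run for the same PR."""
    newest = {}
    for run in runs:
        current = newest.get(run.pr)
        if current is None or run.created_at > current.created_at:
            newest[run.pr] = run
    return [run for run in runs if newest[run.pr].run_id != run.run_id]


runs = [Run(1, pr=10, created_at=100),
        Run(2, pr=10, created_at=200),
        Run(3, pr=11, created_at=150)]
cancelled = [run.run_id for run in superseded_runs(runs)]
```

Runs 2 and 3 belong to different PRs, so both keep running and still compete for runners, which is the remaining problem described above.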

Our current limit of concurrent runners is 20. We could raise this to 40, 60 or even 180 by upgrading to a paid plan. However, the prices are calculated on a per-user basis, so every new member of the organization would cost more. I'm not sure what features come with being a member of the trixi-framework organization. IIRC, non-members can't request reviews from the members, but I didn't notice other features (remember that our repositories are open source; membership surely matters more for private repos). Maybe it would be an option to not include students in the organization.
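To get a feeling for what a higher limit would buy us, here is a rough Python sketch that simulates a first-come, first-served job queue under a concurrency limit. The job count and durations are made-up numbers, not measurements from our CI, and real scheduling may differ.

```python
import heapq


def wall_time(job_minutes, limit):
    """Minutes until all jobs finish when at most `limit` jobs run at once."""
    running = []  # min-heap of finish times of the jobs occupying a runner
    for minutes in job_minutes:
        # If all runners are busy, wait for the earliest one to become free.
        start = heapq.heappop(running) if len(running) >= limit else 0
        heapq.heappush(running, start + minutes)
    return max(running, default=0)


# Hypothetical load: 50 jobs of 30 minutes each, e.g. several PRs at once.
jobs = [30] * 50
times = {limit: wall_time(jobs, limit) for limit in (20, 40, 60)}
```

With these assumed numbers the same load takes 90, 60 and 30 minutes for limits of 20, 40 and 60.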

Another option that might make more sense is to use self-hosted runners. I can see a few advantages of these compared to GitHub's runners:

Full control over hardware. We can also make runners with more than two cores, which will probably be necessary in the future when full MPI support is added to Trixi.
We don't have to install Julia and all packages from scratch in every single run. We don't have to run jobs in Docker containers like GitHub's runners do, so we can preinstall the Julia versions we're interested in and install all necessary packages. I don't know how much of the time spent in the jobs is actually due to the setup of Julia. Has anyone ever timed this?
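One way to time this would be to use the per-step timestamps that the GitHub REST API reports for each job (the `steps` list with `started_at` and `completed_at`). The Python sketch below computes the share of a job spent in setup steps. The step names and timings are invented for illustration, and which steps count as "setup" is an assumption you would have to adapt.

```python
from datetime import datetime


def _parse(timestamp):
    # Accept a trailing "Z" as UTC, which older fromisoformat versions reject.
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def setup_share(steps, setup_names):
    """Fraction of the total step time spent in steps listed in `setup_names`."""
    setup = total = 0.0
    for step in steps:
        if not step.get("started_at") or not step.get("completed_at"):
            continue  # skipped or still running steps carry no timing
        seconds = (_parse(step["completed_at"])
                   - _parse(step["started_at"])).total_seconds()
        total += seconds
        if step["name"] in setup_names:
            setup += seconds
    return setup / total if total else 0.0


# Hypothetical job: 60 s Julia setup, 30 s package build, 270 s tests.
steps = [
    {"name": "Set up Julia", "started_at": "2021-07-30T10:00:00Z",
     "completed_at": "2021-07-30T10:01:00Z"},
    {"name": "Build package", "started_at": "2021-07-30T10:01:00Z",
     "completed_at": "2021-07-30T10:01:30Z"},
    {"name": "Run tests", "started_at": "2021-07-30T10:01:30Z",
     "completed_at": "2021-07-30T10:06:00Z"},
]
share = setup_share(steps, {"Set up Julia", "Build package"})
```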

Disadvantages I see:

Need to be hosted ourselves, duh!
Need to be updated regularly (especially Julia and packages).
Probably can't run macOS (and Windows?).

Anyway, this is supposed to be a discussion issue, so discuss!

ranocha commented 3 years ago

One aspect I would like to discuss is whether we really need as many tests on three operating systems as we have now. From my point of view, it would be great if we could figure out a minimal subset of tests that is considered to be "broad enough" to cover things that might go wrong when running on different OS. This could be a selected collection of 2D (1D?, 3D?) tests using different meshes and our binary dependencies (P4estMesh). If we can figure out such a subset that does not take too long to run, we could chop off quite a few CI jobs.

Xref https://github.com/trixi-framework/Trixi.jl/issues/372#issuecomment-888966824

sloede commented 3 years ago

@efaulhaber Thank you for this suggestion/discussion starter. At the moment, our organization is on a "Team" plan, so we should already have up to 60 concurrent jobs (please correct me if this is wrong).

I have thought about adding self-hosted runners multiple times myself. However, there are a few issues related to that, the biggest one being the non-negligible management overhead this creates for one or more people in the Trixi team. In addition, when not using Docker you'd probably need some other form of virtualization, as otherwise you can have only a single runner per machine (which is very inefficient). Also, when not using disposable containers, there might be security issues when other people's (arbitrary) code runs on a machine.

Having said this, I still think it would be a good and interesting option to have self-hosted runners. Maybe we can find a friendly university computing center who'd be interested in doing a pilot program with us to support research software development?

sloede commented 3 years ago

> From my point of view, it would be great if we could figure out a minimal subset of tests that is considered to be "broad enough" to cover things that might go wrong when running on different OS.

I am absolutely open to this idea and think it's a good suggestion. To me, it's mostly a matter of developer time resources. One would have to (at least)

properly define this minimal subset
adapt the testing setup accordingly
adapt the CI setup accordingly
write some evaluation criteria for newly added tests to decide where they should go, and
update the docs

If someone is willing to put in the time for this (or if we collectively decide that we should make this a priority and all put in the time), I'm game. IMHO we should also consider if we can save additional time by running different sets of tests depending on whether a PR is marked as draft or not. E.g., only run Windows/macOS tests once a PR is marked ready for review to make draft PRs finish faster.
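To illustrate how the two ideas could combine, here is a small Python sketch of a job plan: the full suite runs on Linux only, and the minimal subset runs on Windows/macOS once the PR is no longer a draft. All test group names and OS labels are placeholders, since the actual minimal subset would still have to be defined.

```python
# Hypothetical test groups; the real split would follow our test setup.
FULL_SUITE = ["1d", "2d_part1", "2d_part2", "3d_part1", "3d_part2", "p4est"]
MINIMAL_SUBSET = ["2d_part1", "p4est"]


def plan_jobs(is_draft):
    """Return the (os, test group) pairs to run for a PR."""
    jobs = [("ubuntu", group) for group in FULL_SUITE]
    if not is_draft:
        # Cross-platform checks only once the PR is ready for review.
        for os_name in ("windows", "macos"):
            jobs += [(os_name, group) for group in MINIMAL_SUBSET]
    return jobs


draft_jobs = plan_jobs(is_draft=True)
ready_jobs = plan_jobs(is_draft=False)
```

With these placeholder groups, a draft PR occupies 6 runners and a PR that is ready for review occupies 10, compared to 18 if the full suite ran on all three systems.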

efaulhaber commented 3 years ago

> At the moment, our organization is on a "Team" plan, so we should already have up to 60 concurrent jobs

Are you sure? I counted the running jobs when some were queued and I only counted 20.

ranocha commented 3 years ago

> run Windows/macOS tests once a PR is marked ready for review to make draft PRs finish faster

Sounds like a good idea :+1:

sloede commented 3 years ago

> > At the moment, our organization is on a "Team" plan, so we should already have up to 60 concurrent jobs
>
> Are you sure? I counted the running jobs when some were queued and I only counted 20.

When you encounter queued jobs again, could you please count again to make sure it's limited to 20? If yes, I will ask GitHub support about it.
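The count could also be automated. The Python sketch below tallies jobs by their `status` field, which the GitHub REST API reports as `queued`, `in_progress` or `completed`. It takes already-fetched job records as input, and the sample data is invented. The number of running jobs only tells us something about the limit while other jobs are waiting in the queue.

```python
from collections import Counter


def count_by_status(jobs):
    """Tally job records (dicts with a "status" key) by status."""
    return Counter(job["status"] for job in jobs)


def observed_limit(jobs):
    """Running jobs while others are queued, or None if nothing is queued."""
    counts = count_by_status(jobs)
    return counts["in_progress"] if counts["queued"] > 0 else None


# Hypothetical snapshot: 20 running jobs and 7 queued ones.
snapshot = ([{"status": "in_progress"}] * 20
            + [{"status": "queued"}] * 7
            + [{"status": "completed"}] * 3)
limit = observed_limit(snapshot)
```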

trixi-framework / Trixi.jl

About GitHub runners #753