trixi-framework / Trixi.jl

Trixi.jl: Adaptive high-order numerical simulations of conservation laws in Julia
https://trixi-framework.github.io/Trixi.jl
MIT License

About GitHub runners #753

Open efaulhaber opened 3 years ago

efaulhaber commented 3 years ago

I think we all agree that our GitHub Actions runs are taking a long time. One contributing factor is that we hit the GitHub runner concurrency limit when multiple workflows run simultaneously. @ranocha mitigated this problem by cancelling previous runs of the same PR when a new commit triggers another run, but it still occurs when runs are triggered by different PRs at the same time. We will surely add more tests in the future, and to keep CI from running even longer, we'd need to split them into even more parallel jobs than we already have, which will then block more GitHub runners at a time.
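To get a rough feel for how the concurrency limit stretches CI wall-clock time, here is a minimal Python sketch that greedily assigns jobs to a fixed pool of runners. The function name and the job durations are made up for illustration, and the greedy model is an assumption, not a description of how GitHub actually schedules jobs.

```python
import heapq


def ci_makespan(job_minutes, runner_limit):
    """Estimate wall-clock minutes to finish all jobs on a fixed pool of runners.

    Jobs are started in the given order; each one goes to the runner that
    becomes free first (a simple greedy model, not GitHub's actual scheduler).
    """
    if runner_limit < 1:
        raise ValueError("runner_limit must be at least 1")
    # Min-heap holding the time at which each runner becomes free.
    free_at = [0.0] * runner_limit
    heapq.heapify(free_at)
    finish = 0.0
    for minutes in job_minutes:
        start = heapq.heappop(free_at)  # earliest available runner
        end = start + minutes
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return finish


# Hypothetical workload: two PRs, each triggering 15 jobs of 30 minutes.
jobs = [30] * 30
print(ci_makespan(jobs, 20))  # 20 jobs run at once, the other 10 have to wait
print(ci_makespan(jobs, 40))  # all 30 jobs run in parallel
```

In this toy model, the second batch of jobs can only start once runners from the first batch are free, so the total time doubles compared to the case with enough runners.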

Our current limit of concurrent runners is 20. We could raise this to 40, 60 or even 180 by upgrading to a paid plan. However, the prices are calculated on a per-user basis, so every new member of the organization would cost more. I'm not sure what features come with being a member of the trixi-framework organization. IIRC, non-members can't request reviews from members, but I didn't notice other features (remember that our repositories are open source; membership surely matters more for private repos). Maybe it would be an option to not include students in the organization.

Another option that might make more sense is self-hosted runners. I can see a few advantages of these compared to GitHub's runners:

Disadvantages I see:

Anyway, this is supposed to be a discussion issue, so discuss!

ranocha commented 3 years ago

One aspect I would like to discuss is whether we really need as many tests on three OS as we have now. From my point of view, it would be great if we could figure out a minimal subset of tests that is considered to be "broad enough" to cover things that might go wrong when running on different OS. This could be a selected collection of 2D (1D?, 3D?) tests using different meshes and our binary dependencies (P4estMesh). If we can figure out such a subset that does not take too long to run, we could chop off quite a few CI jobs.

Xref https://github.com/trixi-framework/Trixi.jl/issues/372#issuecomment-888966824

sloede commented 3 years ago

@efaulhaber Thank you for this suggestion/discussion starter. At the moment, our organization is on a "Team" plan, so we should already have up to 60 concurrent jobs (please correct me if this is wrong).

I have thought about adding self-hosted runners multiple times myself. However, there are a few issues related to that, the biggest one being the non-negligible management overhead this creates for one or more people in the Trixi team. In addition, without Docker you'd probably need some other form of virtualization, as otherwise you can have only a single runner per machine (which is very inefficient). Also, without disposable containers, there might be security issues when other people's (arbitrary) code runs on the machine.

Having said this, I still think it would be a good and interesting option to have self-hosted runners. Maybe we can find a friendly university computing center who'd be interested in doing a pilot program with us to support research software development?

sloede commented 3 years ago

From my point of view, it would be great if we could figure out a minimal subset of tests that is considered to be "broad enough" to cover things that might go wrong when running on different OS.

I am absolutely open to this idea and think it's a good suggestion. To me, it's mostly a matter of developer time resources. One would have to (at least)

If someone is willing to put in the time for this (or if we collectively decide that we should make this a priority and all put in the time), I'm game. IMHO we should also consider if we can save additional time by running different sets of tests depending on whether a PR is marked as draft or not. E.g., only run Windows/macOS tests once a PR is marked ready for review to make draft PRs finish faster.
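The draft/ready split could be expressed as a small selection rule. The sketch below is only an illustration in Python; the platform labels and the function name are hypothetical, and in practice this logic would live in the workflow configuration rather than in a script.

```python
# Hypothetical platform lists; the actual CI matrix may differ.
FAST_PLATFORMS = ["ubuntu-latest"]
FULL_PLATFORMS = ["ubuntu-latest", "macos-latest", "windows-latest"]


def platforms_for_pr(is_draft):
    """Return the platforms to test: Linux only for drafts, all three otherwise."""
    return list(FAST_PLATFORMS if is_draft else FULL_PLATFORMS)


for draft in (True, False):
    print(f"draft={draft}: {platforms_for_pr(draft)}")
```

With such a rule, a draft PR would occupy only a third of the runners per test set until it is marked ready for review.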

efaulhaber commented 3 years ago

At the moment, our organization is on a "Team" plan, so we should already have up to 60 concurrent jobs

Are you sure? I counted the running jobs when some were queued and I only counted 20.

ranocha commented 3 years ago

run Windows/macOS tests once a PR is marked ready for review to make draft PRs finish faster

Sounds like a good idea :+1:

sloede commented 3 years ago

At the moment, our organization is on a "Team" plan, so we should already have up to 60 concurrent jobs

Are you sure? I counted the running jobs when some were queued and I only counted 20.

When you encounter queued jobs again, could you please count again to make sure it's limited to 20? If yes, I will ask the GitHub support about it.
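To make the counting less error-prone, one could tally job states from job records instead of counting by hand. This is a minimal sketch, assuming records shaped like the entries of the "jobs" array returned by the GitHub REST API when listing the jobs of a workflow run (each with a "status" of "queued", "in_progress", or "completed"). The sample data and the function name are made up, and fetching the records for all active runs is left out.

```python
from collections import Counter


def count_job_states(jobs):
    """Count jobs per status ("queued", "in_progress", "completed")."""
    return Counter(job["status"] for job in jobs)


# Made-up sample: 20 running jobs and 7 queued ones across all active runs.
sample_jobs = (
    [{"name": f"test-{i}", "status": "in_progress"} for i in range(20)]
    + [{"name": f"test-{i}", "status": "queued"} for i in range(20, 27)]
)

counts = count_job_states(sample_jobs)
print(f"running: {counts['in_progress']}, queued: {counts['queued']}")
```

If the number of running jobs stays at 20 whenever others are queued, that would support the suspicion that the limit is 20 rather than 60.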