If you are stopping (canceling) the runner start job before it gets a chance to finish, I don't think there's much the `ec2-github-runner` action could do to guard against that; it simply never gets to output the instance id and label that are needed for the runner stop job to work.
How about not using `cancel-in-progress`, but instead have the workflow runs queue up? The docs say:

> When a concurrent job or workflow is queued, if another job or workflow using the same concurrency group in the repository is in progress, the queued job or workflow will be pending. Any previously pending job or workflow in the concurrency group will be canceled.

This means you'd only ever have one pending workflow run in the queue (I quickly tested this, and it seems to be the case).

Here's my test workflow (note how I don't use `cancel-in-progress`); try running it several times in succession:
```yaml
name: Test

on:
  workflow_dispatch:

concurrency: lock

jobs:
  main:
    runs-on: ubuntu-20.04
    steps:
      - run: |
          echo "Running ..."
          sleep 30
```
You can also explicitly skip workflow runs on push by using commit messages: https://github.blog/changelog/2021-02-08-github-actions-skip-pull-request-and-push-workflows-with-skip-ci/.

Also, if you used the `pull_request` event, you could perhaps leverage draft PRs to conditionally skip workflow runs if `github.event.pull_request.draft` is `true`, until the PR is made ready for review.
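A minimal sketch of the draft-PR idea (the job name and steps here are made up for illustration, not taken from the actual workflow):

```yaml
name: CI

on:
  pull_request:
    types: [opened, synchronize, reopened, ready_for_review]

jobs:
  test:
    # Skip the whole job while the PR is still a draft
    if: github.event.pull_request.draft == false
    runs-on: ubuntu-20.04
    steps:
      - run: echo "Running tests ..."
```

The `ready_for_review` activity type makes the workflow fire again once the draft is promoted, so contributors don't need to push a new commit just to trigger CI.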
> If you are stopping (canceling) the runner start job before it gets a chance to finish, I don't think there's much the `ec2-github-runner` action could do to guard against that; it simply never gets to output the instance id and label that are needed for the runner stop job to work.
That's because we have a race condition and nothing to guard against it.
Instead, there should be a sort of handshake protocol, or a failover self-stopping action / dead-man's switch.
So, for example, the atomic failover protocol could be this:

- start action:
  - set a timeout flag on the EC2 side (user-data script?)
  - attempt to start
  - if started:
    - set the instance id (which stop can use to stop)
    - turn the timeout flag off
  - else:
    - do nothing
- some time later: the timeout on the EC2 side kicks in if it wasn't turned off, and the instance shuts itself down

I don't quite know how exactly to implement this, but I hope that my proposal of an "atomic action" can lead to some working solution.
Does GitHub Actions have a sort of `{{ always }}` step that gets to run even if the workflow is cancelled? Some sort of finalize step?
> You can also explicitly skip workflow runs on push by using commit messages: https://github.blog/changelog/2021-02-08-github-actions-skip-pull-request-and-push-workflows-with-skip-ci/.
>
> Also, if you used the `pull_request` event, you could perhaps leverage draft PRs to conditionally skip workflow runs if `github.event.pull_request.draft` is `true`, until the PR is made ready for review.
All these suggestions are great for a handful of core devs, but we have contributors with a hugely varying level of expertise and it's already at times too much to even ask for a PR. So these won't work well.
Thank you for showering me with all the different ideas, @jpalomaki
> When a concurrent job or workflow is queued, if another job or workflow using the same concurrency group in the repository is in progress, the queued job or workflow will be pending. Any previously pending job or workflow in the concurrency group will be canceled.
Yes, this may be good enough, other than wasting resources by continuing to run the current job even when we know that its outcome is irrelevant.
> `concurrency: lock`

Where have you found `lock`? I haven't seen this documented. Unless you're just using a random but consistent string, which simply indicates that any PR of this repo should be lined up for a single resource, i.e. only a single instance ever runs across all PRs.
Ideally we want to prevent concurrent runs within the same PR. But other PRs should be totally independent and allowed to run their CI when needed.
Though it might be OK as well.
Currently it takes a really long time to run even a basic test due to many compilations that have to happen before the tests can run. I need to think how to minimize that overhead.
> Instead, there should be a sort of handshake protocol, or a failover self-stopping action / dead-man's switch.

Nice idea :+1: Not sure how complex/brittle the implementation would get, though.
> Does GitHub Actions have a sort of `{{ always }}` step that gets to run even if the workflow is cancelled? Some sort of finalize step?

Yep, the stop job is in fact annotated with that (see the `if` condition), but the problem is that the input to that job is now lacking (because the start job was cut short).
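For reference, the stop job in question follows roughly this shape (a simplified sketch of the usual ec2-github-runner pattern; the AWS credentials step is omitted, and the job names and secret name are placeholders):

```yaml
stop-runner:
  name: Stop self-hosted EC2 runner
  needs: [start-runner, run-tests]
  runs-on: ubuntu-20.04
  # Run even if earlier jobs failed or the workflow run was cancelled
  if: ${{ always() }}
  steps:
    - name: Stop EC2 runner
      uses: machulav/ec2-github-runner@v2
      with:
        mode: stop
        github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
        label: ${{ needs.start-runner.outputs.label }}
        ec2-instance-id: ${{ needs.start-runner.outputs.ec2-instance-id }}
```

If the start job is cancelled before it reaches its output-setting step, `needs.start-runner.outputs.*` comes through empty here, which is exactly the failure mode being discussed.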
> All these suggestions are great for a handful of core devs, but we have contributors with a hugely varying level of expertise and it's already at times too much to even ask for a PR. So these won't work well.
Understood. One option could then be to leverage PR comments, e.g. a verbatim `/test` comment to trigger tests at a suitable spot. You can use the `issue_comment` event (comparing the comment text) to trigger a workflow when a PR comment is added. Or perhaps there are some ready-made GitHub apps/integrations that could help with this?
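A rough sketch of the comment-triggered approach (the `/test` keyword and the job body are illustrative only):

```yaml
name: Comment-triggered tests

on:
  issue_comment:
    types: [created]

jobs:
  test:
    # issue_comment also fires on plain issues, so restrict it to PR comments
    if: github.event.issue.pull_request && github.event.comment.body == '/test'
    runs-on: ubuntu-20.04
    steps:
      - run: echo "Running tests for PR #${{ github.event.issue.number }} ..."
```

One caveat worth knowing: `issue_comment` workflows run against the default branch, so actually testing the PR's code requires an explicit checkout of the PR head in a later step.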
> Where have you found `lock`?

That was just an example of a static concurrency group name.
> Ideally we want to prevent concurrent runs within the same PR. But other PRs should be totally independent and allowed to run their CI when needed.

All right, looks like you are in fact using `${{ github.ref }}` for the concurrency group already :+1:
> > Instead, there should be a sort of handshake protocol, or a failover self-stopping action / dead-man's switch.
>
> Nice idea :+1: Not sure how complex/brittle the implementation would get, though.
Won't this actually be as simple as doing:

```
start-runner:
  [...]
  steps:
    - run: perl -e 'sleep 600; qx[shutdown now]' &   # start dead-man's switch
    - <normal start action here>
    - run: killall -9 perl                           # cancel dead-man's switch
```
Of course this is just a rough prototype to show the concept, and it relies on not having any other perl programs running, but it's good enough to show the idea.

update: duh! on the user side there is no instance yet to run the switch on ;) so this would need to be called in the user-data script - but perhaps the action could do that itself and hide the complexity from the user.
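Sketching what that might look like, using the action's `pre-runner-script` input (the same hook used in the aws cli snippet further down) to arm the switch on the instance itself; the 10-minute timeout is arbitrary, the disarm step is left open, and this assumes the AMI has `at` and the AWS CLI installed:

```yaml
      - name: Start EC2 runner
        uses: machulav/ec2-github-runner@v2
        with:
          mode: start
          # ... other inputs ...
          pre-runner-script: |
            # Arm the dead-man's switch: self-terminate in 10 minutes.
            echo "aws ec2 terminate-instances --instance-ids $(curl -s http://169.254.169.254/latest/meta-data/instance-id)" | at now + 10 minutes
            # Disarming (e.g. atrm of the scheduled job once the runner has
            # registered and the start job has finished) is the part that
            # still needs a handshake of some kind.
```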
Looks like @machulav has already done some work on this topic: https://github.com/machulav/ec2-github-runner/issues/7
The linked discussion suggests that some timeout work was planned, but I can't find any PRs that actually implement it. Not sure why that issue was closed then, since we can see the race condition is still there.
If I'm not mistaken the timeout issue is https://github.com/machulav/ec2-github-runner/issues/4 and it's still unresolved.
Would it help if you used a combination of workflow-level concurrency control and job-level concurrency control? Then you could use `cancel-in-progress` for the test job, which I presume would be safe (since it would only cut the test job short), while still being able to limit concurrent workflow runs to just one as well?
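Read literally, perhaps something along these lines (group names and the job layout are illustrative; the real workflow's jobs differ):

```yaml
# Workflow level: queue runs per ref, but never cancel them,
# so start-runner/stop-runner always get to finish.
concurrency:
  group: ${{ github.ref }}

jobs:
  test:
    # Job level: only the test job may be cut short by a newer push.
    concurrency:
      group: test-${{ github.ref }}
      cancel-in-progress: true
    runs-on: ubuntu-20.04
    steps:
      - run: echo "Running tests ..."
```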
That is a great idea, @jpalomaki! I didn't know it was possible to combine both in the same workflow.
What I'm doing is to set up a croned self-termination of the instance using the aws cli:

```yaml
pre-runner-script: |
  echo "aws ec2 terminate-instances --instance-ids $(curl -s http://169.254.169.254/latest/meta-data/instance-id)" | at now +${{ inputs.timeout-minutes }} minutes
```
To ensure that no more than one instance is running across several consecutive pushes, the idea is to cancel the previous, still-running workflow runs on a new push, so we added a global setting:
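Presumably along these lines, judging from the linked workflow and the discussion above (the exact group expression may differ):

```yaml
concurrency:
  group: ${{ github.ref }}
  cancel-in-progress: true
```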
Now when I'm testing this, quite often the stop job fails to stop the runner, leaving the instance running, which is very expensive!

Here is the log from the start job:

So it looks like it didn't set the vars it was supposed to set, because it was cancelled.
Here is the full workflow for context: https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/7c636d7555e915f1f426984172f73840b2168313/.github/workflows/main.yml
If there are other solutions I'm all ears.
Thank you!