stop not working when start was cancelled by concurrency cancelation rule

stas00 commented 3 years ago

To ensure that no more than one instance is running on several consequent pushes, the idea is to cancel the previous still running workflows on a new push, so we added a global setting:

concurrency: # cancel previous build on a new push
  group: ${{ github.ref }} # https://docs.github.com/en/actions/reference/context-and-expression-syntax-for-github-actions#github-context
  cancel-in-progress: true

now when I'm testing this, quite often stop fails to stop leaving the instance running, which is very expensive!

Run machulav/ec2-github-runner@v2
Error: Error: Not all the required inputs are provided for the 'stop' mode
Error: Not all the required inputs are provided for the 'stop' mode
Error: TypeError: Cannot read property 'mode' of undefined
Error: Cannot read property 'mode' of undefined

Here is the log from the start job:

 with:
    mode: start
    github-token: ***
    ec2-image-id: ami-03540b272db1624b7
    ec2-instance-type: p3.8xlarge
    security-group-id: sg-f2a4e2fc
    subnet-id: subnet-b7533b96
    aws-resource-tags: [
    {"Key": "Name", "Value": "ec2-github-runner"},
    {"Key": "GitHubRepository", "Value": "bigscience-workshop/Megatron-DeepSpeed"}
  ]

  env:
    AWS_DEFAULT_REGION: us-east-1
    AWS_REGION: us-east-1
    AWS_ACCESS_KEY_ID: ***
    AWS_SECRET_ACCESS_KEY: ***
GitHub Registration Token is received
AWS EC2 instance i-038eeed014c994b48 is started
Error: The operation was canceled.

so it looks like it didn't set the vars it was supposed to set because it was cancelled.

Here is the full workflow for context: https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/7c636d7555e915f1f426984172f73840b2168313/.github/workflows/main.yml

If there are other solutions I'm all ears.

Thank you!

jpalomaki commented 3 years ago

If you are stopping (canceling) the runner start job before it gets a chance to finish, I don't think there's much the ec2-github-runner action could do to guard against that; it simply never gets to output the instance id and label that are needed for the runner stop job to work.

How about not using cancel-in-progress, but instead have the workflow runs queue up? Docs say: When a concurrent job or workflow is queued, if another job or workflow using the same concurrency group in the repository is in progress, the queued job or workflow will be pending. Any previously pending job or workflow in the concurrency group will be canceled

This means you'd only ever have one pending workflow run in the queue (I quickly tested this, and it seems to be the case).

Here's my test workflow (note how I don't use cancel-in-progress), try running it several times in succession:

name: Test

on:
  workflow_dispatch:

concurrency: lock

jobs:
  main:
    runs-on: ubuntu-20.04
    steps:
      - run: |
          echo "Running ..."
          sleep 30

jpalomaki commented 3 years ago

You can also explicitly skip workflow runs on push, by using commit messages: https://github.blog/changelog/2021-02-08-github-actions-skip-pull-request-and-push-workflows-with-skip-ci/.

Also, if you used the pull_request event, you could perhaps leverage draft PRs to conditionally skip workflows runs if github.event.pull_request.draft is true, until the PR is made ready for review.

stas00 commented 3 years ago

If you are stopping (canceling) the runner start job before it gets a chance to finish, I don't think there's much the ec2-github-runner action could do to guard against that; it simply never gets to output the instance id and label that are needed for the runner stop job to work.

That's because we have a race condition and nothing to guard against it.

Instead, there should be a sort of handshake protocol / or a failover self-stopping action / dead-man's switch.

So for example, the atomic failover protocol could be this:

- start action:
set a timeout flag on the ec2 side (user-data script?)
attempt to start
if started:
    set the instance id (which stop can use to stop)
    turn timeout flag off
else:
   do nothing

some time later - timeout on the ec2 side kicks in if it wasn't turned off and the instance shuts itself down

I don't quite know how to exactly implement this, but I hope that my proposal of "atomic action" can lead to some working solution.

Does github action have a sort of {{ always }} step that gets to run even if the workflow is cancelled? some sort of finalize.

stas00 commented 3 years ago

You can also explicitly skip workflow runs on push, by using commit messages: https://github.blog/changelog/2021-02-08-github-actions-skip-pull-request-and-push-workflows-with-skip-ci/.

Also, if you used the pull_request event, you could perhaps leverage draft PRs to conditionally skip workflows runs if github.event.pull_request.draft is true, until the PR is made ready for review.

All these suggestions are great for a handful of core devs, but we have contributors with a hugely varying level of expertise and it's already at times too much to even ask for a PR. So these won't work well.

stas00 commented 3 years ago

Thank you for showering me with all the different ideas, @jpalomaki

When a concurrent job or workflow is queued, if another job or workflow using the same concurrency group in the repository is in progress, the queued job or workflow will be pending. Any previously pending job or workflow in the concurrency group will be canceled.

yes, this may be good enough, other than wasting resources of continuing running the current job even if know that its outcome is irrelevant.

concurrency: lock

where have you found lock? I haven't seen this documented. Unless you're just using a random consistent string, which simply indicates that any PR of this repo should be lined up for a single resource. i.e. only a single instance ever runs across all PRs.

Ideally we want to prevent concurrent runs within the same PR. But other PRs should be totally independent and allowed to run their CI when needed.

Though it might be OK as well.

Currently it takes a really long time to run even a basic test due to many compilations that have to happen before the tests can run. I need to think how to minimize that overhead.

jpalomaki commented 3 years ago

Instead, there should be a sort of handshake protocol / or a failover self-stopping action / dead-man's switch.

Nice idea :+1: Not sure how complex/brittle the implementation would get, though.

Does github action have a sort of {{ always }} step that gets to run even if the workflow is cancelled? some sort of finalize.

Yep, the stop job is in fact annotated with that (see the if condition), but the problem is that the input to that job is now lacking (because the start job was cut short).

All these suggestions are great for a handful of core devs, but we have contributors with a hugely varying level of expertise and it's already at times too much to even ask for a PR. So these won't work well.

Understood. One option could then be to leverage PR comments, e.g. verbatim /test to trigger tests at a suitable spot. You can use the issue_comment event (comparing the comment text) to trigger a workflow when a PR comment is added. Or perhaps there are some ready-made github apps/integrations that could help with this?

where have you found lock?

That was just an example, static concurrency group name.

Ideally we want to prevent concurrent runs within the same PR. But other PRs should be totally independent and allowed to run their CI when needed.

All right, looks like you are in fact using ${{ github.ref }} for the concurrency group already :+1:

stas00 commented 3 years ago

Instead, there should be a sort of handshake protocol / or a failover self-stopping action / dead-man's switch.

Nice idea +1 Not sure how complex/brittle the implementation would get, though.

Won't actually this be as simple as doing:

  start-runner:
[...]
    steps:
      - run: perl -e 'sleep 600; qx[shutdown now]' & # start dead-man's switch
      - <normal start action here> 
      - run: killall -9 perl # cancel dead-man's switch

of course this is just a rough prototype to show the concept, and relies on not having any other perl programs running, but this is good enough to show the idea.

update: duh! there is no instance yet to run the switch on on the user side ;) so this would need to be called in user-data script - but perhaps this can be done by the action and hide the complexity from the user.

jpalomaki commented 3 years ago

Looks like @machulav has already done some work on this topic: https://github.com/machulav/ec2-github-runner/issues/7

stas00 commented 3 years ago

The linked discussion alludes that some timeout work will be done, but I can't find any PRs that actually implement it. Not sure why that issue was closed then, since we can see the race condition is still there.

stas00 commented 3 years ago

If I'm not mistaken the timeout issue is https://github.com/machulav/ec2-github-runner/issues/4 and it's still unresolved.

jpalomaki commented 3 years ago

Would it help if you used a combination of workflow-level concurrency control and job-level concurrency control? Then you could use cancel-in-progress for the test job, which I presume would be safe (since it would only cut the test job short), while still being able to limit concurrent workflow runs to just one as well?

stas00 commented 3 years ago

That is a great idea, @jpalomaki! I didn't know it was possible to combine both in the same workflow.

mgaitan commented 1 year ago

what I'm doing is to setup a croned self-termination of the instance using aws cli

 pre-runner-script: |
       aws ec2 terminate-instances --instance-ids $(curl -s http://169.254.169.254/latest/meta-data/instance-id) | at now +${{ inputs.timeout-minutes }} minutes

machulav / ec2-github-runner

stop not working when start was cancelled by concurrency cancelation rule #69