iterative / cml

♾️ CML - Continuous Machine Learning | CI/CD for ML
http://cml.dev
Apache License 2.0

GCP cloud runner not terminating #678

Closed lemontheme closed 2 years ago

lemontheme commented 3 years ago

This is a repeat of #661, which was supposedly fixed in #653. Unfortunately, I'm not seeing any changes in the shutdown behavior of my GCP compute instances. That is, they keep running past the timeout interval.

I'm using the same workflow as before (in #661):

name: 'Train-in-the-cloud-GCP'
on: 
  workflow_dispatch:

jobs:
  deploy-runner:
    runs-on: [ubuntu-latest]
    steps:
      - uses: iterative/setup-cml@v1
      - uses: actions/checkout@v2
      - name: 'Deploy runner on GCP'
        shell: bash
        env:
          REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
          # Notice use of `GOOGLE_APPLICATION_CREDENTIALS_DATA` instead of
          # `GOOGLE_APPLICATION_CREDENTIALS`. Contrary to what docs suggest, the
          # latter causes problems for terraform.
          GOOGLE_APPLICATION_CREDENTIALS_DATA: ${{ secrets.GOOGLE_APPLICATION_CREDENTIALS }}
        run: |
          cml-runner \
          --cloud gcp \
          --cloud-region europe-west1-b  \
          --cloud-type=n1-standard-1 \
          --labels=cml-runner

  model-training:
    needs: deploy-runner
    runs-on: [self-hosted, cml-runner]
    container: docker://dvcorg/cml-py3:latest
    steps:
      - uses: actions/checkout@v2
      - name: 'Train my dummy model'
        env:
          REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
        run: |
          echo "Training a super awesome model"
          sleep 5
          echo "Training complete"

Anyway, this seems to contradict the tests, as @DavidGOrtega explains in the comments under #653:

[...] tests with TPI indicates that the instances are disposed after the expected time.

Any idea what I might be doing wrong?

DavidGOrtega commented 3 years ago

I can say again that GCP is terminating as expected. Can you please log into the machine, run the following command, and let me know what it prints?

journalctl --unit cml --no-pager

@lemontheme

lemontheme commented 3 years ago

Sure thing. Here's what I get:

-- Logs begin at Thu 2021-07-29 14:22:56 UTC, end at Thu 2021-07-29 14:38:42 UTC. --
Jul 29 14:26:39 cml-36s36ywc7z systemd[1]: Started cml.service.
Jul 29 14:26:46 cml-36s36ywc7z cml.sh[17975]: Preparing workdir /tmp/tmp.b7BwstF7kJ/.cml/cml-07toknujbd...
Jul 29 14:26:46 cml-36s36ywc7z cml.sh[17975]: Launching github runner
Jul 29 14:27:10 cml-36s36ywc7z cml.sh[17975]: SpotNotifier can not be started.
Jul 29 14:27:11 cml-36s36ywc7z cml.sh[17975]: {"level":"info","date":"2021-07-29T14:27:11.452Z","repo":"https://github.com/lemontheme/mlops-with-gh-actions","message":""}
Jul 29 14:27:11 cml-36s36ywc7z cml.sh[17975]: {"level":"info","date":"2021-07-29T14:27:11.453Z","repo":"https://github.com/lemontheme/mlops-with-gh-actions","message":"√ "}
Jul 29 14:27:11 cml-36s36ywc7z cml.sh[17975]: {"level":"info","date":"2021-07-29T14:27:11.454Z","repo":"https://github.com/lemontheme/mlops-with-gh-actions","message":"Connected to GitHub"}
Jul 29 14:27:11 cml-36s36ywc7z cml.sh[17975]: {"level":"info","date":"2021-07-29T14:27:11.454Z","repo":"https://github.com/lemontheme/mlops-with-gh-actions","message":""}
Jul 29 14:27:11 cml-36s36ywc7z cml.sh[17975]: {"level":"info","date":"2021-07-29T14:27:11.995Z","repo":"https://github.com/lemontheme/mlops-with-gh-actions","status":"ready","message":"Listening for Jobs"}
Jul 29 14:27:22 cml-36s36ywc7z cml.sh[17975]: {"level":"info","date":"2021-07-29T14:27:22.333Z","repo":"https://github.com/lemontheme/mlops-with-gh-actions","job":3192860335,"status":"job_started","message":"Running job: model-training"}
Jul 29 14:34:29 cml-36s36ywc7z cml.sh[17975]: {"level":"info","date":"2021-07-29T14:34:29.721Z","repo":"https://github.com/lemontheme/mlops-with-gh-actions","job":"","status":"job_ended","success":true,"message":"Job model-training completed with result: Succeeded"}
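
To read a journal like the one above, it can help to pull out the JSON events and measure how long the runner had been idle when the log was captured. The following is a minimal sketch, not part of CML: the helper names `parse_events` and `idle_seconds` are hypothetical, the sample lines are trimmed copies of the log above (the `repo` field is dropped), and `log_end` is taken from the "Logs begin ... end at" header.

```python
import json
from datetime import datetime, timezone


def parse_events(journal_text):
    """Return the JSON payloads embedded in journalctl lines, skipping other lines."""
    events = []
    for line in journal_text.splitlines():
        start = line.find("{")
        if start == -1:
            continue  # plain-text line such as "Launching github runner"
        try:
            events.append(json.loads(line[start:]))
        except json.JSONDecodeError:
            continue  # e.g. a line that was cut in half by terminal wrapping
    return events


def idle_seconds(events, now):
    """Seconds between the last job_ended event and `now`, or None if no job ended."""
    ended = [e for e in events if e.get("status") == "job_ended"]
    if not ended:
        return None
    # The log uses a trailing "Z" for UTC; spell it as an explicit offset.
    last = datetime.fromisoformat(ended[-1]["date"].replace("Z", "+00:00"))
    return (now - last).total_seconds()


# Trimmed sample lines copied from the journal output above.
sample = (
    'Jul 29 14:27:22 cml-36s36ywc7z cml.sh[17975]: {"level":"info",'
    '"date":"2021-07-29T14:27:22.333Z","status":"job_started",'
    '"message":"Running job: model-training"}\n'
    'Jul 29 14:34:29 cml-36s36ywc7z cml.sh[17975]: {"level":"info",'
    '"date":"2021-07-29T14:34:29.721Z","status":"job_ended","success":true,'
    '"message":"Job model-training completed with result: Succeeded"}'
)

# End of the captured journal, per its header line.
log_end = datetime(2021, 7, 29, 14, 38, 42, tzinfo=timezone.utc)

print(idle_seconds(parse_events(sample), log_end))
```

The resulting number is only meaningful next to the idle timeout the runner was actually started with: if the journal was captured before that interval elapsed, the absence of a shutdown message does not yet show that the timer failed.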
DavidGOrtega commented 3 years ago

And it is still running? As far as I can see, the timeout is not happening... How weird.

DavidGOrtega commented 3 years ago

I can see that the runner terminates itself properly on its idle timeout. However, when I destroy it using the Terraform provider, GCP sometimes does not send the graceful shutdown signal.

However, this does reflect the issue here, where it seems that the timer might not be working.
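
The idle-timeout behaviour discussed in this thread can be pictured as a small state machine: the countdown starts when the runner has no jobs, is cancelled while a job runs, and restarts when the last job ends. The sketch below only illustrates that idea and is not CML's implementation; the `IdleTimer` class, its method names, and the rule that a timeout of 0 or less disables shutdown are all assumptions made for the example.

```python
import time


class IdleTimer:
    """Tracks how long a runner has been idle and says when it should shut down."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout = timeout_seconds
        self.clock = clock            # injectable so the logic can be tested
        self.running_jobs = 0
        self.idle_since = clock()     # a fresh runner starts out idle

    def job_started(self):
        self.running_jobs += 1
        self.idle_since = None        # busy: no countdown in progress

    def job_ended(self):
        self.running_jobs = max(0, self.running_jobs - 1)
        if self.running_jobs == 0:
            self.idle_since = self.clock()  # restart the countdown

    def should_shut_down(self):
        # Assumption for this sketch: a non-positive timeout disables shutdown.
        if self.timeout <= 0 or self.idle_since is None:
            return False
        return self.clock() - self.idle_since >= self.timeout


# Drive the timer with a fake clock instead of waiting in real time.
now = [0.0]
timer = IdleTimer(timeout_seconds=300, clock=lambda: now[0])

timer.job_started()
now[0] = 400.0
print(timer.should_shut_down())   # still busy, so no shutdown

timer.job_ended()
now[0] = 700.0
print(timer.should_shut_down())   # idle for the full timeout
```

A log like the one earlier in the thread, with a `job_ended` event and nothing after it, corresponds to the state right after `job_ended()` here: the interesting question is whether the periodic `should_shut_down()` check is ever reached once the timeout has passed.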

dacbd commented 2 years ago

@lemontheme I believe this issue is resolved, can you confirm your workflow is functional without any workarounds?

lemontheme commented 2 years ago

Hi @dacbd, sorry to keep you waiting. Been a while since I looked at this.

Anyway, I'm happy to confirm that instances are now indeed stopped and deleted as expected! :) That's using the exact same workflow as above. Great to see you've made progress with this. Thanks!

0x2b3bfa0 commented 2 years ago

Thank you very much, @dacbd for the fix and @lemontheme for confirming the resolution!