iterative / cml

♾️ CML - Continuous Machine Learning | CI/CD for ML
http://cml.dev
Apache License 2.0
4.03k stars 341 forks source link

GCP VM instance does not terminate after idle timeout #1386

Open hopeai opened 1 year ago

hopeai commented 1 year ago

Related to issue #834

Facing this issue with cml v0.19.0

I'm starting GCP instance using the following command inside a docker container which uses iterativeai/cml:latest image in a buildkite pipeline. I also share GOOGLE_APPLICATION_CREDENTIALS_DATA as environment variable, it starts and register self-hosted runner without any problem.

I configured gcloud with:

echo "$GOOGLE_APPLICATION_CREDENTIALS_DATA" >~/gcloud-service-account-key.json
gcloud -q auth activate-service-account --key-file ~/gcloud-service-account-key.json
gcloud -q config set project PROJECT
gcloud -q auth configure-docker

run cml runner launch with

cml runner launch \
--cloud=gcp \
--cloud-region=us-central1-a \
--cloud-type=m+t4    \
--name="$CML_RUNNER_NAME" \
--labels=gcp-snapshot-creation \
--token="$BUILDKITE_GITHUB_TOKEN" \
--repo=https://github.com/ORG/REPO.git \
--cloud-hdd-size=60 \
--idle-timeout=120

However, after idle-timeout passes GCP VM doesn't terminate. The output of journalctl --unit cml --no-pager is as follows:

-- Logs begin at Fri 2023-06-09 10:33:16 UTC, end at Fri 2023-06-09 11:28:01 UTC. --
Jun 09 10:38:43 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 systemd[1]: Started cml.service.
Jun 09 10:38:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"POST /repos/ORG/REPO/actions/runners/registration-token - 201 in 287ms"}
Jun 09 10:38:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"GET /repos/ORG/REPO/actions/runners?per_page=100 - 200 in 234ms"}
Jun 09 10:38:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"warn","message":"Github Actions timeout has been updated from 72h to 35 days. Update your workflow accordingly to be able to restart it automatically."}
Jun 09 10:38:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Preparing workdir /tmp/tmp.SpVQaKjpnL/.cml/cml-REPO-snapshot-t4-wi89wx1i-4src3nn4..."}
Jun 09 10:38:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Launching github runner"}
Jun 09 10:38:48 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Terraform 1.4.6"}
Jun 09 10:38:48 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Plan: 0 to add, 0 to change, 0 to destroy."}
Jun 09 10:38:48 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Apply complete! Resources: 0 added, 0 changed, 0 destroyed."}
Jun 09 10:38:48 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Outputs: 0"}
Jun 09 10:38:48 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Connected to acpid service."}
Jun 09 10:38:58 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"POST /repos/ORG/REPO/actions/runners/registration-token - 201 in 307ms"}
Jun 09 10:39:01 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"date":"2023-06-09T10:39:01.672Z","level":"info","message":"runner status","repo":"https://github.com/ORG/REPO","status":"ready"}
Jun 09 10:39:20 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"date":"2023-06-09T10:39:20.270Z","level":"info","message":"runner status","repo":"https://github.com/ORG/REPO","status":"job_started"}
Jun 09 10:43:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"date":"2023-06-09T10:43:44.840Z","level":"info","message":"runner status","repo":"https://github.com/ORG/REPO","status":"job_ended","success":true}
Jun 09 10:43:47 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"date":"2023-06-09T10:43:47.170Z","level":"info","message":"runner status","repo":"https://github.com/ORG/REPO","status":"job_started"}
Jun 09 11:02:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"date":"2023-06-09T11:02:44.588Z","level":"info","message":"runner status","repo":"https://github.com/ORG/REPO","status":"job_ended","success":true}
Jun 09 11:04:45 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Unregistering runner cml-REPO-snapshot-t4-wi89wx1i-4src3nn4..."}
Jun 09 11:04:45 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"GET /repos/ORG/REPO/actions/runners?per_page=100 - 403 in 159ms"}
Jun 09 11:04:45 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Retrying because of rate limit in 990 seconds"}
Jun 09 11:21:15 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"GET /repos/ORG/REPO/actions/runners?per_page=100 - 200 in 260ms"}
Jun 09 11:21:15 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"DELETE /repos/ORG/REPO/actions/runners/314 - 204 in 371ms"}
Jun 09 11:21:15 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"\tSuccess"}
Jun 09 11:21:15 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Waiting 10 seconds to destroy"}
Jun 09 11:21:26 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"runner status","reason":"timeout:120","status":"terminated"}
Jun 09 11:21:26 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 systemd[1]: cml.service: Succeeded

I am not facing any issue with aws ec2 instances and those terminate after idle-timeout.

dacbd commented 1 year ago

@hopeai Can you share more of the workflow that you are running even if it's mostly redacted? I going to try and reproduce the behavior you are seeing again.

hopeai commented 1 year ago

Sure @dacbd,

I'm using the following Buildkite pipeline to manually launch a gpu server when I need.

agents:
  queue: "my-buildkite-agent"
steps:
  - block: ":warning: Provision ec2 instance"
  - command: "buildkite-agent pipeline upload .buildkite/cml_launch_runner.yml"
    label: ":pipeline: Pipeline upload"
    if: "build.branch == 'MYBRANCH'"

  - wait

  - command: "buildkite-agent pipeline upload .buildkite/trigger_my_buildkite_pipeline.yaml"
    label: ":pipeline: Pipeline upload"
    if: "build.branch == 'MYBRANCH'"

The two pipelines which previous pipeline uploads are as follows:

steps:

steps:

set -eu -o pipefail

echo ":gcloud: Configure gcloud..."

export GOOGLE_APPLICATION_CREDENTIALS_DATA=$BUILDKITE_GS_APPLICATION_CREDENTIALS_JSON

echo "$GOOGLE_APPLICATION_CREDENTIALS_DATA" >~/gcloud-service-account-key.json gcloud -q auth activate-service-account --key-file ~/gcloud-service-account-key.json gcloud -q config set project MY-PROJECT gcloud -q auth configure-docker gcloud -v

echo "Done!"

- ./scripts/cml_launch_runner.sh
```bash
#!/bin/bash

set -eu -o pipefail

# Build a unique instance name
CML_RUNNER_NAME="some-name"

# Provision an GCP EC2 g4dn.xlage instance
cml runner launch \
--cloud=gcp \
--cloud-region=us-central1-a \
--cloud-type=m+t4    \
--name="$CML_RUNNER_NAME" \
--labels=gpu-cml \
--token="$BUILDKITE_GITHUB_TOKEN" \
--repo=https://github.com/ORG/myrepo.git \
--cloud-hdd-size=60 \
--idle-timeout=120

curl -L \ -X POST \ -H "Accept: application/vnd.github+json" \ -H "Authorization: Bearer $BUILDKITE_GITHUB_TOKEN" \ -H "X-GitHub-Api-Version: 2022-11-28" \ --url https://api.github.com/repos/intenseye/myrepo/actions/workflows/my_workflow.yml/dispatches \ -d '{"ref":"'"$BUILDKITE_BRANCH"'"}'


By running these pipelines everything works as expected and cml starts a gpu server on GCP and this self-hosted runner accepts my GitHub workflow jobs and jobs finish successfully. However, after idle-timeout GCP VM is not terminated. I checked the logs in self-hosted runner and shared here. I think this could help you to reproduce this behavior. 

> ### Related to issue #834
> Facing this issue with cml v0.19.0
> 
> I'm starting GCP instance using the following command inside a docker container which uses `iterativeai/cml:latest` image in a buildkite pipeline. I also share `GOOGLE_APPLICATION_CREDENTIALS_DATA` as environment variable, it starts and register self-hosted runner without any problem.
> 
> I configured gcloud with:
> 
> ```
> echo "$GOOGLE_APPLICATION_CREDENTIALS_DATA" >~/gcloud-service-account-key.json
> gcloud -q auth activate-service-account --key-file ~/gcloud-service-account-key.json
> gcloud -q config set project PROJECT
> gcloud -q auth configure-docker
> ```
> 
> run `cml runner launch` with
> 
> ```
> cml runner launch \
> --cloud=gcp \
> --cloud-region=us-central1-a \
> --cloud-type=m+t4  \
> --name="$CML_RUNNER_NAME" \
> --labels=gcp-snapshot-creation \
> --token="$BUILDKITE_GITHUB_TOKEN" \
> --repo=https://github.com/ORG/REPO.git \
> --cloud-hdd-size=60 \
> --idle-timeout=120
> ```
> 
> However, after idle-timeout passes GCP VM doesn't terminate. The output of `journalctl --unit cml --no-pager` is as follows:
> 
> ```
> -- Logs begin at Fri 2023-06-09 10:33:16 UTC, end at Fri 2023-06-09 11:28:01 UTC. --
> Jun 09 10:38:43 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 systemd[1]: Started cml.service.
> Jun 09 10:38:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"POST /repos/ORG/REPO/actions/runners/registration-token - 201 in 287ms"}
> Jun 09 10:38:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"GET /repos/ORG/REPO/actions/runners?per_page=100 - 200 in 234ms"}
> Jun 09 10:38:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"warn","message":"Github Actions timeout has been updated from 72h to 35 days. Update your workflow accordingly to be able to restart it automatically."}
> Jun 09 10:38:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Preparing workdir /tmp/tmp.SpVQaKjpnL/.cml/cml-REPO-snapshot-t4-wi89wx1i-4src3nn4..."}
> Jun 09 10:38:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Launching github runner"}
> Jun 09 10:38:48 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Terraform 1.4.6"}
> Jun 09 10:38:48 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Plan: 0 to add, 0 to change, 0 to destroy."}
> Jun 09 10:38:48 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Apply complete! Resources: 0 added, 0 changed, 0 destroyed."}
> Jun 09 10:38:48 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Outputs: 0"}
> Jun 09 10:38:48 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Connected to acpid service."}
> Jun 09 10:38:58 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"POST /repos/ORG/REPO/actions/runners/registration-token - 201 in 307ms"}
> Jun 09 10:39:01 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"date":"2023-06-09T10:39:01.672Z","level":"info","message":"runner status","repo":"https://github.com/ORG/REPO","status":"ready"}
> Jun 09 10:39:20 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"date":"2023-06-09T10:39:20.270Z","level":"info","message":"runner status","repo":"https://github.com/ORG/REPO","status":"job_started"}
> Jun 09 10:43:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"date":"2023-06-09T10:43:44.840Z","level":"info","message":"runner status","repo":"https://github.com/ORG/REPO","status":"job_ended","success":true}
> Jun 09 10:43:47 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"date":"2023-06-09T10:43:47.170Z","level":"info","message":"runner status","repo":"https://github.com/ORG/REPO","status":"job_started"}
> Jun 09 11:02:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"date":"2023-06-09T11:02:44.588Z","level":"info","message":"runner status","repo":"https://github.com/ORG/REPO","status":"job_ended","success":true}
> Jun 09 11:04:45 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Unregistering runner cml-REPO-snapshot-t4-wi89wx1i-4src3nn4..."}
> Jun 09 11:04:45 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"GET /repos/ORG/REPO/actions/runners?per_page=100 - 403 in 159ms"}
> Jun 09 11:04:45 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Retrying because of rate limit in 990 seconds"}
> Jun 09 11:21:15 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"GET /repos/ORG/REPO/actions/runners?per_page=100 - 200 in 260ms"}
> Jun 09 11:21:15 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"DELETE /repos/ORG/REPO/actions/runners/314 - 204 in 371ms"}
> Jun 09 11:21:15 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"\tSuccess"}
> Jun 09 11:21:15 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Waiting 10 seconds to destroy"}
> Jun 09 11:21:26 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"runner status","reason":"timeout:120","status":"terminated"}
> Jun 09 11:21:26 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 systemd[1]: cml.service: Succeeded
> ```
> 
> I am not facing any issue with aws ec2 instances and those terminate after idle-timeout.
dacbd commented 1 year ago

The total runtime Is close to one hour which makes me think that your gcloud -q auth activate-service-accouth method of authorization is generating an Oauth token which has expired by the time CML tries to shutdown.

I believe if you run the following query in GCP's "log explorer" you will NOT see a delete request (because it will be unauthorised from the expired credentials)

protoPayload.authenticationInfo.principalEmail="your_service_account_name@your_project_name.iam.gserviceaccount.com"
resource.type="gce_instance"
protoPayload.methodName="v1.compute.instances.delete"

Can you try authenticating your github actions via the following:

      - name: 'Authenticate to Google Cloud'
        uses: 'google-github-actions/auth@v0'
        with:
          credentials_json: ${{ secrets.GCP_CML_RUNNER_KEY }}

where secrets.GCP_CML_RUNNER_KEY is the contents of the json key. ( I would remove the white space after and before the jsons opening/closing {}'s so that github actions doesn't try to obsfcate single characters like { or })

you can also configure docker in a similar fashion:

      - name: Login to GAR
        uses: docker/login-action@v2
        with:
          registry: <location>-docker.pkg.dev
          username: _json_key
          password: ${{ secrets.GAR_JSON_KEY }}
hopeai commented 1 year ago

Thanks a lot @dacbd for pointing this out. Let me try this