Open hopeai opened 1 year ago
@hopeai Can you share more of the workflow that you are running even if it's mostly redacted? I going to try and reproduce the behavior you are seeing again.
Sure @dacbd,
I'm using the following Buildkite
pipeline to manually launch a gpu server when I need.
agents:
queue: "my-buildkite-agent"
steps:
- block: ":warning: Provision ec2 instance"
- command: "buildkite-agent pipeline upload .buildkite/cml_launch_runner.yml"
label: ":pipeline: Pipeline upload"
if: "build.branch == 'MYBRANCH'"
- wait
- command: "buildkite-agent pipeline upload .buildkite/trigger_my_buildkite_pipeline.yaml"
label: ":pipeline: Pipeline upload"
if: "build.branch == 'MYBRANCH'"
The two pipelines which previous pipeline uploads are as follows:
agents:
queue: "my-buildkite-agent"
steps:
label: ":gcloud: setup gcp creds" command: "./scripts/configure_gcloud.sh"
wait
label: ":gcloud: launch GCP Snapshot Creation Instance" command:
agents:
queue: "my-buildkite-agent"
steps:
The bash scripts which are used in the pipelines are as follows:
- ./scripts/configure_gcloud.sh
```bash
#!/bin/bash
set -eu -o pipefail
echo ":gcloud: Configure gcloud..."
export GOOGLE_APPLICATION_CREDENTIALS_DATA=$BUILDKITE_GS_APPLICATION_CREDENTIALS_JSON
echo "$GOOGLE_APPLICATION_CREDENTIALS_DATA" >~/gcloud-service-account-key.json gcloud -q auth activate-service-account --key-file ~/gcloud-service-account-key.json gcloud -q config set project MY-PROJECT gcloud -q auth configure-docker gcloud -v
echo "Done!"
- ./scripts/cml_launch_runner.sh
```bash
#!/bin/bash
set -eu -o pipefail
# Build a unique instance name
CML_RUNNER_NAME="some-name"
# Provision an GCP EC2 g4dn.xlage instance
cml runner launch \
--cloud=gcp \
--cloud-region=us-central1-a \
--cloud-type=m+t4 \
--name="$CML_RUNNER_NAME" \
--labels=gpu-cml \
--token="$BUILDKITE_GITHUB_TOKEN" \
--repo=https://github.com/ORG/myrepo.git \
--cloud-hdd-size=60 \
--idle-timeout=120
#!/bin/bash
curl -L \ -X POST \ -H "Accept: application/vnd.github+json" \ -H "Authorization: Bearer $BUILDKITE_GITHUB_TOKEN" \ -H "X-GitHub-Api-Version: 2022-11-28" \ --url https://api.github.com/repos/intenseye/myrepo/actions/workflows/my_workflow.yml/dispatches \ -d '{"ref":"'"$BUILDKITE_BRANCH"'"}'
By running these pipelines everything works as expected and cml starts a gpu server on GCP and this self-hosted runner accepts my GitHub workflow jobs and jobs finish successfully. However, after idle-timeout GCP VM is not terminated. I checked the logs in self-hosted runner and shared here. I think this could help you to reproduce this behavior.
> ### Related to issue #834
> Facing this issue with cml v0.19.0
>
> I'm starting GCP instance using the following command inside a docker container which uses `iterativeai/cml:latest` image in a buildkite pipeline. I also share `GOOGLE_APPLICATION_CREDENTIALS_DATA` as environment variable, it starts and register self-hosted runner without any problem.
>
> I configured gcloud with:
>
> ```
> echo "$GOOGLE_APPLICATION_CREDENTIALS_DATA" >~/gcloud-service-account-key.json
> gcloud -q auth activate-service-account --key-file ~/gcloud-service-account-key.json
> gcloud -q config set project PROJECT
> gcloud -q auth configure-docker
> ```
>
> run `cml runner launch` with
>
> ```
> cml runner launch \
> --cloud=gcp \
> --cloud-region=us-central1-a \
> --cloud-type=m+t4 \
> --name="$CML_RUNNER_NAME" \
> --labels=gcp-snapshot-creation \
> --token="$BUILDKITE_GITHUB_TOKEN" \
> --repo=https://github.com/ORG/REPO.git \
> --cloud-hdd-size=60 \
> --idle-timeout=120
> ```
>
> However, after idle-timeout passes GCP VM doesn't terminate. The output of `journalctl --unit cml --no-pager` is as follows:
>
> ```
> -- Logs begin at Fri 2023-06-09 10:33:16 UTC, end at Fri 2023-06-09 11:28:01 UTC. --
> Jun 09 10:38:43 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 systemd[1]: Started cml.service.
> Jun 09 10:38:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"POST /repos/ORG/REPO/actions/runners/registration-token - 201 in 287ms"}
> Jun 09 10:38:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"GET /repos/ORG/REPO/actions/runners?per_page=100 - 200 in 234ms"}
> Jun 09 10:38:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"warn","message":"Github Actions timeout has been updated from 72h to 35 days. Update your workflow accordingly to be able to restart it automatically."}
> Jun 09 10:38:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Preparing workdir /tmp/tmp.SpVQaKjpnL/.cml/cml-REPO-snapshot-t4-wi89wx1i-4src3nn4..."}
> Jun 09 10:38:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Launching github runner"}
> Jun 09 10:38:48 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Terraform 1.4.6"}
> Jun 09 10:38:48 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Plan: 0 to add, 0 to change, 0 to destroy."}
> Jun 09 10:38:48 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Apply complete! Resources: 0 added, 0 changed, 0 destroyed."}
> Jun 09 10:38:48 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Outputs: 0"}
> Jun 09 10:38:48 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Connected to acpid service."}
> Jun 09 10:38:58 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"POST /repos/ORG/REPO/actions/runners/registration-token - 201 in 307ms"}
> Jun 09 10:39:01 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"date":"2023-06-09T10:39:01.672Z","level":"info","message":"runner status","repo":"https://github.com/ORG/REPO","status":"ready"}
> Jun 09 10:39:20 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"date":"2023-06-09T10:39:20.270Z","level":"info","message":"runner status","repo":"https://github.com/ORG/REPO","status":"job_started"}
> Jun 09 10:43:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"date":"2023-06-09T10:43:44.840Z","level":"info","message":"runner status","repo":"https://github.com/ORG/REPO","status":"job_ended","success":true}
> Jun 09 10:43:47 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"date":"2023-06-09T10:43:47.170Z","level":"info","message":"runner status","repo":"https://github.com/ORG/REPO","status":"job_started"}
> Jun 09 11:02:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"date":"2023-06-09T11:02:44.588Z","level":"info","message":"runner status","repo":"https://github.com/ORG/REPO","status":"job_ended","success":true}
> Jun 09 11:04:45 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Unregistering runner cml-REPO-snapshot-t4-wi89wx1i-4src3nn4..."}
> Jun 09 11:04:45 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"GET /repos/ORG/REPO/actions/runners?per_page=100 - 403 in 159ms"}
> Jun 09 11:04:45 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Retrying because of rate limit in 990 seconds"}
> Jun 09 11:21:15 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"GET /repos/ORG/REPO/actions/runners?per_page=100 - 200 in 260ms"}
> Jun 09 11:21:15 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"DELETE /repos/ORG/REPO/actions/runners/314 - 204 in 371ms"}
> Jun 09 11:21:15 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"\tSuccess"}
> Jun 09 11:21:15 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Waiting 10 seconds to destroy"}
> Jun 09 11:21:26 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"runner status","reason":"timeout:120","status":"terminated"}
> Jun 09 11:21:26 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 systemd[1]: cml.service: Succeeded
> ```
>
> I am not facing any issue with aws ec2 instances and those terminate after idle-timeout.
The total runtime Is close to one hour which makes me think that your gcloud -q auth activate-service-accouth
method of authorization is generating an Oauth token which has expired by the time CML tries to shutdown.
I believe if you run the following query in GCP's "log explorer" you will NOT see a delete request (because it will be unauthorised from the expired credentials)
protoPayload.authenticationInfo.principalEmail="your_service_account_name@your_project_name.iam.gserviceaccount.com"
resource.type="gce_instance"
protoPayload.methodName="v1.compute.instances.delete"
Can you try authenticating your github actions via the following:
- name: 'Authenticate to Google Cloud'
uses: 'google-github-actions/auth@v0'
with:
credentials_json: ${{ secrets.GCP_CML_RUNNER_KEY }}
where secrets.GCP_CML_RUNNER_KEY
is the contents of the json key. ( I would remove the white space after and before the jsons opening/closing {}
's so that github actions doesn't try to obsfcate single characters like {
or }
)
you can also configure docker in a similar fashion:
- name: Login to GAR
uses: docker/login-action@v2
with:
registry: <location>-docker.pkg.dev
username: _json_key
password: ${{ secrets.GAR_JSON_KEY }}
Thanks a lot @dacbd for pointing this out. Let me try this
Related to issue #834
Facing this issue with cml v0.19.0
I'm starting GCP instance using the following command inside a docker container which uses
iterativeai/cml:latest
image in a buildkite pipeline. I also shareGOOGLE_APPLICATION_CREDENTIALS_DATA
as environment variable, it starts and register self-hosted runner without any problem.I configured gcloud with:
run
cml runner launch
withHowever, after idle-timeout passes GCP VM doesn't terminate. The output of
journalctl --unit cml --no-pager
is as follows:I am not facing any issue with aws ec2 instances and those terminate after idle-timeout.