iterative / cml

♾️ CML - Continuous Machine Learning | CI/CD for ML
http://cml.dev
Apache License 2.0
4.04k stars 339 forks source link

GCP VM instance not terminating after timeout #834

Closed eemelipa closed 2 years ago

eemelipa commented 2 years ago

Similar issue to https://github.com/iterative/cml/issues/678

I'm starting a self hosted runner via Gitlab CICD to GCP:

deploy-runner:
  stage: start runner
  image: iterativeai/cml:0-dvc2-base1
  resource_group: all
  script:
    - cml runner --cloud=gcp --cloud-region=eu-north --cloud-type=c2-standard-4 --labels=cml-runner --reuse --idle-timeout 600

After the timeout the VM instance is not shutting down.

journalctl --unit cml --no-pager command shows

-- Logs begin at Sat 2021-12-04 18:40:23 UTC, end at Sat 2021-12-04 19:02:12 UTC. --
Dec 04 18:43:29 cml-4ejd6b8lzc systemd[1]: Started cml.service.
Dec 04 18:43:37 cml-4ejd6b8lzc cml.sh[17099]: {"level":"info","message":"Preparing workdir /tmp/tmp.rtgpwktKf5/.cml/cml-4ejd6b8lzc..."}
Dec 04 18:43:37 cml-4ejd6b8lzc cml.sh[17099]: {"level":"info","message":"Launching gitlab runner"}
Dec 04 18:43:41 cml-4ejd6b8lzc cml.sh[17099]: {"level":"warn","message":"SpotNotifier can not be started."}
Dec 04 18:43:41 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:43:41.453Z","level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml"}
Dec 04 18:43:41 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:43:41.454Z","level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml","status":"ready"}
Dec 04 18:43:42 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:43:42.276Z","job":1850241152,"level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml","status":"job_started"}
Dec 04 18:44:38 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:44:38.904Z","level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml"}
Dec 04 18:45:46 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:45:46.363Z","job":1850241152,"level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml"}
Dec 04 18:45:49 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:45:46.363Z","job":1850241152,"level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml"}
Dec 04 18:46:26 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:46:26.303Z","job":1850241152,"level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml","status":"job_ended","success":false}
Dec 04 18:46:26 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:46:26.649Z","job":1850241152,"level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml"}
Dec 04 18:46:26 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:46:26.952Z","job":1850241152,"level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml"}
Dec 04 18:46:26 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:46:26.953Z","level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml"}
Dec 04 18:56:28 cml-4ejd6b8lzc cml.sh[17099]: {"level":"info","message":"runner status","reason":"timeout:600","status":"terminated"}
Dec 04 18:56:28 cml-4ejd6b8lzc cml.sh[17099]: {"level":"info","message":"waiting 20 seconds before exiting..."}
Dec 04 18:56:48 cml-4ejd6b8lzc cml.sh[17099]: {"level":"info","message":"Unregistering runner cml-4ejd6b8lzc..."}
Dec 04 18:56:49 cml-4ejd6b8lzc cml.sh[17099]: {"level":"info","message":"\tSuccess"}
Dec 04 18:56:50 cml-4ejd6b8lzc systemd[1]: cml.service: Succeeded.

The runner picks up a job correctly and the runner deregisters itself from the Gitlab project. The VM instance just does not shutdown.

On Azure similar config worked ok and the instances were shutting down

eemelipa commented 2 years ago

Tried in different region and instance type: --cloud-region=europe-west1-b --cloud-type=n1-standard-1 but that did not help

DavidGOrtega commented 2 years ago

@eemelipa thanks a lot for the issue.

dacbd commented 2 years ago

hmm 🤔 the opposite of mine https://github.com/iterative/cml/issues/808

@eemelipa can you post a screenshot for me? After you run your pipeline go to the GCP dashboard for the project then the Activity tab and take a screenshot?

If it's in a busy project that has other stuff, search for the "Logs Explorer" and narrow the time to when your pipeline ran, and look for any "severity" high than "notice"

eemelipa commented 2 years ago

image

@dacbd here's the screenshot ^ There weren't any warn/error logs 🤔 The bootCounter: 2 latebootreportevent seems to happen correctly at 5min mark which was the idle-timeout in this case. I don't know if the logs are missing any rows though

The credentials have "Compute Admin" privilege

dacbd commented 2 years ago

@eemelipa there should be some firewall/network API calls. When I had the teardowns fail these calls failed and so the instance remained.

eemelipa commented 2 years ago

@eemelipa there should be some firewall/network API calls. When I had the teardowns fail these calls failed and so the instance remained.

Ok, I adjusted the log filtering and now the firewall inserts are visible:

image

No errors/warnings though. Any thoughts on what to look for next? I tried giving the service account owner access to the project (i.e., all privileges) but that didn't help

dacbd commented 2 years ago

An error here would have made this an easy fix 😞

There is something more complicated going on, "Compute Admin" is a sufficient role.

dacbd commented 2 years ago

I'm not sure how much help I can provide without digging into the setup more, or building a custom cml image with more debug logging for the terraform provider

Here you can see a full successful logging lifecycle in gcp image

dacbd commented 2 years ago

We might be able to get more info out of journalctl by adding a start-up script that creates a custom environment over-ride for the cml.service /etc/systemd/system/cml.service.d/debug.conf

I think something like this should work: debug.conf

[Service]
Environment="TF_LOG=DEBUG"
eemelipa commented 2 years ago

Spend good time fiddling manually on the VM instance and got a step forward

Looks like the problem is that the CML VM instance did not get any GCP service account:

image

When I put that in place manually and restarted the instance things worked! The instance got deleted after the idle-timeout.

Sounds like something should fail if the instance does not have a service account. Here's couple options that came to my mind (obviously you guys know better what's under the hood):

  1. The iterative terraform provider should fail when fetching credentials in case service account is not in place (maybe somewhere here https://github.com/iterative/terraform-provider-iterative/blob/550c698677b94644a80fa019fab6f62b26787543/iterative/gcp/provider.go#L295)
  2. The GCP service account is specified with --cloud-permission-set cml cli argument. If the --cloud=gcp but the service account is missing then the cml runner command should fail

So missing service account is one problem and we seem to also have a second problem. When I give service account to cml runner command it creates the VM instance correctly with the account but it does not give correct Cloud API access scopes:

image

When I gave manually the API scopes the idle-timeout deletion worked ok. Sounds like some changes might be needed to the CML instance creation

dacbd commented 2 years ago

Hmm, it sounds like some documentation clarification might be required?

Under the hood, cml runner adds the GOOGLE_APPLICATION_CREDENTIALS_DATA that cml was invoked with into the systemd service unit as those should be the credentials used for the creation of the instance and thus also should be used for the teardown of the instance.

The --cloud-permission-set takes (in GCP's case) the service account email to attach to the instance, the intent behind that is for the application or ML model to use to access other services from the cloud provider like s3/object storage.

Are you saying it looks like terraform tried to use those (the --cloud-permission-set) creds instead of the original cml runner ones? That is definitely not intended.

This should be easy for me to reproduce and I'll try to get it fixed soon, if you are on discord and willing to test out a patch I can let you know when I have something working (dabarnes on discord)

dacbd commented 2 years ago

Can you share more of the yml from gitlab-ci?

I think I misunderstood some of your last reply.

Can you share the permissions list or GCP managed role for the service account whose key should be set in the GOOGLE_APPLICATION_CREDENTIALS_DATA environment variable when you invoke your cml runner command and the same for the one you are trying to set with the --cloud-permission-set ?

Also, cat out the /etc/systemd/system/cml.service file and be sure to check and redact any sensitive values that might be there

edit: In the same screen as your first screenshot there's a metadata section. Can you verify GOOGLE_APPLICATION_CREDENTIALS_DATA json is your service account you invoked cml runner with, in the startup-script?

dacbd commented 2 years ago

Can be closed with TPI fix: https://github.com/iterative/terraform-provider-iterative/pull/333

casperdcl commented 2 years ago

fixed by https://github.com/iterative/terraform-provider-iterative/pull/333; thanks @dacbd