eemelipa closed this 2 years ago
Tried a different region and instance type (--cloud-region=europe-west1-b --cloud-type=n1-standard-1), but that did not help.
@eemelipa thanks a lot for the issue.
hmm 🤔 the opposite of mine https://github.com/iterative/cml/issues/808
@eemelipa can you post a screenshot for me? After you run your pipeline go to the GCP dashboard for the project then the Activity tab and take a screenshot?
If it's a busy project with other activity, search for the "Logs Explorer", narrow the time range to when your pipeline ran, and look for any "severity" higher than "notice".
@dacbd here's the screenshot ^ There weren't any warn/error logs 🤔 The bootCounter: 2 latebootreportevent seems to happen correctly at 5min mark which was the idle-timeout in this case. I don't know if the logs are missing any rows though
The credentials have "Compute Admin" privilege
@eemelipa there should be some firewall/network API calls. When I had the teardowns fail these calls failed and so the instance remained.
Ok, I adjusted the log filtering and now the firewall inserts are visible:
No errors/warnings though. Any thoughts on what to look for next? I tried giving the service account owner access to the project (i.e., all privileges) but that didn't help
An error here would have made this an easy fix 😞
There is something more complicated going on, "Compute Admin" is a sufficient role.
I'm not sure how much help I can provide without digging into the setup more, or building a custom cml image with more debug logging for the terraform provider
Here you can see a full successful logging lifecycle in gcp
We might be able to get more info out of journalctl by adding a start-up script that creates a custom environment override for the cml.service: /etc/systemd/system/cml.service.d/debug.conf
I think something like this should work: debug.conf
[Service]
Environment="TF_LOG=DEBUG"
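A minimal sketch of how a start-up script could write that drop-in, assuming the unit really is named cml.service as in the path above. The function name write_cml_debug_dropin is made up for illustration and is not part of cml:

```shell
#!/usr/bin/env bash
# Sketch only: writes the debug.conf override suggested above.
# Assumption: the runner's unit is cml.service, so the drop-in directory is
# /etc/systemd/system/cml.service.d (taken from the path in this thread).

write_cml_debug_dropin() {
  # $1: drop-in directory (optional); defaults to the path from the thread.
  dropin_dir="${1:-/etc/systemd/system/cml.service.d}"
  mkdir -p "$dropin_dir" || return 1
  printf '%s\n' '[Service]' 'Environment="TF_LOG=DEBUG"' > "$dropin_dir/debug.conf"
}

# On the VM, as root, systemd has to re-read its unit files afterwards:
#   write_cml_debug_dropin && systemctl daemon-reload
# If cml.service is already running, it also needs a restart to pick up the
# new environment. The extra Terraform output should then show up in:
#   journalctl --unit cml --no-pager
```

The directory is a parameter only so the helper can be tried against a scratch directory before touching /etc.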
Spent a good while fiddling manually on the VM instance and got a step forward.
Looks like the problem is that the CML VM instance did not get any GCP service account:
When I put that in place manually and restarted the instance things worked! The instance got deleted after the idle-timeout.
Sounds like something should fail if the instance does not have a service account. Here are a couple of options that came to mind (obviously you guys know better what's under the hood):
--cloud-permission-set cml cli argument.
If --cloud=gcp but the service account is missing, then the cml runner command should fail.
So the missing service account is one problem, and we seem to have a second problem as well. When I give a service account to the cml runner command, it creates the VM instance correctly with the account, but it does not give it the correct Cloud API access scopes:
When I manually set the API scopes, the idle-timeout deletion worked OK. Sounds like some changes might be needed to the CML instance creation.
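For anyone who wants to check the same thing from inside the VM, here is a rough sketch that inspects the scopes reported by the GCE metadata server. Which scopes are actually required is not stated in this thread; the sketch assumes that either cloud-platform or the read/write compute scope is enough, and the helper name scopes_allow_compute_write is made up for illustration:

```shell
#!/usr/bin/env bash
# Sketch only. Reads a newline-separated list of OAuth scopes on stdin and
# succeeds if one of them should allow Compute Engine write calls.
# Assumption: "cloud-platform" or "compute" (read/write) is sufficient.

scopes_allow_compute_write() {
  grep -qE '^https://www\.googleapis\.com/auth/(cloud-platform|compute)$'
}

# On the instance itself, the attached service account's scopes can be read
# from the metadata server (this fails if no service account is attached):
#   curl -sf -H 'Metadata-Flavor: Google' \
#     'http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/scopes' \
#     | scopes_allow_compute_write && echo "scopes look OK" || echo "scopes missing"
```

The same information is visible from outside the VM with gcloud compute instances describe, under the serviceAccounts field.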
Hmm, it sounds like some documentation clarification might be required?
Under the hood, cml runner adds the GOOGLE_APPLICATION_CREDENTIALS_DATA that cml was invoked with into the systemd service unit, as those should be the credentials used for the creation of the instance and thus should also be used for the teardown of the instance.
The --cloud-permission-set takes (in GCP's case) the service account email to attach to the instance. The intent is for the application or ML model to use it to access other services from the cloud provider, like S3/object storage.
Are you saying it looks like terraform tried to use those (the --cloud-permission-set) creds instead of the original cml runner ones? That is definitely not intended.
This should be easy for me to reproduce and I'll try to get it fixed soon. If you are on Discord and willing to test out a patch, I can let you know when I have something working (dabarnes on Discord).
Can you share more of the yml from gitlab-ci?
I think I misunderstood some of your last reply.
Can you share the permissions list or GCP managed role for the service account whose key should be set in the GOOGLE_APPLICATION_CREDENTIALS_DATA environment variable when you invoke your cml runner command, and the same for the one you are trying to set with the --cloud-permission-set?
Also, cat out the /etc/systemd/system/cml.service file, and be sure to check and redact any sensitive values that might be there.
edit:
In the same screen as your first screenshot there's a metadata section. Can you verify that the GOOGLE_APPLICATION_CREDENTIALS_DATA JSON in the startup-script is for the service account you invoked cml runner with?
Can be closed with TPI fix: https://github.com/iterative/terraform-provider-iterative/pull/333
fixed by https://github.com/iterative/terraform-provider-iterative/pull/333; thanks @dacbd
Similar issue to https://github.com/iterative/cml/issues/678
I'm starting a self hosted runner via Gitlab CICD to GCP:
After the timeout the VM instance is not shutting down.
The journalctl --unit cml --no-pager command shows:
The runner picks up a job correctly and the runner deregisters itself from the Gitlab project. The VM instance just does not shut down.
On Azure a similar config worked OK and the instances were shutting down.