swiss-ai-center / a-guide-to-mlops

A simple yet complete guide to MLOps tools and practices - from a conventional way to a modern approach of working with ML projects.
https://mlops.swiss-ai-center.ch
Creative Commons Attribution Share Alike 4.0 International

bug: CML Runner Registration #126

Open leonardcser opened 11 months ago

leonardcser commented 11 months ago

In chapter 15, CML successfully creates the runner on GCP, but the workflow then hangs on the setup-runner step.

Behaviour

  1. The CI/CD workflow starts on GitHub
  2. CML creates the runner on GCP
  3. The setup-runner step hangs while Terraform waits, repeating: level":"info","message":"iterative_cml_runner.runner: Still creating...
  4. After 5-7 minutes, the GCP pod auto-terminates
  5. The GitHub workflow keeps hanging on Terraform at the setup-runner step

Below is the output of the runner pod:

> kubectl logs -f cml-bo4s2uhzqs-2qx6z08y-ig1rgwq0-lg67g

Failed to get unit file state for cml.service: No such file or directory
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 84.5M  100 84.5M    0     0  28.4M      0  0:00:02  0:00:02 --:--:-- 37.8M
bash: line 24: lsof: command not found
{"level":"info","message":"POST /repos/leonardcser/mlops-test/actions/runners/registration-token - 201 in 275ms"}
{"level":"info","message":"GET /repos/leonardcser/mlops-test/actions/runners?per_page=100 - 200 in 215ms"}
{"level":"warn","message":"Github Actions timeout has been updated from 72h to 35 days. Update your workflow accordingly to be able to restart it automatically."}
{"level":"info","message":"Preparing workdir /home/runner..."}
{"level":"info","message":"Launching github runner"}
{"level":"info","message":"Terraform 1.5.4"}
{"level":"info","message":"Plan: 0 to add, 0 to change, 0 to destroy."}
{"level":"info","message":"Apply complete! Resources: 0 added, 0 changed, 0 destroyed."}
{"level":"info","message":"Outputs: 0"}
{"level":"warn","message":"Error connecting to ACPI socket: connect ENOENT /var/run/acpid.socket. The acpid.service helps with instance termination detection."}
{"level":"info","message":"POST /repos/leonardcser/mlops-test/actions/runners/registration-token - 201 in 317ms"}
{"date":"2023-08-03T09:15:06.304Z","level":"info","message":"runner status","repo":"https://github.com/leonardcser/mlops-test","status":"ready"}
{"level":"info","message":"Unregistering runner cml-bo4s2uhzqs-2qx6z08y-ig1rgwq0..."}
{"level":"info","message":"GET /repos/leonardcser/mlops-test/actions/runners?per_page=100 - 200 in 277ms"}
{"level":"info","message":"DELETE /repos/leonardcser/mlops-test/actions/runners/23 - 204 in 360ms"}
{"level":"info","message":"\tSuccess"}
{"level":"info","message":"Waiting 10 seconds to destroy"}

This output is similar to the one reported in this CML issue: https://github.com/iterative/cml/issues/1332
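For anyone triaging similar pod output, here is a minimal sketch of how the log above could be reduced to its key lifecycle events. It is not part of CML; the helper names `parse_log_lines` and `summarize` are hypothetical, and the sample is a shortened copy of the lines shown above. It assumes JSON log entries sit one per line and that anything else (curl progress, shell errors) can be skipped.

```python
import json


def parse_log_lines(text):
    """Yield decoded JSON log entries, skipping lines that are not JSON."""
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # curl progress output, shell errors, etc.
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue  # truncated or malformed entry


def summarize(text):
    """Return the ordered list of notable events found in the log text."""
    events = []
    for entry in parse_log_lines(text):
        message = entry.get("message", "")
        if entry.get("status") == "ready":
            events.append("ready")
        elif message.startswith("Unregistering runner"):
            events.append("unregistering")
        elif entry.get("level") == "warn":
            events.append("warn")
    return events


# Shortened copy of the pod output from this issue.
SAMPLE = """
bash: line 24: lsof: command not found
{"level":"info","message":"Launching github runner"}
{"level":"warn","message":"Error connecting to ACPI socket: connect ENOENT /var/run/acpid.socket. The acpid.service helps with instance termination detection."}
{"date":"2023-08-03T09:15:06.304Z","level":"info","message":"runner status","repo":"https://github.com/leonardcser/mlops-test","status":"ready"}
{"level":"info","message":"Unregistering runner cml-bo4s2uhzqs-2qx6z08y-ig1rgwq0..."}
"""

print(summarize(SAMPLE))
```

On this sample, the summary makes the pattern easier to spot: the runner reports `ready` and the very next notable event is its unregistration, with no job picked up in between.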

ludelafo commented 10 months ago

I can confirm that I have the same issue on my side. I don't know yet why it stopped working, but I'll let you know when I find something.

ludelafo commented 4 months ago

@rmarquis, @leonardcser, I have added a new comment to the CML issue I opened last year about this problem. You can find it here: https://github.com/iterative/cml/issues/1415#issuecomment-1969077905.