iterative / cml.dev

🔗 CML website and documentation
https://cml.dev
Apache License 2.0

Self-hosted workers fail immediately, get marked "offline" in the runners list. #388

Open mikolajpabiszczak opened 2 years ago

mikolajpabiszczak commented 2 years ago

Frankly, I am not sure whether this is an issue on the CML side, but let me describe it.

CML versions tested: 0.11.0 and 0.17.0

Cloud provider: AWS

Remark: the very same workflow worked when I last used it (3 months ago)

  1. Deploy self-hosted runner:

          [...]
          cml runner \
              --cloud=aws \
              --cloud-region=eu-west-1 \
              --cloud-type=g3s.xlarge \
              --cloud-spot \
              --single \
              --cloud-startup-script=$(echo 'echo "$(curl https://github.com/${{ github.actor }}.keys)" >> /home/ubuntu/.ssh/authorized_keys' | base64 -w 0) \
              --labels=debug
          [...]

    This deployment job finishes successfully, but when it finishes, the instance (as checked in the AWS console) has not yet completed its status checks and is still in the Initialisation stage (this was not the case the last time the workflow worked).

  2. The next job (which runs on the self-hosted runner) is closed almost immediately (in 4 s) with the message "The runner has received a shutdown signal", although the instance itself is not terminated: it passes the AWS status checks and keeps running (to clarify: the instance was deployed as single).

One more thing: if I deploy the worker as reusable it will be marked as offline in the list of workers after the job fails and will not be accessible…

I deployed the reusable instance and got logs after failure:

ubuntu@ip-172-31-32-70:~$ journalctl -u cml.service -f
-- Logs begin at Thu 2022-07-21 01:23:30 UTC. --
Jul 22 11:36:49 ip-172-31-32-70 cml.sh[2440]: {"level":"info","message":"Outputs: 0"}
Jul 22 11:36:49 ip-172-31-32-70 cml.sh[2440]: {"level":"info","message":"Connected to acpid service."}
Jul 22 11:37:18 ip-172-31-32-70 cml.sh[2440]: {"date":"2022-07-22T11:37:18.362Z","level":"info","message":"runner status","repo":"https://github.com/xxxx/yyyy","status":"ready"}
Jul 22 11:37:30 ip-172-31-32-70 cml.sh[2440]: {"date":"Fri Jul 22 2022 11:37:30 GMT+0000 (Coordinated Universal Time)","error":{"name":"HttpError","request":{"headers":{"accept":"application/vnd.github.v3+json","authorization":"token [REDACTED]","user-agent":"octokit-rest.js/18.0.0 octokit-core.js/3.6.0 Node.js/16.16.0 (linux; x64)"},"method":"GET","request":{"agent":{}},"url":"https://api.github.com/repos/xxxx/yyyy/actions/runs?status=queued"},"response":{"data":{"documentation_url":"https://docs.github.com/rest/reference/actions#list-workflow-runs-for-a-repository","message":"Resource not accessible by integration"},"headers":{"access-control-allow-origin":"*","access-control-expose-headers":"ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Resource, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, X-GitHub-SSO, X-GitHub-Request-Id, Deprecation, Sunset","connection":"close","content-encoding":"gzip","content-security-policy":"default-src 'none'","content-type":"application/json; charset=utf-8","date":"Fri, 22 Jul 2022 11:37:30 GMT","referrer-policy":"origin-when-cross-origin, strict-origin-when-cross-origin","server":"GitHub.com","strict-transport-security":"max-age=31536000; includeSubdomains; preload","transfer-encoding":"chunked","vary":"Accept-Encoding, Accept, X-Requested-With","x-content-type-options":"nosniff","x-frame-options":"deny","x-github-media-type":"github.v3; format=json","x-github-request-id":"ABF8:0EFF:39DD5D:40D374:62DA8BFA","x-ratelimit-limit":"5000","x-ratelimit-remaining":"4975","x-ratelimit-reset":"1658491535","x-ratelimit-resource":"core","x-ratelimit-used":"25","x-xss-protection":"0"},"status":403,"url":"https://api.github.com/repos/xxxx/yyyy/actions/runs?status=queued"},"status":403},"exception":true,"level":"error","message":"unhandledRejection: Resource not accessible by integration\nHttpError: Resource not 
accessible by integration\n    at /snapshot/cml/node_modules/@octokit/request/dist-node/index.js:86:21\n    at runMicrotasks (<anonymous>)\n    at processTicksAndRejections (node:internal/process/task_queues:96:5)\n    at async Job.doExecute (/snapshot/cml/node_modules/bottleneck/light.js:405:18)","os":{"loadavg":[1.05,0.6,0.24],"uptime":146.55},"process":{"argv":["/usr/bin/cml-internal","/snapshot/cml/bin/cml.js","runner","--name","cml-4l6sv1qiu1","--labels","debug","--idle-timeout","300","--driver","github","--repo","https://github.com/xxxx/yyyy","--token","ghs_wDbiPdDx3S0wjEm4hvt0v0v0v037P54OliM1","--tf-resource","eyJtb2RlIjoibWFuYWdlZCIsInR5cGUiOiJpdGVyYXRpdmVfY21sX3J1bm5lciIsIm5hbWUiOiJydW5uZXIiLCJwcm92aWRlciI6InByb3ZpZGVyW1wicmVnaXN0cnkudGVycmFmb3JtLmlvL2l0ZXJhdGl2ZS9pdGVyYXRpdmVcIl0iLCJpbnN0YW5jZXMiOlt7InByaXZhdGUiOiIiLCJzY2hlbWFfdmVyc2lvbiI6MCwiYXR0cmlidXRlcyI6eyJuYW1lIjoiY21sLTRsNnN2MXFpdTEiLCJsYWJlbHMiOiIiLCJpZGxlX3RpbWVvdXQiOjMwMCwicmVwbyI6IiIsInRva2VuIjoiIiwiZHJpdmVyIjoiIiwiY2xvdWQiOiJhd3MiLCJjdXN0b21fZGF0YSI6IiIsImlkIjoiaXRlcmF0aXZlLTJvNzh2ZXFjOHJrZ2kiLCJpbWFnZSI6IiIsImluc3RhbmNlX2dwdSI6IiIsImluc3RhbmNlX2hkZF9zaXplIjozNSwiaW5zdGFuY2VfaXAiOiIiLCJpbnN0YW5jZV9sYXVuY2hfdGltZSI6IiIsImluc3RhbmNlX3R5cGUiOiIiLCJyZWdpb24iOiJldS13ZXN0LTEiLCJzc2hfbmFtZSI6IiIsInNzaF9wcml2YXRlIjoiIiwic3NoX3B1YmxpYyI6IiIsImF3c19zZWN1cml0eV9ncm91cCI6IiJ9fV19"],"cwd":"/","execPath":"/usr/bin/cml-internal","gid":0,"memoryUsage":{"arrayBuffers":15632910,"external":33348698,"heapTotal":106082304,"heapUsed":75520952,"rss":311275520},"pid":2440,"uid":0,"version":"v16.16.0"},"stack":"HttpError: Resource not accessible by integration\n    at /snapshot/cml/node_modules/@octokit/request/dist-node/index.js:86:21\n    at runMicrotasks (<anonymous>)\n    at processTicksAndRejections (node:internal/process/task_queues:96:5)\n    at async Job.doExecute 
(/snapshot/cml/node_modules/bottleneck/light.js:405:18)","trace":[{"column":21,"file":"/snapshot/cml/node_modules/@octokit/request/dist-node/index.js","function":null,"line":86,"method":null,"native":false},{"column":null,"file":null,"function":"runMicrotasks","line":null,"method":null,"native":false},{"column":5,"file":"node:internal/process/task_queues","function":"processTicksAndRejections","line":96,"method":null,"native":false},{"column":18,"file":"/snapshot/cml/node_modules/bottleneck/light.js","function":"async Job.doExecute","line":405,"method":"doExecute","native":false}]}
Jul 22 11:37:30 ip-172-31-32-70 cml.sh[2440]: {"level":"error","message":"HttpError: Resource not accessible by integration","stack":"Error: HttpError: Resource not accessible by integration\n    at process.<anonymous> (/snapshot/cml/bin/cml/runner.js:333:32)\n    at process.emit (node:events:539:35)\n    at emit (node:internal/process/promises:140:20)\n    at processPromiseRejections (node:internal/process/promises:274:27)\n    at processTicksAndRejections (node:internal/process/task_queues:97:32)","status":"terminated"}
Jul 22 11:37:30 ip-172-31-32-70 cml.sh[2440]: {"level":"info","message":"Unregistering runner cml-4l6sv1qiu1..."}
Jul 22 11:37:30 ip-172-31-32-70 cml.sh[2440]: {"level":"error","message":"\tFailed: Bad request - Runner \"cml-4l6sv1qiu1\" is still running a job\""}
Jul 22 11:37:30 ip-172-31-32-70 cml.sh[2440]: {"level":"info","message":"Waiting 10 seconds to destroy"}
Jul 22 11:37:33 ip-172-31-32-70 systemd[1]: cml.service: Main process exited, code=exited, status=1/FAILURE
Jul 22 11:37:35 ip-172-31-32-70 systemd[1]: cml.service: Failed with result 'exit-code'.
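
The log above mixes one-line JSON entries at different levels, so the relevant failure is easy to miss. A minimal sketch of pulling out only the error messages is shown below. It assumes the log format seen above, where `"level":"error"` is immediately followed by `"message":"..."`, and the function name `error_messages` is hypothetical. It is a crude text filter, not a JSON parser: a message containing escaped quotes is cut at the first one.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read log lines on stdin and print the "message"
# field of every entry whose level is "error".
# Assumption: "level":"error" is directly followed by "message":"...",
# as in the cml.service log lines shown above.
error_messages() {
  grep '"level":"error"' \
    | sed -n 's/.*"level":"error","message":"\([^"]*\)".*/\1/p'
}

# Usage on the runner instance (not executed here):
#   journalctl -u cml.service --no-pager | error_messages
```

For the log in this issue, this would surface the "Resource not accessible by integration" entries without the surrounding request and stack details.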

DavidGOrtega commented 2 years ago

:wave: @mikolajpabiszczak The reason is that the runner has been set to run just one job via the --single parameter. The option you might be looking for is --reuse.

mikolajpabiszczak commented 2 years ago

@DavidGOrtega: I do know that. So let me emphasise this again:

  1. the problem is not about single vs. reusable (I know and understand the difference between those). In both cases the workflow does not work (and it worked 3 months ago). I used reusable only to collect the logs provided and to see whether GitHub sees the runner (it does not: it marks it as offline). In fact all the workflows (using CML) that I tested do not work (but worked 3 months ago)

  2. Moreover, if I use the reusable runner and try to run the failed job again, it does not pick up the already existing runner (because GitHub sees it as offline).

  3. If I use single, the instance does not get terminated after the failure; I have to terminate it manually.

(I added some clarifications in the opening message)

DavidGOrtega commented 2 years ago

@mikolajpabiszczak You have this in your logs:

Jul 22 11:37:30 ip-172-31-32-70 cml.sh[2440]: {"level":"error","message":"HttpError: Resource not accessible by integration","stack":"Error: HttpError: Resource not accessible by integration\n    at process.<anonymous> (/snapshot/cml/bin/cml/runner.js:333:32)\n    at process.emit (node:events:539:35)\n    at emit (node:internal/process/promises:140:20)\n    at processPromiseRejections (node:internal/process/promises:274:27)\n    at processTicksAndRejections (node:internal/process/task_queues:97:32)","status":"terminated"}

Is there something that your token does not have permission to do? The unregistering then cannot happen, because a job is still in progress.

DavidGOrtega commented 2 years ago

Just to be sure and to move one step forward, can you please check your REPO_TOKEN? Does it have all the permissions?

mikolajpabiszczak commented 2 years ago

These have not been changed since the working runs, but I checked them again. We are using a company application, so I checked against this list.

Repository level:

Organisation level:

Additionally, in the repository settings:

dacbd commented 2 years ago

It looks like the app needs an additional scope that it might not have? https://docs.github.com/en/rest/actions/workflow-runs#list-workflow-runs-for-a-repository

@mikolajpabiszczak to confirm that this is an issue with the app-generated token, can you try to curl the endpoint with one of the generated tokens?

curl \
  -H "Accept: application/vnd.github+json" \
  -H "Authorization: token <TOKEN>" \
  https://api.github.com/repos/OWNER/REPO/actions/runs
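
When running the check above, the HTTP status alone is usually enough to tell a permission problem from other failures. The sketch below maps a status code to a short diagnosis. The function name `diagnose_status` and the wording of each diagnosis are assumptions for illustration; only the 403 case ("Resource not accessible by integration") is taken from the logs in this issue.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map the HTTP status returned by the workflow-runs
# endpoint to a short diagnosis. The diagnosis texts are assumptions.
diagnose_status() {
  case "$1" in
    200) echo "ok: token can list workflow runs" ;;
    401) echo "unauthorized: token is invalid or expired" ;;
    403) echo "forbidden: token likely lacks the Actions permission" ;;
    404) echo "not found: check OWNER/REPO or repository access" ;;
    *)   echo "unexpected status: $1" ;;
  esac
}

# Usage (not executed here; needs network access and a real token):
#   status=$(curl -s -o /dev/null -w '%{http_code}' \
#     -H "Accept: application/vnd.github+json" \
#     -H "Authorization: token $TOKEN" \
#     "https://api.github.com/repos/$OWNER/$REPO/actions/runs")
#   diagnose_status "$status"
```

Here `-o /dev/null` discards the response body and `-w '%{http_code}'` prints only the status, so the result can be passed straight to the helper.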

dacbd commented 2 years ago

We might need to update our guide for using a GitHub App?

mikolajpabiszczak commented 2 years ago

Did some tests, indeed the culprit was the lack of sufficient permissions: after adding Read and write permissions for Actions the workflows work again.

Thx for your time and help! And yes, the guide needs an update in this case. ;D

dacbd commented 2 years ago

@mikolajpabiszczak thanks for the report and help, we'll keep this open until we update the docs