jenkinsci / hetzner-cloud-plugin

Hetzner cloud integration for Jenkins
https://plugins.jenkins.io/hetzner-cloud/
Apache License 2.0
24 stars 8 forks source link

HetznerCloud reports Cloud capacity reached in error #21

Closed sandrinr closed 2 years ago

sandrinr commented 2 years ago

Jenkins and plugins versions report

Environment ```text Jenkins: 2.340 OS: Linux - 5.4.0-66-generic --- ace-editor:1.1 analysis-model-api:10.9.4 ansicolor:1.0.1 ant:1.13 antisamy-markup-formatter:2.7 apache-httpcomponents-client-4-api:4.5.13-1.0 authentication-tokens:1.4 blueocean:1.25.3 blueocean-autofavorite:1.2.5 blueocean-bitbucket-pipeline:1.25.3 blueocean-commons:1.25.3 blueocean-config:1.25.3 blueocean-core-js:1.25.3 blueocean-dashboard:1.25.3 blueocean-display-url:2.4.1 blueocean-events:1.25.3 blueocean-git-pipeline:1.25.3 blueocean-github-pipeline:1.25.3 blueocean-i18n:1.25.3 blueocean-jira:1.25.3 blueocean-jwt:1.25.3 blueocean-personalization:1.25.3 blueocean-pipeline-api-impl:1.25.3 blueocean-pipeline-editor:1.25.3 blueocean-pipeline-scm-api:1.25.3 blueocean-rest:1.25.3 blueocean-rest-impl:1.25.3 blueocean-web:1.25.3 bootstrap4-api:4.6.0-3 bootstrap5-api:5.1.3-6 bouncycastle-api:2.25 branch-api:2.1044.v2c007e51b_87f build-blocker-plugin:1.7.8 build-monitor-plugin:1.13+build.202203221632 build-timeout:1.20 caffeine-api:2.9.2-29.v717aac953ff3 categorized-view:1.12 checks-api:1.7.2 claim:2.18.2 cloud-stats:0.27 cloudbees-bitbucket-branch-source:757.vddedc5f2589a_ cloudbees-folder:6.714.v79e858ef76a_2 cobertura:1.17 code-coverage-api:2.0.4 command-launcher:1.6 conditional-buildstep:1.4.2 config-file-provider:3.9.0 copyartifact:1.46.3 credentials:1087.v16065d268466 credentials-binding:1.27.1 dark-theme:156.v6cf16af6f9ef dashboard-view:2.19 data-tables-api:1.11.4-4 display-url-api:2.3.6 docker-commons:1.19 docker-compose-build-step:1.0 docker-custom-build-environment:1.7.3 docker-java-api:3.1.5.2 docker-plugin:1.2.6 docker-slaves:1.0.7 docker-workflow:1.28 downstream-build-cache:1.7 durable-task:495.v29cd95ec10f2 echarts-api:5.3.0-2 email-ext:2.87 envinject-api:1.180.v98d833b_27470 extended-read-permission:3.2 external-monitor-job:191.v363d0d1efdf8 extra-columns:1.25 favorite:2.4.1 font-awesome-api:6.0.0-1 forensics-api:1.12.0 git:4.11.0 git-client:3.11.0 git-server:1.10 github:1.34.3 github-api:1.301-378.v9807bd746da5 github-branch-source:1598.v91207e9f9b_4a_ gitlab-api:1.0.6 gitlab-branch-source:1.5.9 gitlab-plugin:1.5.29 gradle:1.38 greenballs:1.15.1 handlebars:3.0.8 handy-uri-templates-2-api:2.1.8-1.0 heavy-job:1.1 hetzner-cloud:34.vd59db8bd50f7 htmlpublisher:1.29 jackson2-api:2.13.2-260.v43d711474c77 jacoco:3.3.1 javadoc:217.v905b_86277a_2a_ javax-activation-api:1.2.0-2 javax-mail-api:1.6.2-5 jaxb:2.3.0.1 jdk-tool:1.5 jenkins-design-language:1.25.3 jersey2-api:2.35-4 jira:3.7 jjwt-api:0.11.2-9.c8b45b8bb173 jnr-posix-api:3.1.7-3 job-import-plugin:3.5 jobConfigHistory:1119.v509e1017356b_ jquery:1.12.4-1 jquery-detached:1.2.1 jquery3-api:3.6.0-2 jsch:0.1.55.2 junit:1.56 ldap:2.8 lockable-resources:2.14 mailer:408.vd726a_1130320 mapdb-api:1.0.9.0 matrix-auth:3.1 matrix-project:758.v7a_ea_491852f3 matrix-reloaded:1.1.3 maven-plugin:3.18 metrics:4.1.6.1 momentjs:1.1.1 naginator:1.18.1 nodelabelparameter:1.10.3 okhttp-api:4.9.3-105.vb96869f8ac3a pam-auth:1.7 parameterized-scheduler:1.0 parameterized-trigger:2.44 pipeline-build-step:2.16 pipeline-graph-analysis:188.v3a01e7973f2c pipeline-input-step:446.vf27b_0b_83500e pipeline-milestone-step:100.v60a_03cd446e1 pipeline-model-api:2.2064.v5eef7d0982b_e pipeline-model-definition:2.2064.v5eef7d0982b_e pipeline-model-extensions:2.2064.v5eef7d0982b_e pipeline-stage-step:291.vf0a8a7aeeb50 pipeline-stage-tags-metadata:2.2064.v5eef7d0982b_e pipeline-utility-steps:2.12.0 plain-credentials:1.8 plot:2.1.10 plugin-util-api:2.15.0 popper-api:1.16.1-2 popper2-api:2.11.4-1 postbuildscript:3.1.0-375.v3db_cd92485e1 prism-api:1.26.0-2 promoted-builds:873.v6149db_d64130 publish-over:0.22 pubsub-light:1.16 python:1.3 resource-disposer:0.18 run-condition:1.5 scm-api:595.vd5a_df5eb_0e39 script-security:1145.vb_cf6cf6ed960 scriptler:3.4 slack:608.v19e3b_44b_b_9ff snakeyaml-api:1.29.1 sse-gateway:1.25 ssh-agent:1.24.1 ssh-credentials:1.19 ssh-slaves:1.806.v2253cedd3295 sshd:3.1.0 structs:308.v852b473a2b8c subversion:2.15.3 text-finder:1.18 theme-manager:0.6 throttle-concurrents:2.7 timestamper:1.17 token-macro:285.vff7645a_56ff0 translation:1.16 trilead-api:1.0.13 variant:1.4 view-job-filters:2.3 warnings-ng:9.11.1 windows-slaves:1.8 workflow-api:1143.v2d42f1e9dea_5 workflow-basic-steps:941.vdfe1b_a_132c64 workflow-cps:2683.vd0a_8f6a_1c263 workflow-cps-global-lib:564.ve62a_4eb_b_e039 workflow-durable-task-step:1128.v8c259d125340 workflow-job:1174.vdcb_d054cf74a_ workflow-multibranch:711.vdfef37cda_816 workflow-scm-step:2.13 workflow-step-api:622.vb_8e7c15b_c95a_ workflow-support:815.vd60466279fc8 ws-cleanup:0.41 yet-another-build-visualizer:1.15 ```

What Operating System are you using (both controller, and any agents involved in the problem)?

Controller: Ubuntu 20.04 Nodes: Ubuntu 20.04

Reproduction steps

  1. Set maximum cloud nodes in /configureClouds/ Instance Cap to 10
  2. Let the Hetzner Cloud Plugin run
  3. Enjoy automatic scaling of Jenkins nodes
  4. At some point the Jenkins cloud console starts reporting Invalid API response : 412 and fails to start runners
  5. Looking at the Jenkins log one finds WARNING hudson.slaves.NodeProvisioner#update: Unexpected exception encountered while provisioning agent hcloud-enj1tr4eywrg3brp and WARNING c.d.j.p.hetzner.HetznerCloud#provision: Cloud capacity reached. Has 10 but want 9 more even though only four nodes are running

Cloud Console:

image

Cloud config:

image

Expected Results

Jenkins spawns nodes until it really reaches the configured limit

Actual Results

Jenkins stops launching new nodes even though the limit would allow it.

Anything else?

No response

rkosegi commented 2 years ago

Thank you for bug report. Can you confirm how many VM instances are actually in your hcloud project (eg. via web console or using CLI)? Also, would it be possible to configure debug logging for logger cloud.dnation.jenkins.plugins.hetzner in your instance and share relevant logs? One more query, how many executors are configured on agent hetzner-cloud-cicd?

Thing is, that I'm not really sure where error is coming from.

sandrinr commented 2 years ago

Thank your for picking this up.

Can you confirm how many VM instances are actually in your hcloud project (eg. via web console or using CLI)?

At that point exactly the 4 instances known to Jenkins were visible in hcloud.

Also, would it be possible to configure debug logging for logger cloud.dnation.jenkins.plugins.hetzner in your instance and share relevant logs?

Activated that now. When experiencing the issue I increased the maximum Instance Cap to get rid of the issue. I lowered it again to the old value. We only hit the issue twice until now. The last time was a month or so ago.

One more query, how many executors are configured on agent hetzner-cloud-cicd

6 per node. I was wondering whether that has any influence on the calculation.

rkosegi commented 2 years ago

Activated that now. When experiencing the issue I increased the maximum Instance Cap to get rid of the issue. I lowered it again to the old value. We only hit the issue twice until now. The last time was a month or so ago.

That's interesting information. Working theory is, that something went wrong in Hetzner cloud itself (like Invalid API response : 412 - which is not documented) and Jenkins just keep provisioning new agents to satisfy excess workload (which counts towards instance cap), but was not able to connect to newly created instances.

6 per node. I was wondering whether that has any influence on the calculation.

Actually it does. More executors per agent means less virtual machines. Looking at the code, and I have a feeling calculation might be wrong. Let me do some tests and possible fix.

To recap, error in report above is actually manifestation of something else and at same time, calculation of remaining capacity might be incorrect. Will keep you posted.

rkosegi commented 2 years ago

@sandrinr I fixed calculation of available capacity and new release should be available soon in update center. If you encounter any issue feel free to open new issue.

sandrinr commented 2 years ago

Thank your for the update!

Yesterday afternoon we again hit the the Invalid API response: 412 error. However, I think this time it was sth different.

2022-04-06 16:28:00.530+0000 [id=66]    INFO    h.s.NodeProvisioner$StandardStrategyImpl#apply: Started provisioning hcloud-2xa63r9g30avnohj from hetzner-cloud-cicd with 6 executors. Remaining excess workload: -5
2022-04-06 16:28:00.871+0000 [id=395687]        INFO    c.d.j.p.h.HetznerCloudResourceManager#searchResourceByLabelExpression: Trying to find single resource for label expression 'type=jenkins-node'
2022-04-06 16:28:09.571+0000 [id=52]    WARNING hudson.slaves.NodeProvisioner#update: Unexpected exception encountered while provisioning agent hcloud-2xa63r9g30avnohj
java.lang.IllegalStateException: Invalid API response : 412
        at com.google.common.base.Preconditions.checkState(Preconditions.java:534)
        at cloud.dnation.jenkins.plugins.hetzner.Helper.assertValidResponse(Helper.java:118)
        at cloud.dnation.jenkins.plugins.hetzner.HetznerCloudResourceManager.createServer(HetznerCloudResourceManager.java:233)
        at cloud.dnation.jenkins.plugins.hetzner.NodeCallable.call(NodeCallable.java:42)
        at cloud.dnation.jenkins.plugins.hetzner.NodeCallable.call(NodeCallable.java:30)
        at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
        at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:80)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)

With errors like these I would expect to find stray Hetzner cloud node lying around. However, until now we did not have a single stray node we had to clean up manually. Somehow Jenkins manages to keep track of them and clean them up even in case of such errors.

rkosegi commented 2 years ago

With errors like these I would expect to find stray Hetzner cloud node lying around. However, until now we did not have a single stray node we had to clean up manually. Somehow Jenkins manages to keep track of them and clean them up even in case of such errors.

This happened during server creation, so it hasn't got chance to create stray server in cloud. Also, there is periodic job (once every 1 hour) that will remove stray servers automatically.

Still wondering how to provoke Hetzner cloud to produce this HTTP412. Maybe only way to found out is to start jenkins with this system property -Dcloud.dnation.jenkins.plugins.hetzner.client-debug=true which would put Hetzner API requests and responses into log. Let me think about it, there should be easier way to perform debug

rkosegi commented 2 years ago

With release 39, API requests/responses can be seen in log recorder if you configure ALL level for logger cloud.dnation.jenkins.plugins.hetzner.client

If you encounter HTTP412 response, you can search for string <-- 412 which should be followed by response body (in json format). That might contain additional details about what went wrong

sandrinr commented 2 years ago

Today in the morning, we got the 412 response again. I was able to capture the error:

--> POST https://api.hetzner.cloud/v1/servers
Content-Type: application/json; charset=UTF-8
Content-Length: 276
Authorization: ██
User-Agent: Jenkins Hetzner Plugin

{"name":"██","location":"fsn1","server_type":"cpx41","start_after_create":true,"image":"██","labels":{"jenkins.io/managed-by":"hetzner-jenkins-plugin","jenkins.io/cloud-name":"hetzner-cloud-cicd"},"ssh_keys":["██"]}

--> END POST (276-byte body)
<-- 412 https://api.hetzner.cloud/v1/servers (219ms)
date: Wed, 13 Apr 2022 06:30:58 GMT
content-type: application/json
content-length: 118
ratelimit-limit: 3600
ratelimit-remaining: 3598
ratelimit-reset: 1649831460
x-correlation-id: ██
strict-transport-security: max-age=15724800; includeSubDomains
access-control-allow-origin: *
access-control-allow-credentials: true

{
  "error": {
    "message": "error during placement",
    "code": "resource_unavailable",
    "details": null
  }
}

<-- END HTTP (118-byte body)

Meanwhile the error in the main Jenkins log was:

2022-04-13 06:30:46.533+0000 [id=66]    INFO    c.d.j.p.hetzner.HetznerCloud#provision: Creating new agent with 6 executors, have 0 running VMs
2022-04-13 06:30:46.534+0000 [id=66]    INFO    h.s.NodeProvisioner$StandardStrategyImpl#apply: Started provisioning  from hetzner-cloud-cicd with 6 executors. Remaining excess workload: -5
2022-04-13 06:30:46.917+0000 [id=177990]        INFO    c.d.j.p.h.HetznerCloudResourceManager#searchResourceByLabelExpression: Trying to find single resource for label expression 'type=jenkins-node'
2022-04-13 06:30:56.307+0000 [id=29]    WARNING hudson.slaves.NodeProvisioner#update: Unexpected exception encountered while provisioning agent ██
java.lang.IllegalStateException: Invalid API response : 412
        at com.google.common.base.Preconditions.checkState(Preconditions.java:534)
        at cloud.dnation.jenkins.plugins.hetzner.Helper.assertValidResponse(Helper.java:118)
        at cloud.dnation.jenkins.plugins.hetzner.HetznerCloudResourceManager.createServer(HetznerCloudResourceManager.java:233)
        at cloud.dnation.jenkins.plugins.hetzner.NodeCallable.call(NodeCallable.java:42)
        at cloud.dnation.jenkins.plugins.hetzner.NodeCallable.call(NodeCallable.java:30)
        at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
        at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:80)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)

I think this is different from my initial report. Here, it really seems to be a Hetzner issue. However, when I do hcloud server create, which I guess does the same API call, with the same token then it works. However, this could be coincidence.

rkosegi commented 2 years ago

{ "error": { "message": "error during placement", "code": "resource_unavailable", "details": null } }

That confirms issue at Hetzner side. I suspect that there are no servers available at given datacenter. Similar issue reported here

sandrinr commented 2 years ago

Should anybody else stumble upon this, we solved the underlying issue by configuring multiple server templates in Jenkins' Cloud Configuration for Hetzner. One template for each location we want to support. The templates all spawn nodes with the same executor labels. I don't know how exactly the templates get selected this case but we observed downtimes due to resource placement issues anymore since that change Jenkins was always able to create resources using another template should one fail.