jenkinsci / google-compute-engine-plugin

https://plugins.jenkins.io/google-compute-engine/
Apache License 2.0
55 stars 84 forks source link

Jenkins losing connection to GCE VM / GCE VM shutting down #467

Open j-koehler opened 1 month ago

j-koehler commented 1 month ago

Jenkins and plugins versions report

Environment Jenkins: 2.452.1 OS: Linux - 5.10.0-22-amd64 Java: 17.0.11 - Eclipse Adoptium (OpenJDK 64-Bit Server VM) --- PrioritySorter:5.1.0 adoptopenjdk:1.5 amazon-ecs:1.49 ansicolor:1.0.4 ant:497.v94e7d9fffa_b_9 antisamy-markup-formatter:162.v0e6ec0fcfcf6 apache-httpcomponents-client-4-api:4.5.14-208.v438351942757 apache-httpcomponents-client-5-api:5.3.1-1.0 asm-api:9.7-33.v4d23ef79fcc8 authentication-tokens:1.113.v81215a_241826 aws-credentials:231.v08a_59f17d742 aws-java-sdk:1.12.696-451.v0651a_da_9ca_ec aws-java-sdk-api-gateway:1.12.696-451.v0651a_da_9ca_ec aws-java-sdk-autoscaling:1.12.696-451.v0651a_da_9ca_ec aws-java-sdk-cloudformation:1.12.696-451.v0651a_da_9ca_ec aws-java-sdk-cloudfront:1.12.696-451.v0651a_da_9ca_ec aws-java-sdk-codebuild:1.12.696-451.v0651a_da_9ca_ec aws-java-sdk-codedeploy:1.12.696-451.v0651a_da_9ca_ec aws-java-sdk-ec2:1.12.696-451.v0651a_da_9ca_ec aws-java-sdk-ecr:1.12.696-451.v0651a_da_9ca_ec aws-java-sdk-ecs:1.12.696-451.v0651a_da_9ca_ec aws-java-sdk-efs:1.12.696-451.v0651a_da_9ca_ec aws-java-sdk-elasticbeanstalk:1.12.696-451.v0651a_da_9ca_ec aws-java-sdk-elasticloadbalancingv2:1.12.696-451.v0651a_da_9ca_ec aws-java-sdk-iam:1.12.696-451.v0651a_da_9ca_ec aws-java-sdk-kinesis:1.12.696-451.v0651a_da_9ca_ec aws-java-sdk-lambda:1.12.696-451.v0651a_da_9ca_ec aws-java-sdk-logs:1.12.696-451.v0651a_da_9ca_ec aws-java-sdk-minimal:1.12.696-451.v0651a_da_9ca_ec aws-java-sdk-organizations:1.12.696-451.v0651a_da_9ca_ec aws-java-sdk-secretsmanager:1.12.696-451.v0651a_da_9ca_ec aws-java-sdk-sns:1.12.696-451.v0651a_da_9ca_ec aws-java-sdk-sqs:1.12.696-451.v0651a_da_9ca_ec aws-java-sdk-ssm:1.12.696-451.v0651a_da_9ca_ec basic-branch-build-strategies:81.v05e333931c7d blueocean:1.27.12 blueocean-autofavorite:1.2.5 blueocean-bitbucket-pipeline:1.27.12 blueocean-commons:1.27.12 blueocean-config:1.27.12 blueocean-core-js:1.27.12 blueocean-dashboard:1.27.12 blueocean-display-url:2.4.2 blueocean-events:1.27.12 blueocean-git-pipeline:1.27.12 blueocean-github-pipeline:1.27.12 blueocean-i18n:1.27.12 blueocean-jira:1.27.12 blueocean-jwt:1.27.12 blueocean-personalization:1.27.12 blueocean-pipeline-api-impl:1.27.12 blueocean-pipeline-editor:1.27.12 blueocean-pipeline-scm-api:1.27.12 blueocean-rest:1.27.12 blueocean-rest-impl:1.27.12 blueocean-web:1.27.12 bootstrap5-api:5.3.3-1 bouncycastle-api:2.30.1.78.1-233.vfdcdeb_0a_08a_a_ branch-api:2.1169.va_f810c56e895 browserstack-integration:1.2.13 build-name-setter:2.4.2 build-timeout:1.32 caffeine-api:3.1.8-133.v17b_1ff2e0599 checks-api:2.2.0 cloud-stats:336.v788e4055508b_ cloudbees-bitbucket-branch-source:886.v44cf5e4ecec5 cloudbees-disk-usage-simple:203.v3f46a_7462b_1a_ cloudbees-folder:6.928.v7c780211d66e cobertura:1.17 code-coverage-api:4.99.0 command-launcher:107.v773860566e2e commons-httpclient3-api:3.1-3 commons-lang3-api:3.13.0-62.v7d18e55f51e2 commons-text-api:1.11.0-109.vfe16c66636eb_ conditional-buildstep:1.4.3 config-file-provider:973.vb_a_80ecb_9a_4d0 configuration-as-code:1810.v9b_c30a_249a_4c copyartifact:722.v0662a_9b_e22a_c coverage:1.14.0 credentials:1337.v60b_d7b_c7b_c9f credentials-binding:677.vdc9d38cb_254d dashboard-view:2.508.va_74654f026d1 data-tables-api:2.0.7-1 dependency-check-jenkins-plugin:5.5.0 dependency-track:4.3.1 display-url-api:2.204.vf6fddd8a_8b_e9 docker-commons:439.va_3cb_0a_6a_fb_29 docker-compose-build-step:1.0 docker-java-api:3.3.4-86.v39b_a_5ede342c docker-plugin:1.6.1 docker-slaves:1.0.7 docker-workflow:580.vc0c340686b_54 durable-task:555.v6802fe0f0b_82 echarts-api:5.5.0-1 email-ext:1814.v404722f34263 embeddable-build-status:487.va_0ef04c898a_2 envinject:2.908.v66a_774b_31d93 envinject-api:1.199.v3ce31253ed13 extended-read-permission:53.v6499940139e5 external-monitor-job:215.v2e88e894db_f8 favorite:2.208.v91d65b_7792a_c file-operations:214.v2e7dc7f25757 font-awesome-api:6.5.2-1 forensics-api:2.4.0 git:5.2.2 git-client:4.7.0 git-server:126.v0d945d8d2b_39 github:1.39.0 github-api:1.318-461.v7a_c09c9fa_d63 github-branch-source:1789.v5b_0c0cea_18c3 gitlab-api:5.3.0-91.v1f9a_fda_d654f gitlab-branch-source:704.vc7f1202d7e14 gitlab-plugin:1.8.1 google-compute-engine:4.563.vfa_446a_7e00a_d google-oauth-plugin:1.330.vf5e86021cb_ec google-play-android-publisher:4.2 gradle:2.12 gson-api:2.11.0-41.v019fcf6125dc h2-api:11.1.4.199-30.v1c64e772f3a_c handy-uri-templates-2-api:2.1.8-30.v7e777411b_148 htmlpublisher:1.34 http_request:1.18 instance-identity:185.v303dc7c645f9 ionicons-api:74.v93d5eb_813d5f jackson2-api:2.17.0-379.v02de8ec9f64c jacoco:3.3.6 jakarta-activation-api:2.1.3-1 jakarta-mail-api:2.1.3-1 javadoc:243.vb_b_503b_b_45537 javax-activation-api:1.2.0-7 javax-mail-api:1.6.2-10 jaxb:2.3.9-1 jdk-tool:73.vddf737284550 jenkins-design-language:1.27.12 jenkins-jira-plugin:4.0.0 jersey2-api:2.42-147.va_28a_44603b_d5 jira:3.13 jira-ext:114.v7b_8b_1d4274c6 jira-steps:2.0.165.v8846cf59f3db jjwt-api:0.11.5-112.ve82dfb_224b_a_d jnr-posix-api:3.1.19-2 job-dsl:1.87 job-restrictions:0.8 joda-time-api:2.12.7-29.v5a_b_e3a_82269a_ jquery3-api:3.7.1-2 jsch:0.2.16-86.v42e010d9484b_ json-api:20240303-41.v94e11e6de726 json-path-api:2.9.0-58.v62e3e85b_a_655 junit:1265.v65b_14fa_f12f0 ldap:725.v3cb_b_711b_1a_ef lockable-resources:1255.vf48745da_35d0 mailer:472.vf7c289a_4b_420 mask-passwords:173.v6a_077a_291eb_5 matrix-auth:3.2.2 matrix-project:830.v7ea_da_561b_a_34 maven-plugin:3.23 mercurial:1260.vdfb_723cdcc81 metrics:4.2.21-451.vd51df8df52ec mina-sshd-api-common:2.12.1-101.v85b_e08b_780dd mina-sshd-api-core:2.12.1-101.v85b_e08b_780dd next-build-number:1.8 nexus-artifact-uploader:2.14 nexus-task-runner:0.9.2 nodejs:1.6.1 oauth-credentials:0.653.v14cf2088e950 okhttp-api:4.11.0-172.vda_da_1feeb_c6e pam-auth:1.10 parameterized-trigger:806.vf6fff3e28c3e performance:960.v701db_0d66e1b_ pipeline-aws:1.45 pipeline-build-step:540.vb_e8849e1a_b_d8 pipeline-github-lib:61.v629f2cc41d83 pipeline-graph-analysis:216.vfd8b_ece330ca_ pipeline-groovy-lib:710.v4b_94b_077a_808 pipeline-input-step:495.ve9c153f6067b_ pipeline-maven:1416.v54f125b_3e1c7 pipeline-maven-api:1416.v54f125b_3e1c7 pipeline-milestone-step:119.vdfdc43fc3b_9a_ pipeline-model-api:2.2198.v41dd8ef6dd56 pipeline-model-definition:2.2198.v41dd8ef6dd56 pipeline-model-extensions:2.2198.v41dd8ef6dd56 pipeline-rest-api:2.34 pipeline-stage-step:312.v8cd10304c27a_ pipeline-stage-tags-metadata:2.2198.v41dd8ef6dd56 pipeline-stage-view:2.34 pipeline-utility-steps:2.16.2 plain-credentials:182.v468b_97b_9dcb_8 plot:2.1.12 plugin-util-api:4.1.0 prism-api:1.29.0-15 publish-over:0.22 publish-over-ftp:1.17 publish-over-ssh:1.25 pubsub-light:1.18 rebuild:332.va_1ee476d8f6d resource-disposer:0.23 run-condition:1.7 s3:466.vf5b_3db_8e3eb_2 scm-api:690.vfc8b_54395023 script-security:1336.vf33a_a_9863911 snakeyaml-api:2.2-111.vc6598e30cc65 snyk-security-scanner:4.0.3 sonar:2.17.2 sse-gateway:1.26 ssh-agent:367.vf9076cd4ee21 ssh-credentials:337.v395d2403ccd4 ssh-slaves:2.966.vea_5112da_0a_a_1 sshd:3.322.v159e91f6a_550 structs:337.v1b_04ea_4df7c8 swarm:3.46 thinBackup:2.0 timestamper:1.27 token-macro:400.v35420b_922dcb_ trilead-api:2.142.v748523a_76693 variant:60.v7290fc0eb_b_cd view-job-filters:377.v66f4b_796e5fa_ workflow-aggregator:596.v8c21c963d92d workflow-api:1316.v33eb_726c50b_a_ workflow-basic-steps:1058.vcb_fc1e3a_21a_9 workflow-cps:3894.3896.vca_2c931e7935 workflow-durable-task-step:1353.v1891a_b_01da_18 workflow-job:1400.v7fd111b_ec82f workflow-multibranch:783.va_6eb_ef636fb_d workflow-scm-step:427.v4ca_6512e7df1 workflow-step-api:657.v03b_e8115821b_ workflow-support:907.v6713a_ed8a_573 ws-cleanup:0.46

What Operating System are you using (both controller, and any agents involved in the problem)?

Debian 12

Reproduction steps

Use GCP for (oneshot-)agents

Expected Results

Actual Results

[2024-05-28T08:24:30.268Z] Cannot contact gcp-rre-debian12-wrgi83: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@790c3757:gcp-rre-debian12-wrgi83": Remote call on gcp-rre-debian12-wrgi83 failed. The channel is closing down or has closed down
[2024-05-28T08:25:56.170Z] Could not connect to gcp-rre-debian12-wrgi83 to send interrupt signal to process

Anything else?

I'm not sure where to look for this error, not blaming this project by itself but it's very hard to grasp.

We were using Jenkins 2.440.2 with GCE Plugin 4.563.vfa_446a_7e00a_d before without any of these problems. After upgrading to 2.452.1 (including all of the plugins including GCE Plugin (to 4.573.v7dcd6a_37a_ee2) to problems began to start with strange errors like pasted into the actual results.

It did not happen every time, but quite often (maybe 20% yes, 80% no). Rolling back the plugin to 4.563.vfa_446a_7e00a_d (including restarting Jenkins to apply) did not help.

In GCP there are multiple "DELETE" statements for one machine finally resulting in an error but unfortunately I don't know what's "normal" (the usual problem when you look closer at the logs once problems begin to arise):

2024-05-28 10:09:07.764 CEST Compute Engine insert europe-west3-c:gcp-rre-debian12-wrgi83 ...
2024-05-28 10:09:14.732 CEST Compute Engine insert europe-west3-c:gcp-rre-debian12-wrgi83 ...
2024-05-28 10:23:56.154 CEST Compute Engine delete europe-west3-c:gcp-rre-debian12-wrgi83 ...
2024-05-28 10:24:42.072 CEST Compute Engine delete europe-west3-c:gcp-rre-debian12-wrgi83 ...
2024-05-28 10:25:55.799 CEST Compute Engine delete europe-west3-c:gcp-rre-debian12-wrgi83 ...

Are you interested in contributing a fix?

No response

j-koehler commented 2 weeks ago

Since the issue is still around even after:

I checked the GCP logs some more. It looks like for whatever reason GCP is receiving an DELETE for the VM while the job is still running.

Here are the logs for a successful jobs (I added the NOTICE/ERROR depending on the icon)

NOTICE 2024-06-18 08:15:15.251 CEST Compute Engine insert europe-west3-c:gcp-rre-unittest-debian12-di1edw ...
NOTICE 2024-06-18 08:15:21.166 CEST Compute Engine insert europe-west3-c:gcp-rre-unittest-debian12-di1edw ...
NOTICE 2024-06-18 08:34:54.780 CEST Compute Engine delete europe-west3-c:gcp-rre-unittest-debian12-di1edw ...
NOTICE 2024-06-18 08:35:40.626 CEST Compute Engine delete europe-west3-c:gcp-rre-unittest-debian12-di1edw ...

and here the logs for a failing job:

NOTICE 2024-06-18 07:46:45.363 CEST Compute Engine insert europe-west3-c:gcp-rre-unittest-debian12-jkt5ag ...
NOTICE 2024-06-18 07:47:00.887 CEST Compute Engine insert europe-west3-c:gcp-rre-unittest-debian12-jkt5ag ...
NOTICE 2024-06-18 08:04:13.367 CEST Compute Engine delete europe-west3-c:gcp-rre-unittest-debian12-jkt5ag ...
NOTICE 2024-06-18 08:04:59.185 CEST Compute Engine delete europe-west3-c:gcp-rre-unittest-debian12-jkt5ag ...
ERROR 2024-06-18 08:05:02.081 CEST Compute Engine delete europe-west3-c:gcp-rre-unittest-debian12-jkt5ag ...

In Jenkins itself it looks like this (note: times are UTC here):

...
[2024-06-18T06:04:43.784Z] PASS src/view/store/tracking/suspendData/tracking.suspend.data.setSuspendDataIfSectionIsVisited.epic.test.ts
[2024-06-18T06:04:44.494Z] Cannot contact gcp-rre-unittest-debian12-jkt5ag: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@1362ebbd:gcp-rre-unittest-debian12-jkt5ag": Remote call on gcp-rre-unittest-debian12-jkt5ag failed. The channel is closing down or has closed down
[2024-06-18T06:05:02.226Z] Could not connect to gcp-rre-unittest-debian12-jkt5ag to send interrupt signal to process

So from my perspective (without much insight) it looks like the delete at 2024-06-18 08:04:13.367 CEST is causing the trouble. It looks like it sends a delete, GCP is starting to shut down. It loses connection, wants to cleanup (the VMs are configured as "one shot" instances, so it sends another delete at 2024-06-18 08:04:59.185 which then causes the error at 2024-06-18 08:05:02.081 (since the VM is already gone).

There is nothing unusual at the 06:04:13 (aka 2024-06-18 08:04:13.367 CEST in GCP). Just some PASSes, not even a single entry for the exact 06:04:13 second.

I don't know who (which plugin) might cause this. Can this GCE plugin even cause this?