Open mattdlh opened 4 years ago
Also have this issue, would appreciate any help
@mattdlh, @dmitriykanarskiy thanks for reporting this. Can you look at the System logs for Jenkins and tell me if you see anything from this plugin regarding the "node cleanup work" when this happens?
Can you also tell me these configuration settings:
Thanks.
@stephenashank Looking through the logs I do not see anything regarding "node cleanup work" around when this happens (or at all). I see no messages from the plugin in the jenkins logs at all around the shutdown event.
One-Shot: unchecked, Windows: no, Preemptible: no
I also have this Problem, any help is appreciated
Another affected here... 😞
No `node cleanup work` in the logs either.
We tested different configurations and node retention times, and we get the same error with each of them. We also tried rolling back to the previous plugin version, but we had to update again because the old version has a problem bringing up the new instance: it tries again with a new one, and another one, and another one...
Some logs:
Dec 09, 2019 6:54:06 PM null
FINEST: Instance agent-app-qght92 is running and ready...
Dec 09, 2019 6:54:06 PM null
INFO: Launching instance: agent-app-qght92
Dec 09, 2019 6:54:06 PM null
INFO: bootstrap
Dec 09, 2019 6:54:06 PM null
INFO: Getting keypair...
Dec 09, 2019 6:54:06 PM null
INFO: Using autogenerated keypair
Dec 09, 2019 6:54:06 PM null
INFO: Authenticating as jenkins
Dec 09, 2019 6:54:07 PM null
INFO: Connecting to 35.X.Y.Z on port 22, with timeout 10000.
Dec 09, 2019 6:54:17 PM null
INFO: Failed to connect via ssh: The kexTimeout (10000 ms) expired.
Dec 09, 2019 6:54:17 PM null
INFO: Waiting for SSH to come up. Sleeping 5.
Dec 09, 2019 6:54:22 PM null
INFO: Connecting to 35.X.Y.Z on port 22, with timeout 10000.
Dec 09, 2019 6:54:26 PM null
WARNING: Failed to verify server host key because no host key metadata was available: 404 Not Found
{
"error": {
"errors": [
{
"domain": "global",
"reason": "notFound",
"message": "The resource 'hostkeys/' of type 'Guest Attribute' was not found."
}
],
"code": 404,
"message": "The resource 'hostkeys/' of type 'Guest Attribute' was not found."
}
}
Dec 09, 2019 6:54:26 PM null
INFO: Connected via SSH.
Dec 09, 2019 6:54:26 PM null
INFO: Verifying: java -fullversion
openjdk full version "1.8.0_181-8u181-b13-2~deb9u1-b13"
Dec 09, 2019 6:54:26 PM null
INFO: Copying agent.jar to: /tmp
Dec 09, 2019 6:54:27 PM null
INFO: Launching Jenkins agent via plugin SSH: java -jar /tmp/agent.jar
<===[JENKINS REMOTING CAPACITY]===>Remoting version: 3.36
This is a Unix agent
Evacuated stdout
Instance agent-app-qght92 is preemptive, setting up preemption listener
Preemptive instance, listening to metadata for preemption event
Agent successfully connected and online
ERROR: Connection terminated
java.io.EOFException
at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2681)
at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3156)
at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:862)
at java.io.ObjectInputStream.<init>(ObjectInputStream.java:358)
at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:49)
at hudson.remoting.Command.readFrom(Command.java:140)
at hudson.remoting.Command.readFrom(Command.java:126)
at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
Caused: java.io.IOException: Unexpected termination of the channel
at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:77)
Update: @craigdbarber
$ find /var/lib/jenkins/plugins -maxdepth 1 -type d -printf '%f,' | sort
plugins,jdk-tool,pam-auth,pipeline-graph-analysis,google-oauth-plugin,ws-cleanup,google-login,jsch,gcm-notification,google-cloud-health-check,credentials,pipeline-stage-tags-metadata,instant-messaging,google-compute-engine,pipeline-model-declarative-agent,pipeline-model-extensions,plain-credentials,kubernetes-client-api,ldap,credentials-binding,docker-plugin,google-source-plugin,docker-java-api,github-pullrequest,build-timeout,googleanalytics,windows-slaves,cloudbees-bitbucket-branch-source,docker-workflow,antisamy-markup-formatter,blueocean-rest-impl,jclouds-jenkins,groovy,jackson2-api,ansible,mapdb-api,pipeline-rest-api,multibranch-build-strategy-extension,token-macro,blueocean-core-js,blueocean-events,blueocean,simple-theme-plugin,github-api,git-client,blueocean-config,blueocean-jwt,google-api-client-plugin,htmlpublisher,config-file-provider,workflow-scm-step,subversion,ssh-slaves,pipeline-stage-step,timestamper,resource-disposer,google-storage-plugin,blueocean-bitbucket-pipeline,javadoc,blueocean-web,blueocean-pipeline-api-impl,ssh-steps,handlebars,workflow-job,lockable-resources,blueocean-rest,blueocean-autofavorite,google-analytics-usage-reporter,workflow-basic-steps,pubsub-light,workflow-cps-global-lib,jquery-detached,pipeline-build-step,icon-shim,durable-task,momentjs,blueocean-dashboard,cvs,apache-httpcomponents-client-4-api,workflow-aggregator,google-play-android-publisher,kubernetes-credentials,pipeline-model-api,junit,github,external-monitor-job,trilead-api,ignore-committer-strategy,google-deployment-manager,blueocean-display-url,google-container-registry-auth,sse-gateway,branch-api,blueocean-executor-info,chrome-frame-plugin,kubernetes,blueocean-git-pipeline,oauth-credentials,blueocean-personalization,slack,git-server,email-ext,greenballs,docker-commons,pipeline-input-step,blueocean-jira,build-user-vars-plugin,maven-plugin,script-security,pipeline-stage-view,pipeline-github-lib,github-organization-folder,variant,mercurial,workflow-multibranch,ssh-creden
tials,github-scm-trait-commit-skip,blueocean-github-pipeline,workflow-cps,structs,jquery-ui,ace-editor,matrix-auth,pipeline-model-definition,google-cloud-backup,handy-uri-templates-2-api,gradle,blueocean-pipeline-scm-api,github-branch-source,google-git-notes-publisher,pipeline-milestone-step,bouncycastle-api,favorite,ssh-agent,gcal,blueocean-i18n,workflow-support,workflow-step-api,scm-api,jira,display-url-api,jquery,matrix-project,blueocean-commons,command-launcher,cloudbees-folder,authentication-tokens,mailer,git,ant,workflow-durable-task-step,workflow-api,blueocean-pipeline-editor,google-metadata-plugin,jenkins-design-language
If you've a better method to obtain all the plugins+versions, I'll be glad to use it to provide it to you! Thx
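One possible way to get names and versions together is the Jenkins remote API endpoint `/pluginManager/api/json?depth=1`, which reports `shortName` and `version` for each installed plugin. The sketch below only formats such a payload; the base URL, user, and API token passed to `fetch_plugins` are placeholders, and the sample payload is made up for illustration.

```python
import base64
import json
import urllib.request


def fetch_plugins(base_url, user, token):
    """Fetch the plugin list from a Jenkins controller (placeholder credentials)."""
    url = base_url.rstrip("/") + "/pluginManager/api/json?depth=1"
    auth = base64.b64encode(f"{user}:{token}".encode()).decode()
    request = urllib.request.Request(url, headers={"Authorization": "Basic " + auth})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def format_plugins(payload):
    """Return sorted 'shortName version' lines from the API payload."""
    plugins = payload.get("plugins", [])
    return sorted(f"{p['shortName']} {p['version']}" for p in plugins)


# Hypothetical sample payload, trimmed to the two fields used above.
sample = json.loads(
    '{"plugins": ['
    '{"shortName": "google-compute-engine", "version": "4.2.0"},'
    '{"shortName": "ace-editor", "version": "1.1"}]}'
)
for line in format_plugins(sample):
    print(line)
```

Against a real controller this would be called as `format_plugins(fetch_plugins(url, user, token))`.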
Hi @mattdlh sorry to hear you're encountering issues. Could you please share some more info to help us better diagnose this:
Thanks.
@jesusdiez thanks for sharing the log. Just as an FYI, the log warning: "Failed to verify server host key because no host key metadata" does not indicate a job failure. It's just the plugin letting you know that it can't verify the server's ssh host key coming from this line: https://github.com/jenkinsci/google-compute-engine-plugin/blob/a654acd5bc9d912847462782bcd2f48fd94130cd/src/main/java/com/google/jenkins/plugins/computeengine/ComputeEngineComputerLauncher.java#L429
I've created an issue to help clear this up in the log's message: #168
Similar to the above comment, could you please provide us with some more env information to help us hunt down this issue.
@craigdbarber Thanks for the info! I already supposed that but didn't find a place in the config to disable that host key check. I've updated my original comment with the versions and installed plugins (all of them updated). Let me know if I can help debugging that in any way.
@craigdbarber, we've been checking, and the instance also gets stopped when no `agent.jar` process is running on it, or even when the Jenkins service is stopped on the master host, and also when both are true: we forced every scenario while trying to discover what is happening.
We suspect that something predefined at instance creation makes it stop automatically after around 7 minutes (we have no startup script in our cloud definition).
@jesusdiez could you try rolling back to a previous version to see if the issue is still occurring?
Separately, I'd suggest following this troubleshooting guide on the topic to see if it helps alleviate the issue: https://wiki.jenkins.io/display/JENKINS/Remoting+issue If one of the steps they recommend does help, please do follow up here with the info.
> @jesusdiez could you try rolling back to a previous version to see if the issue is still occurring?
We had to recover an old snapshot of the Jenkins machine. We've been very conservative about updating this specific plugin, as we've suffered other bugs in the past. We were forced to update from 1.0.10 to 4.20 because we did another non-bc (ha!) update of system plugins and each execution was creating infinite agent machines (something was broken in the master-agent connectivity, so it didn't detect the new instance and created a new one). We rolled back to 1.0.10 and the multiple-instance problem happened again, so we had to roll back the rollback.
I'll check the Remoting Issue link you provided, as it looks very related to our scenario.
@craigdbarber
GCE Plugin Version: 4.2.0 (we have also tried downgrading to 4.1.1, with the same issue)
Jenkins Version: 2.190.3
List of other installed plugins/versions:
plugins,ace-editor,ghprb,antisamy-markup-formatter,branch-api,build-monitor-plugin,build-pipeline-plugin,build-user-vars-plugin,cloudbees-folder,conditional-buildstep,copyartifact,credentials-binding,credentials,git,durable-task,external-monitor-job,gerrit-trigger,gerrit,git-client,git-server,cvs,github-api,github-branch-source,plot,github-oauth,github-organization-folder,github-pullrequest,github,greenballs,handlebars,html5-notifier-plugin,icon-shim,javadoc,jquery-detached,jquery,junit,slack,log-parser,mailer,mapdb-api,matrix-auth,matrix-combinations-parameter,matrix-project,maven-plugin,momentjs,pam-auth,parameterized-trigger,phabricator-plugin,pipeline-build-step,pipeline-input-step,pipeline-rest-api,pipeline-stage-step,pipeline-stage-view,plain-credentials,ldap,promoted-builds,publish-over-ssh,run-condition,sauce-ondemand,scm-api,scm-sync-configuration,scp,script-security,ec2,sloccount,ssh-agent,ssh-credentials,ssh-slaves,statusmonitor,structs,subversion,test-results-analyzer,thinBackup,token-macro,translation,valgrind,windows-slaves,workflow-aggregator,workflow-api,workflow-basic-steps,workflow-cps-global-lib,workflow-cps,workflow-durable-task-step,workflow-job,workflow-multibranch,workflow-scm-step,workflow-step-api,workflow-support,covcomplplot,ant,view-job-filters,jsch,robot,jira,ansicolor,audit-trail,bouncycastle-api,http-post,docker-plugin,blueocean,package-drone,cobertura,build-timeout,node-iterator-api,timestamper,aws-credentials,pipeline-milestone-step,jackson2-api,docker-build-step,pipeline-stage-tags-metadata,blueocean-jwt,pipeline-model-declarative-agent,azure-commons,urltrigger,artifactdeployer,artifact-promotion,favorite,docker-commons,blueocean-web,pipeline-model-api,pipeline-model-extensions,aws-java-sdk,ansible,pipeline-graph-analysis,docker-workflow,metrics,cloud-stats,authentication-tokens,pipeline-github-lib,nexus-artifact-uploader,display-url-api,git-parameter,envinject,build-env-propagator,pipeline-model-definition,azure-credentials,perform
ance,accelerated-build-now-plugin,azure-vm-agents,htmlpublisher,blueocean-jira,blueocean-config,blueocean-i18n,variant,sse-gateway,command-launcher,build-timestamp,rebuild,postbuild-task,blueocean-events,benchmark,blueocean-rest,blueocean-core-js,jdk-tool,envinject-api,pubsub-light,ws-cleanup,blueocean-dashboard,lockable-resources,email-ext,blueocean-bitbucket-pipeline,job-restrictions,docker-java-api,publish-over,blueocean-git-pipeline,jenkins-design-language,file-operations,blueocean-pipeline-scm-api,blueocean-pipeline-editor,mercurial,blueocean-display-url,blueocean-pipeline-api-impl,blueocean-commons,blueocean-autofavorite,blueocean-github-pipeline,blueocean-personalization,blueocean-rest-impl,apache-httpcomponents-client-4-api,cloudbees-bitbucket-branch-source,trilead-api,google-oauth-plugin,scoring-load-balancer,oauth-credentials,google-metadata-plugin,code-coverage-api,blueocean-executor-info,clang-scanbuild,google-storage-plugin,resource-disposer,flexible-publish,any-buildstep,elastic-axis,handy-uri-templates-2-api,google-compute-engine,google-cloudbuild,
@craigdbarber
I am getting the same issue. I set up a staging server that was working properly, but when I rebuilt it for production it didn't. Since I still have the staging server around, I was able to compare the plugin versions. Maybe this can help.
Jenkins ver. 2.208. The unexpected thing is that both are using GCE Plugin Version 4.2.0.
--- plugins.working 2019-12-14 08:59:13.000000000 -0500
+++ plugins.failing 2019-12-14 08:58:55.000000000 -0500
@@ -1,4 +1,4 @@
-iace-editor 1.1 true
+ace-editor 1.1 true
ant 1.10 true
antisamy-markup-formatter 1.6 true
apache-httpcomponents-client-4-api 4.5.10-2.0 true
@@ -25,46 +25,49 @@
blueocean-rest-impl 1.21.0 true
blueocean-web 1.21.0 true
bouncycastle-api 2.17 true
-branch-api 2.5.4 true
+branch-api 2.5.5 true
+built-on-column 1.1 true
cisco-spark-notifier 1.1.1 true
-cloudbees-bitbucket-branch-source 2.5.0 true
-cloudbees-folder 6.9 true
-command-launcher 1.3 true
+cloudbees-bitbucket-branch-source 2.6.0 true
+cloudbees-folder 6.10.0 true
+command-launcher 1.4 true
conditional-buildstep 1.3.6 true
credentials 2.3.0 true
credentials-binding 1.20 true
display-url-api 2.3.2 true
docker-commons 1.15 true
-docker-workflow 1.20 true
-durable-task 1.30 true
+docker-workflow 1.21 true
+durable-task 1.33 true
email-ext 2.68 true
+envinject 2.3.0 true
+envinject-api 1.7 true
external-monitor-job 1.7 true
favorite 2.3.2 true
-git 3.12.1 true
-git-client 2.9.0 true
-git-server 1.8 true
-github 1.29.4 true
+git 4.0.0 true
+git-client 3.0.0 true
+git-server 1.9 true
+github 1.29.5 true
github-api 1.95 true
github-branch-source 2.5.8 true
google-compute-engine 4.2.0 true
-google-metadata-plugin 0.2 true
+google-metadata-plugin 0.3.1 true
google-oauth-plugin 1.0.0 true
google-storage-plugin 1.5.1 true
handlebars 1.1.1 true
-handy-uri-templates-2-api 2.1.7-1.0 true
+handy-uri-templates-2-api 2.1.8-1.0 true
htmlpublisher 1.21 true
-jackson2-api 2.10.0 true
+jackson2-api 2.10.1 true
javadoc 1.5 true
-jdk-tool 1.3 true
+jdk-tool 1.4 true
jenkins-design-language 1.21.0 true
-jira 3.0.10 true
+jenkins-multijob-plugin 1.32 true
+jira 3.0.11 true
jquery 1.12.4-1 true
jquery-detached 1.2.1 true
-jquery-ui 1.0.2 true
jsch 0.1.55.1 true
junit 1.28 true
ldap 1.21 true
-lockable-resources 2.5 true
+lockable-resources 2.7 true
mailer 1.29 true
mapdb-api 1.0.9.0 true
matrix-auth 2.5 true
@@ -74,23 +77,23 @@
momentjs 1.1.1 true
monitoring 1.80.0 true
nodelabelparameter 1.7.2 true
-oauth-credentials 0.3 true
+oauth-credentials 0.4 true
packer 1.5 true
-pam-auth 1.5.1 true
+pam-auth 1.6 true
Parameterized-Remote-Trigger 3.1.0 true
-parameterized-trigger 2.35.2 true
+parameterized-trigger 2.36 true
periodicbackup 1.5 true
-pipeline-build-step 2.9 true
+pipeline-build-step 2.10 true
pipeline-graph-analysis 1.10 true
pipeline-input-step 2.11 true
pipeline-milestone-step 1.3.1 true
-pipeline-model-api 1.3.9 true
+pipeline-model-api 1.5.0 true
pipeline-model-declarative-agent 1.1.1 true
-pipeline-model-definition 1.3.9 true
-pipeline-model-extensions 1.3.9 true
+pipeline-model-definition 1.5.0 true
+pipeline-model-extensions 1.5.0 true
pipeline-rest-api 2.12 true
pipeline-stage-step 2.3 true
-pipeline-stage-tags-metadata 1.3.9 true
+pipeline-stage-tags-metadata 1.5.0 true
pipeline-stage-view 2.12 true
pipeline-utility-steps 2.3.1 true
plain-credentials 1.5 true
@@ -98,7 +101,7 @@
role-strategy 2.15 true
run-condition 1.2 true
scm-api 2.6.3 true
-script-security 1.66 true
+script-security 1.68 true
sse-gateway 1.20 true
ssh-credentials 1.18 true
ssh-slaves 1.31.0 true
@@ -107,18 +110,18 @@
tap 2.3 true
test-results-analyzer 0.3.5 true
timestamper 1.10 true
-token-macro 2.8 true
+token-macro 2.10 true
trilead-api 1.0.5 true
variant 1.3 true
windows-slaves 1.5 true
workflow-aggregator 2.6 true
-workflow-api 2.37 true
+workflow-api 2.38 true
workflow-basic-steps 2.18 true
-workflow-cps 2.74 true
+workflow-cps 2.78 true
workflow-cps-global-lib 2.15 true
-workflow-durable-task-step 2.34 true
-workflow-job 2.35 true
+workflow-durable-task-step 2.35 true
+workflow-job 2.36 true
workflow-multibranch 2.21 true
workflow-scm-step 2.9 true
-workflow-step-api 2.20 true
+workflow-step-api 2.21 true
workflow-support 3.3 true
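A comparison like the one above can also be summarized programmatically. This is a minimal sketch, assuming each listing uses the same `name version enabled` layout as the diff; the two short listings are excerpts of lines from that diff, and the function names are illustrative.

```python
def parse_listing(text):
    """Map plugin name -> version from 'name version enabled' lines."""
    versions = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            versions[parts[0]] = parts[1]
    return versions


def compare(working, failing):
    """Return (changed versions, plugins only in working, plugins only in failing)."""
    shared = working.keys() & failing.keys()
    changed = {
        name: (working[name], failing[name])
        for name in shared
        if working[name] != failing[name]
    }
    only_working = sorted(working.keys() - failing.keys())
    only_failing = sorted(failing.keys() - working.keys())
    return changed, only_working, only_failing


working = parse_listing(
    "git 3.12.1 true\njquery-ui 1.0.2 true\ngoogle-compute-engine 4.2.0 true"
)
failing = parse_listing(
    "git 4.0.0 true\nenvinject 2.3.0 true\ngoogle-compute-engine 4.2.0 true"
)
changed, only_working, only_failing = compare(working, failing)
print(changed, only_working, only_failing)
```

Plugins with identical versions on both sides, such as `google-compute-engine` here, drop out of the report.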
Another interesting detail is that the request to shut down the VM always seems to occur at the same minute of the hour. In my case: 2:51 PM, 3:51 PM, 4:51 PM. As a result, this is always the time of the failure:
grep "I/O" /var/log/jenkins/jenkins.log-20191214
2019-12-13 16:51:21.454+0000 [id=1368] INFO h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-2bz2rz
2019-12-13 16:51:22.007+0000 [id=1356] INFO h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-55ur2e
2019-12-13 16:51:22.572+0000 [id=1178] INFO h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-74wjdg
2019-12-13 16:51:23.224+0000 [id=765] INFO h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-9j55go
2019-12-13 16:51:23.644+0000 [id=1136] INFO h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-au7ons
2019-12-13 16:51:24.357+0000 [id=1267] INFO h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-b7ob8k
2019-12-13 16:51:24.921+0000 [id=663] INFO h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-kt3wlk
2019-12-13 16:51:25.471+0000 [id=1369] INFO h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-lujsg9
2019-12-13 16:51:26.025+0000 [id=610] INFO h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-n9qyjv
2019-12-13 16:51:26.563+0000 [id=1226] INFO h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-nlgn2n
2019-12-13 17:51:21.434+0000 [id=2105] INFO h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-237zal
2019-12-13 19:51:21.477+0000 [id=3599] INFO h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-4b0fls
2019-12-13 19:51:22.098+0000 [id=2824] INFO h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-eall9c
2019-12-13 19:51:22.680+0000 [id=3427] INFO h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-ttun92
2019-12-13 19:51:23.295+0000 [id=3429] INFO h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-xcwhql
2019-12-13 20:51:21.557+0000 [id=4758] INFO h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-dx6w2m
2019-12-13 20:51:22.174+0000 [id=4783] INFO h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-fiwx95
2019-12-13 20:51:22.669+0000 [id=4925] INFO h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-hmc0sa
2019-12-13 20:51:23.210+0000 [id=4376] INFO h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-k7t5uy
2019-12-13 22:51:21.526+0000 [id=401] INFO h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-39ock0
2019-12-13 22:51:22.172+0000 [id=544] INFO h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-q646kf
2019-12-13 22:51:22.698+0000 [id=291] INFO h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-s4x2hw
2019-12-13 22:51:23.524+0000 [id=490] INFO h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-z7jhbb
2019-12-14 07:51:21.547+0000 [id=8514] INFO h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-0f9sfy
2019-12-14 07:51:22.045+0000 [id=8615] INFO h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-1ig3oi
2019-12-14 07:51:22.544+0000 [id=8654] INFO h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-2ozbzq
2019-12-14 07:51:23.232+0000 [id=8728] INFO h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-7g0epo
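To check whether failures cluster at one minute of the hour in a larger log, the timestamps can be bucketed. This is a rough sketch that assumes the line layout shown in the grep output above; the regular expression and function name are illustrative, and the sample lines are copied from that output.

```python
import re
from collections import Counter

# Matches the timestamp layout seen in the grep output above.
LINE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:(\d{2}):\d{2}\.\d+[+-]\d{4} .*I/O error in channel \S+"
)


def minute_histogram(lines):
    """Count I/O error lines per minute-of-hour."""
    counts = Counter()
    for line in lines:
        match = LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "2019-12-13 16:51:21.454+0000 [id=1368] INFO h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-2bz2rz",
    "2019-12-13 17:51:21.434+0000 [id=2105] INFO h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-237zal",
    "2019-12-14 07:51:21.547+0000 [id=8514] INFO h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-0f9sfy",
]
for minute, count in minute_histogram(sample).most_common():
    print(f"minute :{minute} -> {count} error(s)")
```

A single dominant minute suggests a periodic task rather than random preemption or network failures.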
For the record, I found the culprit in my installation: the cloud instanceId/jenkins_cloud_id in the config file. It was duplicated! As a result, the working instance (which was idle) ended up cleaning up the apparently orphaned slave instances.
2019-12-14 19:50:58.781+0000 [id=30821] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$0: Finished PeriodicBackup. 0 ms
2019-12-14 19:51:20.206+0000 [id=22] INFO c.g.j.p.c.CleanLostNodesWork#terminateInstance: Remote instance gce1-slave-xpn-rmjstp not found locally, removing it
2019-12-14 19:51:58.780+0000 [id=30829] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$0: Started PeriodicBackup
So for me, this issue is gone, and my advice is to verify your cloud config XML.
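Why a duplicated ID leads to agents being removed can be illustrated with a toy model. This is not the plugin's code: it is a simplified sketch inferred from the `CleanLostNodesWork` log message above, assuming cleanup selects remote instances labeled with the controller's cloud ID that the controller does not know locally. All names below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    name: str
    cloud_id: str  # stands in for the jenkins_cloud_id label (assumed model)


def lost_nodes(remote_instances, cloud_id, local_node_names):
    """Instances carrying this cloud ID that the controller does not know locally."""
    return [
        inst.name
        for inst in remote_instances
        if inst.cloud_id == cloud_id and inst.name not in local_node_names
    ]


# Controller A launched one agent; controller B is idle and knows no nodes.
remote = [Instance("agent-a1", "shared-id")]

# Duplicate ID: B sees A's agent as lost and would remove it.
print(lost_nodes(remote, "shared-id", set()))

# Distinct IDs: B ignores instances that belong to A.
print(lost_nodes(remote, "other-id", set()))
```

In this model, controller A never flags its own agent, because `agent-a1` is in its local node set.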
@isabellf could you post your config XML for comparison?
Anyone ever found a solution for this problem? It happens a lot for me, really annoying
In my case, I had a test instance of Jenkins which was basically a clone and hosted behind a different end point. Although all jobs were disabled on it, the plugin was quite active and reaping off slaves. The comment by @fisabelle solved the mystery, along with https://github.com/jenkinsci/google-compute-engine-plugin/issues/46
I am currently having the same problem. Reading the posts above, I looked into my `config.xml` and found the following:
<clouds>
<com.google.jenkins.plugins.computeengine.ComputeEngineCloud plugin="google-compute-engine@4.3.3">
...
<instanceId>abcdefgh-1234-1234-1234-abcdefghijkl</instanceId>
...
<googleLabels>
<entry>
<string>jenkins_cloud_id</string>
<string>abcdefgh-1234-1234-1234-abcdefghijkl</string>
</entry>
<entry>
<string>jenkins_config_name</string>
<string>name123</string>
</entry>
</googleLabels>
...
</com.google.jenkins.plugins.computeengine.ComputeEngineCloud>
</clouds>
I guess this is what @fisabelle is talking about. Could you please tell us how you fixed it? Do I remove the `<instanceId>` part or the `<string>jenkins_cloud_id</string>` entry part?
EDIT: I removed the `<instanceId>` line and it fixed the problem for now.
I can confirm @robertauer's solution above.
We were testing a new controller with a backup from our production controllers, and GCP agents launched by either controller were being killed prematurely while running jobs. Reviewing the GCP Compute Engine audit logs showed that the "other" Jenkins controller was doing the kills (confirmed by the source IP addresses of the destroy requests).
The solution was to shut down the test controller, edit its config to remove the `instanceId` line as @robertauer shows in his comment above, then restart the controller. A new instance ID was then generated, and all the Google label entries were updated with this new ID as well. After doing this we had no more conflicts between controllers over agents.
A feature request would then be a button in the web UI to reset this instance id, or at least a warning in the documentation about this.
I encountered the exact same issue today. VMs shut down mid-build, always at the same minute of the hour. I checked the `instanceId` value in the configuration, and it turns out I had two cloud configurations with the same ID.
Fixing the duplicate ID seems to fix the problem.
The problem started when I made a copy of a cloud configuration. When you choose "Copy Existing Cloud" to create a new cloud configuration, it looks like it copies the entire configuration of the other cloud, including the `instanceId` value.
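A quick check for this condition is to collect every `instanceId` from the cloud definitions and report IDs that appear more than once, whether within one controller or across a production controller and its clone. This is a sketch that assumes the XML layout posted earlier in the thread; the source labels and the sample fragments are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

CLOUD_TAG = "com.google.jenkins.plugins.computeengine.ComputeEngineCloud"


def instance_ids(config_xml):
    """Return the instanceId of every GCE cloud element in one config document."""
    root = ET.fromstring(config_xml)
    ids = []
    for cloud in root.iter(CLOUD_TAG):
        node = cloud.find("instanceId")
        if node is not None and node.text:
            ids.append(node.text.strip())
    return ids


def find_duplicates(configs):
    """configs maps a source label to XML text; returns IDs seen more than once."""
    seen = defaultdict(list)
    for source, xml_text in configs.items():
        for instance_id in instance_ids(xml_text):
            seen[instance_id].append(source)
    return {iid: sources for iid, sources in seen.items() if len(sources) > 1}


# Hypothetical fragments reduced to the elements that matter here.
fragment = (
    "<clouds><{tag}><instanceId>{iid}</instanceId></{tag}></clouds>"
)
configs = {
    "production": fragment.format(tag=CLOUD_TAG, iid="abcdefgh-1234-1234-1234-abcdefghijkl"),
    "test-clone": fragment.format(tag=CLOUD_TAG, iid="abcdefgh-1234-1234-1234-abcdefghijkl"),
}
print(find_duplicates(configs))
```

An empty result means no two cloud definitions in the supplied files share an ID.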
@johanblumenberg what Jenkins and GCP plugin version are you using?
> @johanblumenberg what Jenkins and GCP plugin version are you using?
Latest at the time of writing, `4.575.v6969b_7c435eb_`, Jenkins version `2.452.3`.
We have been using this plugin successfully for some time, but recently VMs started being shut down mid-build. For example, a job that uses a GCP cloud will start, spin up a VM in GCP, and begin running on the VM; after about 45 minutes to 1 hour the build will fail with an `Unexpected termination of the channel` error and the VM will show `stopping` in the GCP console. Looking at the Stackdriver logs in GCP, it appears that the API call to shut the VM off mid-build comes from Jenkins. In the Jenkins system logs, only the disconnect error itself shows, with nothing from the GCE plugin regarding why the VM was terminated. For example:
Using the latest version of all plugins. The following variables are set for the cloud:
- Instance cap: 1
- Node Retention Time: 6
- Launch Timeout: 600

The cloud uses an instance template. I have tried tweaking all of the above settings, with the same result.