jenkinsci / google-compute-engine-plugin

https://plugins.jenkins.io/google-compute-engine/
Apache License 2.0

Launched VMs are shut down during builds #157

Open mattdlh opened 4 years ago

mattdlh commented 4 years ago

We have been using this plugin successfully for some time, but recently VMs have started being shut down mid-build. For example, a job that uses a GCP cloud will start, spin up a VM in GCP, and begin running on the VM; after about 45 minutes to 1 hour the build fails with "Unexpected termination of the channel" and the VM shows as stopping in the GCP console.

Looking at the Stackdriver logs in GCP, it appears that the API call to shut the VM off mid-build comes from Jenkins. In the Jenkins system logs, only the disconnect error itself shows; there is nothing from the GCE plugin about why the VM was terminated. For example:

I/O error in channel jenkins-slave-w8qs5x
java.io.EOFException
    at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2638)
    at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3113)
    at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:853)
    at java.io.ObjectInputStream.<init>(ObjectInputStream.java:349)
    at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:49)
    at hudson.remoting.Command.readFrom(Command.java:140)
    at hudson.remoting.Command.readFrom(Command.java:126)
    at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:36)
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
Caused: java.io.IOException: Unexpected termination of the channel
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:77)

We are using the latest version of all plugins. The following settings are configured for the cloud:

Instance cap: 1
Node Retention Time: 6
Launch Timeout: 600

We are also using an instance template. We have tried tweaking all of the above settings, with the same result.

dmitriykanarskiy commented 4 years ago

Also have this issue, would appreciate any help

stephenashank commented 4 years ago

@mattdlh, @dmitriykanarskiy thanks for reporting this. Can you look at the System logs for Jenkins and tell me if you see anything from this plugin regarding the "node cleanup work" when this happens?

Can you also tell me these configuration settings:

Thanks.

mattdlh commented 4 years ago

@stephenashank Looking through the logs I do not see anything regarding "node cleanup work" around when this happens (or at all). I see no messages from the plugin in the jenkins logs at all around the shutdown event.

One-Shot: unchecked
Windows: no
Preemptible: no

Luschgy commented 4 years ago

I also have this problem; any help is appreciated.

jesusdiez commented 4 years ago

Another affected here... 😞

No node cleanup work on logs either.

We tested different configurations and node retention times, and in each of them we get the same error. We also tried to roll back to the previous plugin version, but we had to update again because the old version has a problem bringing up the new instance: it tries again with a new one, and another one, and another one...

Some logs:

Dec 09, 2019 6:54:06 PM null
FINEST: Instance agent-app-qght92 is running and ready...
Dec 09, 2019 6:54:06 PM null
INFO: Launching instance: agent-app-qght92
Dec 09, 2019 6:54:06 PM null
INFO: bootstrap
Dec 09, 2019 6:54:06 PM null
INFO: Getting keypair...
Dec 09, 2019 6:54:06 PM null
INFO: Using autogenerated keypair
Dec 09, 2019 6:54:06 PM null
INFO: Authenticating as jenkins
Dec 09, 2019 6:54:07 PM null
INFO: Connecting to 35.X.Y.Z on port 22, with timeout 10000.
Dec 09, 2019 6:54:17 PM null
INFO: Failed to connect via ssh: The kexTimeout (10000 ms) expired.
Dec 09, 2019 6:54:17 PM null
INFO: Waiting for SSH to come up. Sleeping 5.
Dec 09, 2019 6:54:22 PM null
INFO: Connecting to 35.X.Y.Z on port 22, with timeout 10000.
Dec 09, 2019 6:54:26 PM null
WARNING: Failed to verify server host key because no host key metadata was available: 404 Not Found
{
 "error": {
  "errors": [
   {
    "domain": "global",
    "reason": "notFound",
    "message": "The resource 'hostkeys/' of type 'Guest Attribute' was not found."
   }
  ],
  "code": 404,
  "message": "The resource 'hostkeys/' of type 'Guest Attribute' was not found."
 }
}

Dec 09, 2019 6:54:26 PM null
INFO: Connected via SSH.
Dec 09, 2019 6:54:26 PM null
INFO: Verifying: java -fullversion
openjdk full version "1.8.0_181-8u181-b13-2~deb9u1-b13"
Dec 09, 2019 6:54:26 PM null
INFO: Copying agent.jar to: /tmp
Dec 09, 2019 6:54:27 PM null
INFO: Launching Jenkins agent via plugin SSH: java -jar /tmp/agent.jar
<===[JENKINS REMOTING CAPACITY]===>Remoting version: 3.36
This is a Unix agent
Evacuated stdout
Instance agent-app-qght92 is preemptive, setting up preemption listener
Preemptive instance, listening to metadata for preemption event
Agent successfully connected and online
ERROR: Connection terminated
java.io.EOFException
    at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2681)
    at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3156)
    at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:862)
    at java.io.ObjectInputStream.<init>(ObjectInputStream.java:358)
    at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:49)
    at hudson.remoting.Command.readFrom(Command.java:140)
    at hudson.remoting.Command.readFrom(Command.java:126)
    at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
Caused: java.io.IOException: Unexpected termination of the channel
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:77)

Update: @craigdbarber

craigdbarber commented 4 years ago

Hi @mattdlh sorry to hear you're encountering issues. Could you please share some more info to help us better diagnose this:

Thanks.

craigdbarber commented 4 years ago

@jesusdiez thanks for sharing the log. Just as an FYI, the log warning: "Failed to verify server host key because no host key metadata" does not indicate a job failure. It's just the plugin letting you know that it can't verify the server's ssh host key coming from this line: https://github.com/jenkinsci/google-compute-engine-plugin/blob/a654acd5bc9d912847462782bcd2f48fd94130cd/src/main/java/com/google/jenkins/plugins/computeengine/ComputeEngineComputerLauncher.java#L429

I've created an issue to help clear this up in the log's message: #168

Similar to the above comment, could you please provide us with some more env information to help us hunt down this issue.

jesusdiez commented 4 years ago

@craigdbarber Thanks for the info! I had assumed as much, but I didn't find a place in the config to disable that host key check. I've updated my original comment with the versions and installed plugins (all of them updated). Let me know if I can help debug this in any way.

jesusdiez commented 4 years ago

@craigdbarber, we've been checking, and the instance also gets stopped when no agent.jar process is running on it, and even when the Jenkins service is stopped on the master host, and also when both are true. We forced every combination to try to discover what is happening.

We suspect that something predefined at instance creation causes the instance to be stopped automatically after around 7 minutes (we have no startup script in our cloud definition).

stackdriver-agent

craigdbarber commented 4 years ago

@jesusdiez could you try rolling back to a previous version to see if the issue is still occurring?

craigdbarber commented 4 years ago

Separately, I'd suggest following this troubleshooting guide on the topic to see if it helps alleviate the issue: https://wiki.jenkins.io/display/JENKINS/Remoting+issue

If one of the steps they recommend does help, please do follow up here with the info.

jesusdiez commented 4 years ago

@jesusdiez could you try rolling back to a previous version to see if the issue is still occurring?

We had to recover an old snapshot of the Jenkins machine. We've been very conservative about updating this specific plugin, as we've suffered other bugs in the past. We were forced to update from 1.0.10 to 4.20 because we did other non-bc (ha!) system plugin updates, after which each execution created agent machines endlessly (something was broken in the master-agent connectivity, so it didn't detect the new instance and created another one). We rolled back to 1.0.10 and the multiple-instance problem happened again, so we had to roll back the rollback.

I'll check the Remoting Issue link you provided, as it looks very related to our scenario.

mattdlh commented 4 years ago

@craigdbarber

GCE Plugin Version: 4.2.0 (we have tried downgrading to 4.1.1, with the same issue).
Jenkins Version: 2.190.3
List of other installed plugins/versions:

plugins,ace-editor,ghprb,antisamy-markup-formatter,branch-api,build-monitor-plugin,build-pipeline-plugin,build-user-vars-plugin,cloudbees-folder,conditional-buildstep,copyartifact,credentials-binding,credentials,git,durable-task,external-monitor-job,gerrit-trigger,gerrit,git-client,git-server,cvs,github-api,github-branch-source,plot,github-oauth,github-organization-folder,github-pullrequest,github,greenballs,handlebars,html5-notifier-plugin,icon-shim,javadoc,jquery-detached,jquery,junit,slack,log-parser,mailer,mapdb-api,matrix-auth,matrix-combinations-parameter,matrix-project,maven-plugin,momentjs,pam-auth,parameterized-trigger,phabricator-plugin,pipeline-build-step,pipeline-input-step,pipeline-rest-api,pipeline-stage-step,pipeline-stage-view,plain-credentials,ldap,promoted-builds,publish-over-ssh,run-condition,sauce-ondemand,scm-api,scm-sync-configuration,scp,script-security,ec2,sloccount,ssh-agent,ssh-credentials,ssh-slaves,statusmonitor,structs,subversion,test-results-analyzer,thinBackup,token-macro,translation,valgrind,windows-slaves,workflow-aggregator,workflow-api,workflow-basic-steps,workflow-cps-global-lib,workflow-cps,workflow-durable-task-step,workflow-job,workflow-multibranch,workflow-scm-step,workflow-step-api,workflow-support,covcomplplot,ant,view-job-filters,jsch,robot,jira,ansicolor,audit-trail,bouncycastle-api,http-post,docker-plugin,blueocean,package-drone,cobertura,build-timeout,node-iterator-api,timestamper,aws-credentials,pipeline-milestone-step,jackson2-api,docker-build-step,pipeline-stage-tags-metadata,blueocean-jwt,pipeline-model-declarative-agent,azure-commons,urltrigger,artifactdeployer,artifact-promotion,favorite,docker-commons,blueocean-web,pipeline-model-api,pipeline-model-extensions,aws-java-sdk,ansible,pipeline-graph-analysis,docker-workflow,metrics,cloud-stats,authentication-tokens,pipeline-github-lib,nexus-artifact-uploader,display-url-api,git-parameter,envinject,build-env-propagator,pipeline-model-definition,azure-credentials,performance,accelerated-build-now-plugin,azure-vm-agents,htmlpublisher,blueocean-jira,blueocean-config,blueocean-i18n,variant,sse-gateway,command-launcher,build-timestamp,rebuild,postbuild-task,blueocean-events,benchmark,blueocean-rest,blueocean-core-js,jdk-tool,envinject-api,pubsub-light,ws-cleanup,blueocean-dashboard,lockable-resources,email-ext,blueocean-bitbucket-pipeline,job-restrictions,docker-java-api,publish-over,blueocean-git-pipeline,jenkins-design-language,file-operations,blueocean-pipeline-scm-api,blueocean-pipeline-editor,mercurial,blueocean-display-url,blueocean-pipeline-api-impl,blueocean-commons,blueocean-autofavorite,blueocean-github-pipeline,blueocean-personalization,blueocean-rest-impl,apache-httpcomponents-client-4-api,cloudbees-bitbucket-branch-source,trilead-api,google-oauth-plugin,scoring-load-balancer,oauth-credentials,google-metadata-plugin,code-coverage-api,blueocean-executor-info,clang-scanbuild,google-storage-plugin,resource-disposer,flexible-publish,any-buildstep,elastic-axis,handy-uri-templates-2-api,google-compute-engine,google-cloudbuild,

fisabelle commented 4 years ago

@craigdbarber

I am getting the same issue. I set up a staging server that was working properly, but when I rebuilt it for production it didn't work. Since I still have the staging server around, I was able to compare the plugin versions. Maybe this can help.

Jenkins ver. 2.208. The unexpected thing is that both are using GCE Plugin Version: 4.2.0.

--- plugins.working     2019-12-14 08:59:13.000000000 -0500
+++ plugins.failing       2019-12-14 08:58:55.000000000 -0500
@@ -1,4 +1,4 @@
-iace-editor    1.1     true
+ace-editor     1.1     true
 ant    1.10    true
 antisamy-markup-formatter      1.6     true
 apache-httpcomponents-client-4-api     4.5.10-2.0      true
@@ -25,46 +25,49 @@
 blueocean-rest-impl    1.21.0  true
 blueocean-web  1.21.0  true
 bouncycastle-api       2.17    true
-branch-api     2.5.4   true
+branch-api     2.5.5   true
+built-on-column        1.1     true
 cisco-spark-notifier   1.1.1   true
-cloudbees-bitbucket-branch-source      2.5.0   true
-cloudbees-folder       6.9     true
-command-launcher       1.3     true
+cloudbees-bitbucket-branch-source      2.6.0   true
+cloudbees-folder       6.10.0  true
+command-launcher       1.4     true
 conditional-buildstep  1.3.6   true
 credentials    2.3.0   true
 credentials-binding    1.20    true
 display-url-api        2.3.2   true
 docker-commons 1.15    true
-docker-workflow        1.20    true
-durable-task   1.30    true
+docker-workflow        1.21    true
+durable-task   1.33    true
 email-ext      2.68    true
+envinject      2.3.0   true
+envinject-api  1.7     true
 external-monitor-job   1.7     true
 favorite       2.3.2   true
-git    3.12.1  true
-git-client     2.9.0   true
-git-server     1.8     true
-github 1.29.4  true
+git    4.0.0   true
+git-client     3.0.0   true
+git-server     1.9     true
+github 1.29.5  true
 github-api     1.95    true
 github-branch-source   2.5.8   true
 google-compute-engine  4.2.0   true
-google-metadata-plugin 0.2     true
+google-metadata-plugin 0.3.1   true
 google-oauth-plugin    1.0.0   true
 google-storage-plugin  1.5.1   true
 handlebars     1.1.1   true
-handy-uri-templates-2-api      2.1.7-1.0       true
+handy-uri-templates-2-api      2.1.8-1.0       true
 htmlpublisher  1.21    true
-jackson2-api   2.10.0  true
+jackson2-api   2.10.1  true
 javadoc        1.5     true
-jdk-tool       1.3     true
+jdk-tool       1.4     true
 jenkins-design-language        1.21.0  true
-jira   3.0.10  true
+jenkins-multijob-plugin        1.32    true
+jira   3.0.11  true
 jquery 1.12.4-1        true
 jquery-detached        1.2.1   true
-jquery-ui      1.0.2   true
 jsch   0.1.55.1        true
 junit  1.28    true
 ldap   1.21    true
-lockable-resources     2.5     true
+lockable-resources     2.7     true
 mailer 1.29    true
 mapdb-api      1.0.9.0 true
 matrix-auth    2.5     true
@@ -74,23 +77,23 @@
 momentjs       1.1.1   true
 monitoring     1.80.0  true
 nodelabelparameter     1.7.2   true
-oauth-credentials      0.3     true
+oauth-credentials      0.4     true
 packer 1.5     true
-pam-auth       1.5.1   true
+pam-auth       1.6     true
 Parameterized-Remote-Trigger   3.1.0   true
-parameterized-trigger  2.35.2  true
+parameterized-trigger  2.36    true
 periodicbackup 1.5     true
-pipeline-build-step    2.9     true
+pipeline-build-step    2.10    true
 pipeline-graph-analysis        1.10    true
 pipeline-input-step    2.11    true
 pipeline-milestone-step        1.3.1   true
-pipeline-model-api     1.3.9   true
+pipeline-model-api     1.5.0   true
 pipeline-model-declarative-agent       1.1.1   true
-pipeline-model-definition      1.3.9   true
-pipeline-model-extensions      1.3.9   true
+pipeline-model-definition      1.5.0   true
+pipeline-model-extensions      1.5.0   true
 pipeline-rest-api      2.12    true
 pipeline-stage-step    2.3     true
-pipeline-stage-tags-metadata   1.3.9   true
+pipeline-stage-tags-metadata   1.5.0   true
 pipeline-stage-view    2.12    true
 pipeline-utility-steps 2.3.1   true
 plain-credentials      1.5     true
@@ -98,7 +101,7 @@
 role-strategy  2.15    true
 run-condition  1.2     true
 scm-api        2.6.3   true
-script-security        1.66    true
+script-security        1.68    true
 sse-gateway    1.20    true
 ssh-credentials        1.18    true
 ssh-slaves     1.31.0  true
@@ -107,18 +110,18 @@
 tap    2.3     true
 test-results-analyzer  0.3.5   true
 timestamper    1.10    true
-token-macro    2.8     true
+token-macro    2.10    true
 trilead-api    1.0.5   true
 variant        1.3     true
 windows-slaves 1.5     true
 workflow-aggregator    2.6     true
-workflow-api   2.37    true
+workflow-api   2.38    true
 workflow-basic-steps   2.18    true
-workflow-cps   2.74    true
+workflow-cps   2.78    true
 workflow-cps-global-lib        2.15    true
-workflow-durable-task-step     2.34    true
-workflow-job   2.35    true
+workflow-durable-task-step     2.35    true
+workflow-job   2.36    true
 workflow-multibranch   2.21    true
 workflow-scm-step      2.9     true
-workflow-step-api      2.20    true
+workflow-step-api      2.21    true
 workflow-support       3.3     true

Another interesting detail is that the request to shut down the VM always seems to occur at the same minute of the hour. In my case: 2:51 PM, 3:51 PM, 4:51 PM. As a result, this is always the time of the failure:

grep "I/O" /var/log/jenkins/jenkins.log-20191214 
2019-12-13 16:51:21.454+0000 [id=1368]  INFO    h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-2bz2rz
2019-12-13 16:51:22.007+0000 [id=1356]  INFO    h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-55ur2e
2019-12-13 16:51:22.572+0000 [id=1178]  INFO    h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-74wjdg
2019-12-13 16:51:23.224+0000 [id=765]   INFO    h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-9j55go
2019-12-13 16:51:23.644+0000 [id=1136]  INFO    h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-au7ons
2019-12-13 16:51:24.357+0000 [id=1267]  INFO    h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-b7ob8k
2019-12-13 16:51:24.921+0000 [id=663]   INFO    h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-kt3wlk
2019-12-13 16:51:25.471+0000 [id=1369]  INFO    h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-lujsg9
2019-12-13 16:51:26.025+0000 [id=610]   INFO    h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-n9qyjv
2019-12-13 16:51:26.563+0000 [id=1226]  INFO    h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-nlgn2n
2019-12-13 17:51:21.434+0000 [id=2105]  INFO    h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-237zal
2019-12-13 19:51:21.477+0000 [id=3599]  INFO    h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-4b0fls
2019-12-13 19:51:22.098+0000 [id=2824]  INFO    h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-eall9c
2019-12-13 19:51:22.680+0000 [id=3427]  INFO    h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-ttun92
2019-12-13 19:51:23.295+0000 [id=3429]  INFO    h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-xcwhql
2019-12-13 20:51:21.557+0000 [id=4758]  INFO    h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-dx6w2m
2019-12-13 20:51:22.174+0000 [id=4783]  INFO    h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-fiwx95
2019-12-13 20:51:22.669+0000 [id=4925]  INFO    h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-hmc0sa
2019-12-13 20:51:23.210+0000 [id=4376]  INFO    h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-k7t5uy
2019-12-13 22:51:21.526+0000 [id=401]   INFO    h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-39ock0
2019-12-13 22:51:22.172+0000 [id=544]   INFO    h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-q646kf
2019-12-13 22:51:22.698+0000 [id=291]   INFO    h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-s4x2hw
2019-12-13 22:51:23.524+0000 [id=490]   INFO    h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-z7jhbb
2019-12-14 07:51:21.547+0000 [id=8514]  INFO    h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-0f9sfy
2019-12-14 07:51:22.045+0000 [id=8615]  INFO    h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-1ig3oi
2019-12-14 07:51:22.544+0000 [id=8654]  INFO    h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-2ozbzq
2019-12-14 07:51:23.232+0000 [id=8728]  INFO    h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-7g0epo

fisabelle commented 4 years ago

For the record, I found the culprit in my installation: the cloud instanceId/jenkins_cloud_id in the config file. It was duplicated! As a result, the working instance (which was idle) ended up cleaning up the apparently orphaned slave instances.

2019-12-14 19:50:58.781+0000 [id=30821] INFO    hudson.model.AsyncPeriodicWork#lambda$doRun$0: Finished PeriodicBackup. 0 ms
2019-12-14 19:51:20.206+0000 [id=22]    INFO    c.g.j.p.c.CleanLostNodesWork#terminateInstance: Remote instance gce1-slave-xpn-rmjstp not found locally, removing it
2019-12-14 19:51:58.780+0000 [id=30829] INFO    hudson.model.AsyncPeriodicWork#lambda$doRun$0: Started PeriodicBackup

So for me, this issue is gone, and my advice is to verify your cloud config XML.
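The two log excerpts above (the hourly "I/O error in channel" lines and the CleanLostNodesWork line) can be checked for the same pattern with a short script. This is a minimal sketch, not part of the plugin: it assumes log lines start with a timestamp in the format shown in those excerpts, and the function name `summarize` and the `SAMPLE_LOG` list are made up for illustration.

```python
import re
from collections import Counter

# Assumption: lines begin with a timestamp such as "2019-12-13 16:51:21.454+0000",
# as in the jenkins.log excerpts quoted in this thread. Only the minute is captured.
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:(\d{2}):\d{2}")
# Agent name reported when a remoting channel dies.
IO_ERROR = re.compile(r"I/O error in channel (\S+)")
# Agent name reported when the plugin's lost-node cleanup removes an instance.
CLEANUP = re.compile(
    r"CleanLostNodesWork#terminateInstance: Remote instance (\S+) not found locally"
)

# Illustrative input copied from the excerpts above.
SAMPLE_LOG = [
    "2019-12-13 16:51:21.454+0000 [id=1368] INFO h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-2bz2rz",
    "2019-12-13 17:51:21.434+0000 [id=2105] INFO h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gce1-slave-xpn-237zal",
    "2019-12-14 19:51:20.206+0000 [id=22] INFO c.g.j.p.c.CleanLostNodesWork#terminateInstance: Remote instance gce1-slave-xpn-rmjstp not found locally, removing it",
]


def summarize(lines):
    """Return (channel errors counted per minute of the hour, agents removed by cleanup)."""
    error_minutes = Counter()
    cleaned_up = []
    for line in lines:
        stamp = TIMESTAMP.match(line)
        if stamp is None:
            continue  # stack trace lines and other formats are skipped
        if IO_ERROR.search(line):
            error_minutes[int(stamp.group(1))] += 1
        removed = CLEANUP.search(line)
        if removed:
            cleaned_up.append(removed.group(1))
    return error_minutes, cleaned_up


if __name__ == "__main__":
    minutes, cleaned = summarize(SAMPLE_LOG)
    for minute, count in minutes.most_common():
        print(f"minute :{minute:02d} -> {count} channel error(s)")
    print("agents removed by cleanup:", cleaned)
```

If nearly all channel errors fall on one minute of the hour, and another controller's log shows CleanLostNodesWork removals at that same minute, that points to a periodic cleanup rather than a network problem. Note that the cleanup lines may appear only in the log of the controller that did the removing, not the one whose build failed.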

Luschgy commented 4 years ago

@fisabelle could you post your config XML for comparison?

hmeerlo commented 4 years ago

Has anyone ever found a solution for this problem? It happens a lot for me; really annoying.

nehaljwani commented 3 years ago

In my case, I had a test instance of Jenkins which was basically a clone and hosted behind a different end point. Although all jobs were disabled on it, the plugin was quite active and reaping off slaves. The comment by @fisabelle solved the mystery, along with https://github.com/jenkinsci/google-compute-engine-plugin/issues/46

robertauer commented 3 years ago

I am currently having the same problem. Reading the posts above I looked into my config.xml and found the following:

  <clouds>
    <com.google.jenkins.plugins.computeengine.ComputeEngineCloud plugin="google-compute-engine@4.3.3">
      ...
      <instanceId>abcdefgh-1234-1234-1234-abcdefghijkl</instanceId>
      ...
          <googleLabels>
            <entry>
              <string>jenkins_cloud_id</string>
              <string>abcdefgh-1234-1234-1234-abcdefghijkl</string>
            </entry>
            <entry>
              <string>jenkins_config_name</string>
              <string>name123</string>
            </entry>
          </googleLabels>
      ...
    </com.google.jenkins.plugins.computeengine.ComputeEngineCloud>
  </clouds>

I guess this is what @fisabelle is talking about. Could you please tell us how you fixed it? Do I remove the <instanceId> part or the <string>jenkins_cloud_id</string> entry part?

EDIT: I removed the <instanceId> line and it fixed the problem for now.

rkirkpat commented 1 year ago

I can confirm @robertauer's solution above.

We were testing a new controller with a backup from our production controllers and were seeing GCP agents, launched by both to run jobs, being killed prematurely. Reviewing the GCP Compute Engine audit logs showed that the "other" Jenkins controller was doing the kills (confirmed by the source IP addresses of the destroy requests).

The solution was to shut down the test controller, edit its config to remove the instanceId line as @robertauer shows in his comment above, then restart the controller. A new instance ID was then generated, and all the Google label entries were updated with this new ID as well. After doing this we had no more conflicts between controllers over agents.

A feature request would then be a button in the web UI to reset this instance id, or at least a warning in the documentation about this.
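The audit-log check described in this comment can be sketched as follows. This is only an illustration under several assumptions: the entries are Compute Engine audit log records exported as JSON (for example with `gcloud logging read ... --format=json`), stop and delete calls carry a `protoPayload.methodName` ending in `compute.instances.stop` or `compute.instances.delete`, and the caller's address is in `protoPayload.requestMetadata.callerIp`. The function name and the sample entries (project, instance names, IPs, service accounts) are hypothetical.

```python
from collections import defaultdict

# Assumption: method names end with these suffixes (e.g. "v1.compute.instances.delete").
STOP_SUFFIXES = ("compute.instances.stop", "compute.instances.delete")

# Hypothetical entries in the shape of exported audit log records.
SAMPLE_ENTRIES = [
    {
        "timestamp": "2019-12-13T16:51:20Z",
        "protoPayload": {
            "methodName": "v1.compute.instances.delete",
            "resourceName": "projects/example/zones/us-central1-a/instances/agent-1",
            "requestMetadata": {"callerIp": "203.0.113.10"},
            "authenticationInfo": {"principalEmail": "jenkins-test@example.iam.gserviceaccount.com"},
        },
    },
    {
        "timestamp": "2019-12-13T16:40:00Z",
        "protoPayload": {
            "methodName": "v1.compute.instances.insert",
            "resourceName": "projects/example/zones/us-central1-a/instances/agent-1",
            "requestMetadata": {"callerIp": "203.0.113.20"},
            "authenticationInfo": {"principalEmail": "jenkins-prod@example.iam.gserviceaccount.com"},
        },
    },
]


def callers_by_instance(entries):
    """Map each stopped or deleted instance to the (caller IP, principal) pairs responsible."""
    result = defaultdict(set)
    for entry in entries:
        payload = entry.get("protoPayload", {})
        if not payload.get("methodName", "").endswith(STOP_SUFFIXES):
            continue  # ignore inserts, gets, and other calls
        caller_ip = payload.get("requestMetadata", {}).get("callerIp", "unknown")
        principal = payload.get("authenticationInfo", {}).get("principalEmail", "unknown")
        result[payload.get("resourceName", "unknown")].add((caller_ip, principal))
    return dict(result)


if __name__ == "__main__":
    for instance, callers in callers_by_instance(SAMPLE_ENTRIES).items():
        for caller_ip, principal in sorted(callers):
            print(f"{instance}: stopped/deleted by {principal} from {caller_ip}")
```

If the caller IP or service account for the delete request belongs to a different controller than the one that launched the agent, the two controllers are probably sharing an instance ID.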

johanblumenberg commented 1 month ago

I encountered the exact same issue today. VMs shut down mid-build, always at the same minute of the hour. I checked the instanceId value in the configuration, and it turns out I had two cloud configurations with the same ID.

Fixing the duplicate ID seems to fix the problem.

The problem started when I made a copy of a cloud configuration. When you choose "Copy Existing Cloud" to create a new cloud configuration, it looks like it copies the entire configuration of the other cloud, including the instanceId value.
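A quick way to look for a shared instanceId, either between two clouds in one config.xml or between the config.xml files of two controllers, is sketched below. It relies only on the element names shown earlier in this thread (`com.google.jenkins.plugins.computeengine.ComputeEngineCloud` and `instanceId`); where `instanceId` sits inside the cloud element is an assumption, so the script searches all descendants. The function names and sample XML are hypothetical. The XML declaration is stripped first because Jenkins may write an XML 1.1 declaration, which Python's parser may reject.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

CLOUD_TAG = "com.google.jenkins.plugins.computeengine.ComputeEngineCloud"
# Leading "<?xml ... ?>" declaration, removed before parsing (see note above).
XML_DECLARATION = re.compile(r"^\s*<\?xml[^>]*\?>")

# Hypothetical config with two clouds that share one ID, as after "Copy Existing Cloud".
SAMPLE_CONFIG = """<?xml version='1.1' encoding='UTF-8'?>
<hudson>
  <clouds>
    <com.google.jenkins.plugins.computeengine.ComputeEngineCloud>
      <instanceId>abcdefgh-1234-1234-1234-abcdefghijkl</instanceId>
    </com.google.jenkins.plugins.computeengine.ComputeEngineCloud>
    <com.google.jenkins.plugins.computeengine.ComputeEngineCloud>
      <instanceId>abcdefgh-1234-1234-1234-abcdefghijkl</instanceId>
    </com.google.jenkins.plugins.computeengine.ComputeEngineCloud>
  </clouds>
</hudson>
"""


def instance_ids(xml_text):
    """Yield the instanceId of every GCE cloud found in one config.xml document."""
    root = ET.fromstring(XML_DECLARATION.sub("", xml_text, count=1))
    for cloud in root.iter(CLOUD_TAG):
        node = next(cloud.iter("instanceId"), None)
        if node is not None and node.text:
            yield node.text.strip()


def find_shared_ids(configs):
    """configs maps a label (e.g. a file path) to config.xml text; return IDs used more than once."""
    seen = defaultdict(list)
    for label, xml_text in configs.items():
        for instance_id in instance_ids(xml_text):
            seen[instance_id].append(label)
    return {iid: labels for iid, labels in seen.items() if len(labels) > 1}


if __name__ == "__main__":
    for iid, labels in find_shared_ids({"config.xml": SAMPLE_CONFIG}).items():
        print(f"instanceId {iid} is used by {len(labels)} clouds: {labels}")
```

To compare real controllers, the dictionary passed to `find_shared_ids` could be built by reading each controller's config.xml, for example with `pathlib.Path(path).read_text()`. The script only reports duplicates; it does not modify any file.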

mtellezj commented 4 weeks ago

@johanblumenberg what Jenkins and GCP plugin versions are you using?

johanblumenberg commented 3 weeks ago

@johanblumenberg what Jenkins and GCP plugin version are you using?

Latest at the time of writing, 4.575.v6969b_7c435eb_, Jenkins version 2.452.3.