jenkinsci / google-compute-engine-plugin

https://plugins.jenkins.io/google-compute-engine/
Apache License 2.0

Listener for preemption event stops working after a while #172

Open androa opened 4 years ago

androa commented 4 years ago

We are using preemptible nodes. We are experiencing that Jenkins abruptly loses the connection to a node when it is preempted.

We are also seeing many messages like this:

Jan 09, 2020 11:19:42 AM WARNING com.google.jenkins.plugins.computeengine.ComputeEngineComputer getPreempted
Error when getting preempted status
Also:   hudson.remoting.Channel$CallSiteStackTrace: Remote call to jenkins-highcpu-cihwen
        at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1743)
        at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:357)
        at hudson.remoting.Channel.call(Channel.java:957)
        at com.google.jenkins.plugins.computeengine.ComputeEngineComputer.getPreemptedStatus(ComputeEngineComputer.java:62)
        at com.google.jenkins.plugins.computeengine.ComputeEngineComputer.lambda$onConnected$0(ComputeEngineComputer.java:55)
        at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
        at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
        at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
java.net.SocketException: Connection reset
    at java.base/java.net.SocketInputStream.read(SocketInputStream.java:186)
    at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
    at java.base/java.io.BufferedInputStream.fill(BufferedInputStream.java:252)
    at java.base/java.io.BufferedInputStream.read1(BufferedInputStream.java:292)
    at java.base/java.io.BufferedInputStream.read(BufferedInputStream.java:351)
    at java.base/sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:746)
    at java.base/sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:689)
    at java.base/sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:717)
    at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1610)
    at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1515)
    at java.base/java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:527)
    at com.google.api.client.http.javanet.NetHttpResponse.<init>(NetHttpResponse.java:37)
    at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:94)
    at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:972)
    at com.google.jenkins.plugins.computeengine.PreemptedCheckCallable.call(PreemptedCheckCallable.java:74)
    at com.google.jenkins.plugins.computeengine.PreemptedCheckCallable.call(PreemptedCheckCallable.java:34)
    at hudson.remoting.UserRequest.perform(UserRequest.java:212)
    at hudson.remoting.UserRequest.perform(UserRequest.java:54)
    at hudson.remoting.Request$2.run(Request.java:369)
    at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
Caused: java.lang.RuntimeException
    at com.google.jenkins.plugins.computeengine.ComputeEngineComputer.getPreemptedStatus(ComputeEngineComputer.java:71)
    at com.google.jenkins.plugins.computeengine.ComputeEngineComputer.lambda$onConnected$0(ComputeEngineComputer.java:55)
    at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
    at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
    at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused: java.util.concurrent.ExecutionException
    at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395)
    at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1999)
    at com.google.jenkins.plugins.computeengine.ComputeEngineComputer.getPreempted(ComputeEngineComputer.java:102)
    at com.google.jenkins.plugins.computeengine.ComputeEngineRetentionStrategy.wasPreempted(ComputeEngineRetentionStrategy.java:113)
    at com.google.jenkins.plugins.computeengine.ComputeEngineRetentionStrategy.taskCompleted(ComputeEngineRetentionStrategy.java:83)
    at hudson.slaves.SlaveComputer.taskCompleted(SlaveComputer.java:355)
    at hudson.model.queue.WorkUnitContext.synchronizeEnd(WorkUnitContext.java:140)
    at hudson.model.Executor.finish1(Executor.java:477)
    at hudson.model.Executor.run(Executor.java:451)

They seem to repeat every 15 minutes and appear to be triggered by the scheduled retention strategy.

Having followed the code, I believe what's happening is this: something (I'm not sure exactly what) causes the long-polling request towards the metadata server to die with a connection reset.

I've tried various ways of figuring out what causes the connection to be reset, without really finding anything. I've not been able to reproduce it by executing the same call with cURL, either through a Jenkins job or directly in a shell on the machine.
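For reference, the check that keeps failing here is the long poll against the GCE metadata server's `instance/preempted` value, which `PreemptedCheckCallable` runs on the agent. A minimal standalone Java sketch of that request, using plain `HttpURLConnection` for illustration rather than the Google HTTP client the plugin actually uses:

```java
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Scanner;

// Standalone sketch of the long-polling preemption check against the GCE
// metadata server. The endpoint, the wait_for_change parameter, and the
// Metadata-Flavor header are from the GCE metadata documentation.
public class PreemptionPoll {
    static final String METADATA_HOST = "http://metadata.google.internal";

    // Build the poll URL; with wait_for_change=true the server blocks the
    // request until the value changes (this is the long poll that dies).
    static String preemptedUrl(boolean waitForChange) {
        return METADATA_HOST + "/computeMetadata/v1/instance/preempted"
                + (waitForChange ? "?wait_for_change=true" : "");
    }

    // Perform one poll; the metadata server answers TRUE once the instance
    // is preempted. This only works when run on a GCE VM -- elsewhere it
    // fails with a network exception, much like the reset seen above.
    static boolean pollOnce() throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(preemptedUrl(true)).openConnection();
        conn.setRequestProperty("Metadata-Flavor", "Google"); // required header
        try (Scanner s = new Scanner(conn.getInputStream())) {
            return "TRUE".equalsIgnoreCase(s.next());
        }
    }
}
```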

Whatever the cause might turn out to be, I believe this should be handled differently by Jenkins.

I would expect it to either mark the node as dying and clean up, or set up the listener again.
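Whichever way the reset originates, the listener could be re-armed instead of dying on the first failure. A minimal sketch of that idea, with a hypothetical `listen` helper (not the plugin's actual API) that restarts the long poll after an exception and only gives up after a cap of consecutive failures, at which point the caller could mark the node as dead and clean it up:

```java
import java.util.concurrent.Callable;

// Hypothetical re-arming wrapper: keeps restarting a long poll that may die
// with a connection reset, instead of failing once and never listening again.
public class RearmingListener {
    // Runs poll until it returns true (preemption observed) or maxRestarts
    // consecutive failures occur. Returns false in the failure case so the
    // caller can treat the node as dying and clean it up.
    static boolean listen(Callable<Boolean> poll, int maxRestarts) {
        int failures = 0;
        while (failures < maxRestarts) {
            try {
                if (poll.call()) {
                    return true;      // preemption observed
                }
                failures = 0;         // poll completed normally; re-arm it
            } catch (Exception e) {   // e.g. SocketException: Connection reset
                failures++;           // re-arm after a failure, up to the cap
            }
        }
        return false;                 // listener is considered dead
    }
}
```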

hmeerlo commented 4 years ago

OK, I have exactly the same problem. It is very annoying because the logs show that the preemption was detected and claim the job will be aborted, but that never happens. The job stays in the list of active jobs, the queue fills up with pending jobs, and they never get scheduled because there are no free resources. We always have to abort the jobs manually. It also feels like GCE instances are being preempted more often lately, so the problem occurs more frequently every day.

cytar commented 4 years ago

Hello, we have exactly the same problem. Can someone take this into account? Maybe it would be useful to add a customizable timer along the lines of "the node has been gone for X seconds, so let's restart the builds that were attached to it and kill the agent"?

For those interested, here is what I'm doing in Groovy:

```groovy
// RESTART JOB BUILD IF STILL RUNNING AND CONTAINS SLAVE CONNEXION ERROR
def buildingJobs = Jenkins.instance.getAllItems(Job.class).findAll { it.isBuilding() }

buildingJobs.each { job ->
    build = job.getLastBuild()
    if (build.getLog().contains("Cannot contact ")) {
        println('We must reschedule job: ' + job.name)
        println('build ' + build)
        def oldPa = build.getAction(ParametersAction.class)
        def cause = new hudson.model.Cause.UpstreamCause(build)
        def causeAction = new hudson.model.CauseAction(cause)
        job.scheduleBuild2(0, causeAction, oldPa)
        build.doKill()
    }
}

// NOW KILL ALL SLAVES OFFLINE FOR TOO LONG
time = new Date()
epoch_milis = time.getTime()
for (aSlave in hudson.model.Hudson.instance.slaves) {
    ct = aSlave.getComputer().getConnectTime()
    diff_min = (epoch_milis - ct) / 60000
    if (aSlave.getComputer().isOffline() && diff_min >= 5) {
        println("===================================================================")
        println('We must delete (offline for ' + diff_min + ' min): ' + aSlave.name)
        Jenkins.instance.removeNode(aSlave)
    }
}
```

hunter86bg commented 6 months ago

Any idea how to restart the preemption listener? I see this issue is very old, and yet there is no indication it will be fixed.

hunter86bg commented 6 months ago

@androa, @cytar, have you found a fix for this? So far I am just using `journalctl` to identify the problematic nodes, building a list of the workers, and setting the matched ones to `setTemporarilyOffline()`. After that, I clean up any node that is offline and empty (`(node.getComputer().isTemporarilyOffline().toBoolean()) && (node.getComputer().countBusy() == 0)`). This way my builds are not stopped and the schedules are not interrupted.
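The removal condition in that workaround boils down to a small predicate. A generic sketch, where the two parameters merely stand in for Jenkins' `Computer.isTemporarilyOffline()` and `Computer.countBusy()` (this is not the plugin's or Jenkins' actual code):

```java
// Generic sketch of the cleanup rule described above: a node can be removed
// once it has been marked temporarily offline and has no busy executors.
// The parameters stand in for Computer.isTemporarilyOffline() and
// Computer.countBusy() by name only.
public class OfflineCleanup {
    static boolean shouldRemove(boolean temporarilyOffline, int busyExecutors) {
        return temporarilyOffline && busyExecutors == 0;
    }
}
```

Waiting for `countBusy()` to hit zero before removing the node is what keeps running builds from being interrupted.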