[JENKINS-22932] Jenkins slave cannot reconnect to Master once it has been disconnected unless Jenkins is restarted

Code changed in jenkins
User: Kohsuke Kawaguchi
Path:
src/main/java/org/jenkinsci/remoting/nio/Closeables.java
http://jenkins-ci.org/commit/remoting/4bb086e15c88e2756e6c90987466a8af8c593b75
Log:
JENKINS-22932

shutdownInput/Output is not idempotent, so attempting to reclose a closed socket fails.
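
To illustrate the commit message above: `java.net.Socket.shutdownInput()` and `shutdownOutput()` throw a `SocketException` when the socket is already closed or that half is already shut down. A minimal sketch of a guard that makes repeated calls harmless might look like the following. The class and method names are hypothetical, and this is not the actual `Closeables` code from remoting.

```java
import java.io.IOException;
import java.net.Socket;

// Hypothetical helper for illustration only; not the real Closeables class.
final class IdempotentShutdown {
    private IdempotentShutdown() {}

    /** Shuts down the read half. Returns true only if this call did the work. */
    static boolean shutdownInputOnce(Socket s) throws IOException {
        // Each of these states would make shutdownInput() throw SocketException.
        if (s.isClosed() || !s.isConnected() || s.isInputShutdown()) {
            return false;
        }
        s.shutdownInput();
        return true;
    }

    /** Shuts down the write half. Returns true only if this call did the work. */
    static boolean shutdownOutputOnce(Socket s) throws IOException {
        if (s.isClosed() || !s.isConnected() || s.isOutputShutdown()) {
            return false;
        }
        s.shutdownOutput();
        return true;
    }
}
```

The check-then-act here is not atomic, so code that can race with another closer would likely also need to tolerate the `SocketException`.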

timja commented 10 years ago

Code changed in jenkins
User: Kohsuke Kawaguchi
Path:
src/main/java/org/jenkinsci/remoting/nio/NioChannelHub.java
http://jenkins-ci.org/commit/remoting/4228cf8ad89faba8716b10f381adcdeb1594bf0d
Log:
JENKINS-22932

Don't let a failed SelectorTask kill the selector thread.
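
As a rough sketch of the idea in this commit message (not the real `NioChannelHub` code), a loop that runs queued tasks can catch each task's failure individually so that one bad task does not end the thread. `ResilientTaskLoop`, `Task`, and `drain` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical stand-in for a selector thread's task queue; names are illustrative.
final class ResilientTaskLoop {
    interface Task { void run() throws Exception; }

    private final Queue<Task> tasks = new ConcurrentLinkedQueue<>();
    final List<Throwable> failures = new ArrayList<>();

    void submit(Task t) { tasks.add(t); }

    /** Runs every queued task and returns how many completed normally. */
    int drain() {
        int ok = 0;
        Task t;
        while ((t = tasks.poll()) != null) {
            try {
                t.run();
                ok++;
            } catch (Exception e) {
                // Record the failure and keep going instead of letting it
                // propagate and terminate the calling thread.
                failures.add(e);
            }
        }
        return ok;
    }
}
```

This sketch only catches `Exception`; an `Error` would still propagate out of `drain()`.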

timja commented 10 years ago

Code changed in jenkins
User: Kohsuke Kawaguchi
Path:
src/main/java/org/jenkinsci/remoting/nio/NioChannelHub.java
http://jenkins-ci.org/commit/remoting/23f817832c18cec9abc65363a0261eab3958adaf
Log:
[FIXED JENKINS-22932]

If the thread that serves NioChannelHub.run() leaves for any reason, stop accepting the new connection as the channel will never be serviced.
It is indicative of a problem in the code.

This is the 3rd and the final part of the fix to the problem.

Compare: https://github.com/jenkinsci/remoting/compare/546728f16212...23f817832c18
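
The behaviour described in this commit message can be pictured with a small sketch: the hub remembers whether its `run()` loop is active and, once the loop has exited, refuses new registrations with an `IOException` that carries whatever killed the loop as its cause. `MiniHub` and `register` are hypothetical names, and the real `NioChannelHub` is considerably more involved.

```java
import java.io.IOException;

// Hypothetical miniature of the idea; not the real NioChannelHub.
final class MiniHub implements Runnable {
    private final Runnable body;               // stands in for the selector loop
    private volatile Thread servicingThread;   // non-null only while run() is active
    private volatile Throwable whatKilledIt;   // reason the loop exited, if abnormal

    MiniHub(Runnable body) { this.body = body; }

    @Override
    public void run() {
        servicingThread = Thread.currentThread();
        try {
            body.run();
        } catch (RuntimeException | Error e) {
            whatKilledIt = e;                  // keep it for later diagnostics
            throw e;
        } finally {
            servicingThread = null;            // nobody is servicing channels now
        }
    }

    /** Accepts a new connection only while the loop is running. */
    void register(String name) throws IOException {
        if (servicingThread == null) {
            throw new IOException("hub is not currently running", whatKilledIt);
        }
        // ... a real hub would hand the connection to the selector here ...
    }
}
```

Failing fast like this means a new connection gets an immediate error rather than being accepted and then never serviced, and the nested cause points at the original failure.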

timja commented 10 years ago

kohsuke:

This is a regression in 1.560. Fix will be in 1.568.

timja commented 10 years ago

davidriggleman:

I'm still seeing this problem in 1.568. In my case, the slave nodes are being disconnected due to a ping timeout. Up until recently (not sure exact version but around version 1.560 sounds right), I never had any issues with the slave nodes not connecting. Here's a snippet of the logs. I can provide more info if needed.

Connection #19 failed
java.io.IOException: NioChannelHub is not currently running
at org.jenkinsci.remoting.nio.NioChannelHub$1.makeTransport(NioChannelHub.java:446)
at hudson.remoting.ChannelBuilder.negotiate(ChannelBuilder.java:220)
at hudson.remoting.ChannelBuilder.build(ChannelBuilder.java:149)
at hudson.remoting.ChannelBuilder.build(ChannelBuilder.java:159)
at org.jenkinsci.remoting.nio.NioChannelBuilder.build(NioChannelBuilder.java:36)
at org.jenkinsci.remoting.nio.NioChannelBuilder.build(NioChannelBuilder.java:52)
at jenkins.slaves.JnlpSlaveAgentProtocol$Handler.jnlpConnect(JnlpSlaveAgentProtocol.java:120)
at jenkins.slaves.DefaultJnlpSlaveReceiver.handle(DefaultJnlpSlaveReceiver.java:63)
at jenkins.slaves.JnlpSlaveAgentProtocol2$Handler2.run(JnlpSlaveAgentProtocol2.java:57)
at jenkins.slaves.JnlpSlaveAgentProtocol2.handle(JnlpSlaveAgentProtocol2.java:31)
at hudson.TcpSlaveAgentListener$ConnectionHandler.run(TcpSlaveAgentListener.java:157)

timja commented 10 years ago

mkobler:

David,

You might try upgrading the slaves with the new slave.jar that comes with 1.568.

(I was running 1.561 and seeing the issue, but only with a newer version of the slave.jar. Slaves running an older version did not show the issue).

timja commented 10 years ago

davidriggleman:

Thanks Mike! That apparently was my problem. I updated the slave-agent.jnlp file yesterday and haven't had any issues since. I didn't realize I needed to update the slaves as well as I thought the bug was primarily a server issue.

timja commented 10 years ago

Looks like a regression, or sporadic issue. I'm experiencing this now.

In our environment, ubuntu master, windows slaves.
Jenkins: 1.572 slave.jar version: 2.43 (the version on the master)

java.io.IOException: NioChannelHub is not currently running
at org.jenkinsci.remoting.nio.NioChannelHub$1.makeTransport(NioChannelHub.java:446)
at hudson.remoting.ChannelBuilder.negotiate(ChannelBuilder.java:220)
at hudson.remoting.ChannelBuilder.build(ChannelBuilder.java:149)
at hudson.remoting.ChannelBuilder.build(ChannelBuilder.java:159)
at org.jenkinsci.remoting.nio.NioChannelBuilder.build(NioChannelBuilder.java:36)
at org.jenkinsci.remoting.nio.NioChannelBuilder.build(NioChannelBuilder.java:52)
at jenkins.slaves.JnlpSlaveAgentProtocol$Handler.jnlpConnect(JnlpSlaveAgentProtocol.java:120)
at jenkins.slaves.DefaultJnlpSlaveReceiver.handle(DefaultJnlpSlaveReceiver.java:63)
at jenkins.slaves.JnlpSlaveAgentProtocol2$Handler2.run(JnlpSlaveAgentProtocol2.java:57)
at jenkins.slaves.JnlpSlaveAgentProtocol2.handle(JnlpSlaveAgentProtocol2.java:31)
at hudson.TcpSlaveAgentListener$ConnectionHandler.run(TcpSlaveAgentListener.java:157)

In our environment slaves get disconnected after a suite of tests complete (and revert to a clean vSphere snapshot) and then reconnect.

It runs fine for a while, with many disconnects/reconnects.
Then starts tossing these exceptions and nothing can connect until a restart.

timja commented 10 years ago

I also have the same problem as Patricia with Jenkins 1.571. Slave.jar 2.43. Very frustrating.

timja commented 10 years ago

It ran for four days, slaves successfully disconnecting and reconnecting, then the problem surfaced again at about 5am this morning.

Many jobs were running, and all slaves disconnected at once.
Restarting Jenkins brought everything back online.

timja commented 10 years ago

Wondering if we should reopen? (not sure what the Jenkins JIRA process is)
Anyway, I've restarted my Jenkins too and am hoping for the best. I did (very briefly) look at the source (http://git.io/a4WelA) and am wondering why it bothers to throw an exception there instead of just making a new selector (presumably with an atomic get-or-create or something). However, I'd probably need to look at a lot more code before I can say I know what's going on with this.

timja commented 10 years ago

A few of us still have this issue with very recent Jenkins versions.

timja commented 10 years ago

My feeling is that this happens for us if we power off a JNLP slave in an ungraceful way (e.g. pull the virtual power plug). However, it doesn't seem to happen every time; again, this is similar to Patricia's case. Actually, I'm a bit surprised this doesn't happen at CloudBees since, FWIU, they've got dynamically provisioned VMs too; or maybe they use the LTS?
Anyway, I'll try to grab the Jenkins logs the next time this happens.

timja commented 10 years ago

apham:

I also hit the same problem as Patricia and Kevin with Jenkins 1.571. Slave.jar 2.37. Master on Red Hat Enterprise Linux Server release 6.5 and slave on a Windows 7 VM. Will try to restart Jenkins and see if that helps.

timja commented 10 years ago

jglick:

Are you actually seeing the same bug introduced in 1.560 and purportedly fixed in 1.568, or some other bug with related symptoms that should be filed separately?

timja commented 10 years ago

At first blush the error appeared to be the same because the trace is very similar, the nodes show the same thing and the symptoms are the same; but on closer investigation the root cause looks like it's somewhat different than this.

Specifically:
...omitted for brevity...
Caused by: java.nio.channels.CancelledKeyException
at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
at sun.nio.ch.SelectionKeyImpl.readyOps(SelectionKeyImpl.java:87)
at java.nio.channels.SelectionKey.isReadable(SelectionKey.java:289)
at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:513)
... 6 more
As opposed to the ClosedChannelException.

I'll file another issue.
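
For context on the trace above: `SelectionKey.isReadable()` throws `CancelledKeyException` once the key has been cancelled. A minimal sketch of a defensive check, assuming the caller is happy to treat a cancelled key as "not readable", could look like this. `KeyChecks` is a hypothetical helper and is not taken from the remoting source or from any fix for this issue.

```java
import java.nio.channels.CancelledKeyException;
import java.nio.channels.SelectionKey;

// Hypothetical helper for illustration only.
final class KeyChecks {
    private KeyChecks() {}

    /** Returns false instead of throwing when the key has been cancelled. */
    static boolean isReadable(SelectionKey key) {
        try {
            // isValid() covers the common case; the catch covers a key that is
            // cancelled by another thread between the two calls.
            return key.isValid() && key.isReadable();
        } catch (CancelledKeyException e) {
            return false;
        }
    }
}
```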

timja commented 10 years ago

Actually, I think what Andy, Patricia and I are seeing has a separate root cause and is not a regression of this issue per se. See https://issues.jenkins-ci.org/browse/JENKINS-24050

timja commented 10 years ago

apham:

Here is the stack I currently get.

<===[JENKINS REMOTING CAPACITY]===>Failed to establish the connection with the slave wdctp707
java.io.IOException: NioChannelHub is not currently running
at org.jenkinsci.remoting.nio.NioChannelHub$1.makeTransport(NioChannelHub.java:446)
at hudson.remoting.ChannelBuilder.negotiate(ChannelBuilder.java:220)
at hudson.remoting.ChannelBuilder.build(ChannelBuilder.java:149)
at hudson.remoting.ChannelBuilder.build(ChannelBuilder.java:159)
at org.jenkinsci.remoting.nio.NioChannelBuilder.build(NioChannelBuilder.java:36)
at org.jenkinsci.remoting.nio.NioChannelBuilder.build(NioChannelBuilder.java:52)
at jenkins.slaves.JnlpSlaveAgentProtocol$Handler.jnlpConnect(JnlpSlaveAgentProtocol.java:120)
at jenkins.slaves.DefaultJnlpSlaveReceiver.handle(DefaultJnlpSlaveReceiver.java:63)
at jenkins.slaves.JnlpSlaveAgentProtocol2$Handler2.run(JnlpSlaveAgentProtocol2.java:57)
at jenkins.slaves.JnlpSlaveAgentProtocol2.handle(JnlpSlaveAgentProtocol2.java:31)
at hudson.TcpSlaveAgentListener$ConnectionHandler.run(TcpSlaveAgentListener.java:156)

Trying to connect windows 7 VM slave for the first time. This also appears on the console:

"Ping response time is too long or timed out."

timja commented 10 years ago

@Andy did you happen to see if there were any jobs running at the time that died? If so, did the jobs have a line containing "Caused by: java.nio.channels.ClosedChannelException" (what this issue looks to have fixed) or "Caused by: java.nio.channels.CancelledKeyException" (the issue I raised in ~~JENKINS-24050~~)? Basically, the slave log doesn't really give you enough information to tell these two issues apart.

Additionally did your other JNLP slaves disconnect?

If none of the above are true then I think what you've got might be a different issue than this or ~~JENKINS-24050~~ (and I guess it should be filed separately).

timja commented 10 years ago

apham:

Kevin, I haven't had a chance to catch the issue while a job is running yet, and things still seem to be working since my last Jenkins restart. I'll keep an eye out for it. It could be a different problem, and once I'm able to confirm that I'll log a different defect.

timja commented 10 years ago

phiche:

I have hit this issue multiple times in the last few days using the swarm plugin on Jenkins ver 1.586.

The Jenkins master is a brand new server running RHEL 6.5. The slave is also RHEL 6.5. No jobs have previously run on it. The only thing I'm testing is the connection of new slaves. This is a stochastic issue, however. Sometimes it works fine, and sometimes it results in this error (I guess maybe 30% of the time).

JNLP agent connected from /172.31.8.131
<===[JENKINS REMOTING CAPACITY]===>Failed to establish the connection with the slave dev-master.phil
java.io.IOException: NioChannelHub is not currently running
at org.jenkinsci.remoting.nio.NioChannelHub$1.makeTransport(NioChannelHub.java:479)
at hudson.remoting.ChannelBuilder.negotiate(ChannelBuilder.java:220)
at hudson.remoting.ChannelBuilder.build(ChannelBuilder.java:149)
at hudson.remoting.ChannelBuilder.build(ChannelBuilder.java:159)
at org.jenkinsci.remoting.nio.NioChannelBuilder.build(NioChannelBuilder.java:36)
at org.jenkinsci.remoting.nio.NioChannelBuilder.build(NioChannelBuilder.java:52)
at jenkins.slaves.JnlpSlaveAgentProtocol$Handler.jnlpConnect(JnlpSlaveAgentProtocol.java:120)
at jenkins.slaves.DefaultJnlpSlaveReceiver.handle(DefaultJnlpSlaveReceiver.java:63)
at jenkins.slaves.JnlpSlaveAgentProtocol2$Handler2.run(JnlpSlaveAgentProtocol2.java:57)
at jenkins.slaves.JnlpSlaveAgentProtocol2.handle(JnlpSlaveAgentProtocol2.java:31)
at hudson.TcpSlaveAgentListener$ConnectionHandler.run(TcpSlaveAgentListener.java:156)

timja commented 10 years ago

I've seen this happen even after recent fixes.

Jobs were running on the slaves at the time.
One thing I noticed - linux slaves didn't disconnect, only windows slaves.

This was too big of a problem, I went back to LTS.

timja commented 10 years ago

phiche:

I downgraded to the LTS Jenkins ver. 1.580.1 today and just hit the same problem.

timja commented 9 years ago

guillaume31:

Same behavior between slaves and master running both on Windows 7sp1 / 2008R2 with Jenkins 1.592

timja commented 9 years ago

redoranges:

I'm seeing the same, master is 2008R2, slave is 7sp1
Jenkins ver. 1.580.2

timja commented 9 years ago

Even after running on LTS I still see this every week.

timja commented 9 years ago

bcygan:

Server: 1.590 with swarm client plugin 1.15

Client: described problem occurs with swarm-client-1.20-jar-with-dependencies.jar, but not with swarm-client-1.15-jar-with-dependencies.jar

timja commented 9 years ago

kerrhome:

Same issue. Jenkins 1.574. Server Host is Ubuntu 12.04. Slave is Windows 7 x64 VM.

timja commented 9 years ago

hanabishi:

Have the same problem with Jenkins LTS 1.580.3. In our case the nodes go offline a few hours after restarting the master server, and it's not all nodes, just a few each time (different nodes each time).

The server is running on Ubuntu 14.04 and the slaves are running Windows 7 x64

Connection was broken
java.io.EOFException
at org.jenkinsci.remoting.nio.NioChannelHub$3.run(NioChannelHub.java:616)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:111)
at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

timja commented 9 years ago

kohsuke:

For exceptions that say "NioChannelHub is not currently running", we are expecting a nested exception. Please attach the full stack trace including all the "Caused by ..." sections, not just the top-most part of it.

timja commented 9 years ago