jenkinsci / openstack-cloud-plugin

Provision nodes from OpenStack on demand
https://plugins.jenkins.io/openstack-cloud
MIT License

Agents are left offline on Jenkins restart #217

Open balous opened 6 years ago

balous commented 6 years ago

Every time I restart Jenkins (e.g. for an update), the existing agents are not relaunched. I have to delete them or launch them manually.

The agent log contains the following error:

[09/04/18 06:22:30] [SSH] Opening SSH connection to 10.64.120.18:22.
Key exchange was not finished, connection is closed.
java.io.IOException: There was a problem while connecting to 10.64.120.18:22
    at com.trilead.ssh2.Connection.connect(Connection.java:834)
    at hudson.plugins.sshslaves.SSHLauncher.openConnection(SSHLauncher.java:1324)
    at hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:831)
    at hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:820)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Key exchange was not finished, connection is closed.
    at com.trilead.ssh2.transport.KexManager.getOrWaitForConnectionInfo(KexManager.java:95)
    at com.trilead.ssh2.transport.TransportManager.getConnectionInfo(TransportManager.java:237)
    at com.trilead.ssh2.Connection.connect(Connection.java:786)
    ... 7 more
Caused by: java.net.SocketException: Broken pipe (Write failed)
    at java.net.SocketOutputStream.socketWrite0(Native Method)
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
    at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
    at com.trilead.ssh2.crypto.cipher.CipherOutputStream.flush(CipherOutputStream.java:75)
    at com.trilead.ssh2.transport.TransportConnection.sendMessage(TransportConnection.java:193)
    at com.trilead.ssh2.transport.TransportConnection.sendMessage(TransportConnection.java:107)
    at com.trilead.ssh2.transport.TransportManager.sendKexMessage(TransportManager.java:587)
    at com.trilead.ssh2.transport.KexManager.handleMessage(KexManager.java:431)
    at com.trilead.ssh2.transport.TransportManager.receiveLoop(TransportManager.java:790)
    at com.trilead.ssh2.transport.TransportManager$1.run(TransportManager.java:502)
    ... 1 more
[09/04/18 06:27:54] Launch failed - cleaning up connection
[09/04/18 06:27:54] [SSH] Connection closed.
Launcher failed to bring the node online. Retrying ...
Launcher failed to bring the node online within timeout.

I've found a solution to a similar problem that applies to manually created agents. While probably not directly applicable to dynamically created agents, it gives a notion of what the problem could be.
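
For reference, a minimal sketch of the manual workaround, expressed against core Jenkins APIs (`Jenkins.get().getComputers()`, `Computer.connect()`): it simply re-triggers the launcher for agents left offline after a restart. The class name and wrapper are illustrative only, and the code is assumed to run inside the Jenkins JVM (e.g. adapted for the Script Console); it is not part of this plugin.

```java
import hudson.model.Computer;
import hudson.model.Slave;
import jenkins.model.Jenkins;

public class RelaunchOfflineAgents {
    public static void relaunchAll() {
        for (Computer c : Jenkins.get().getComputers()) {
            // Only touch real agents (not the controller) that are currently offline.
            if (c.getNode() instanceof Slave && c.isOffline()) {
                System.out.println("Relaunching " + c.getName());
                // forceReconnect = true, roughly what the "Relaunch agent" button does.
                c.connect(true);
            }
        }
    }
}
```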

olivergondza commented 6 years ago

Am I correct to understand that the issue manifests consistently (no node that has survived the restart is able to autolaunch) and that merely relaunching the agent brings it back online?

Also, the KB article is likely unrelated: it describes a situation where Jenkins refuses to talk to an untrusted host, while here it appears the socket gets closed for some reason.

balous commented 6 years ago

Yes, this happens consistently; however, I can't say it is 100% of cases, as I don't notice the cases where it works. And yes, a mere relaunch fixes it.

olivergondza commented 6 years ago

I noticed there is a 5-minute delay in the SSH connection logs around the exception. I suspect the connection keepalive timeout expires, closing the socket. The question is why the initial connection takes so long.

balous commented 6 years ago

I've tried a Jenkins instance with just two agents: one in OpenStack and one statically configured. A plain Jenkins service restart was sufficient to trigger the problem. But I noticed that both agents suffered from it, so I think we should close this issue, as it is probably not related to the Jenkins OpenStack plugin.

balous commented 6 years ago

Still fiddling with this. After the restart, both agents had problems connecting; the timeout and exception are the same. But I had never noticed the problem on the static agent because of one difference: the static agent's reconnection is attempted again and succeeds on the second try, while the OpenStack agent is just left disconnected.

olivergondza commented 6 years ago

Is the static node configured to reconnect? How many times, and with what delay?

balous commented 6 years ago

Connection Timeout in Seconds is empty, Maximum Number of Retries is 0, and Seconds To Wait Between Retries is 0. I don't see any other options that configure reconnection.

olivergondza commented 6 years ago

In the OpenStack cloud plugin, this is currently hardcoded to 5 retries, 15 seconds apart. We needed some finite numbers to prevent it from hanging forever in case it never succeeds, for instance due to a misconfigured slave. Chances are that is not enough for you, and your static agent, configured for 0 retries (retry forever), succeeds after trying for a longer time. So how long does your static slave have to retry before it succeeds? You should find that in the slave log(s). I speculate the OpenStack nodes might give up too soon for your setup, and in that case I would be interested in why it takes your master so long to make SSH connections...
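
To illustrate the difference, here is a rough sketch of the kind of bounded retry loop described above (the 5-retry / 15-second values are taken from this comment; the class, method, and `launchAttempt` callback are hypothetical, not the plugin's actual code). With the static agent's "0 retries = retry forever" semantics mentioned above, the loop would instead have no upper bound, which may be why it eventually succeeds.

```java
import java.util.concurrent.TimeUnit;

public class BoundedLaunchRetry {
    static final int MAX_RETRIES = 5;          // finite, so a misconfigured agent cannot hang forever
    static final int RETRY_WAIT_SECONDS = 15;  // pause between attempts

    static boolean launchWithRetries(Runnable launchAttempt) throws InterruptedException {
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            try {
                launchAttempt.run();           // e.g. open the SSH connection and start the agent
                return true;                   // node came online within the retry budget
            } catch (RuntimeException e) {
                System.err.println("Launch attempt " + attempt + " failed: " + e.getMessage());
                if (attempt < MAX_RETRIES) {
                    TimeUnit.SECONDS.sleep(RETRY_WAIT_SECONDS);
                }
            }
        }
        return false;                          // "Launcher failed to bring the node online within timeout."
    }
}
```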

balous commented 6 years ago

What slave logs do you mean? /var/log/auth.log?

olivergondza commented 6 years ago

@balous, sorry, I meant JENKINS_HOME/logs/slaves/SLAVE_NAME/slave.log* files.

balous commented 6 years ago

There is just one file, and it contains exactly the same content I've seen in the Jenkins GUI and posted in the opening comment.