jenkinsci / openstack-cloud-plugin

Provision nodes from OpenStack on demand
https://plugins.jenkins.io/openstack-cloud
MIT License

Agents are left offline on Jenkins restart #217

Open balous opened 6 years ago

balous commented 6 years ago

Every time I restart Jenkins (e.g. for an update), the existing agents are not relaunched. I have to delete them or launch them manually.

The agent log contains the following error:

[09/04/18 06:22:30] [SSH] Opening SSH connection to 10.64.120.18:22.
Key exchange was not finished, connection is closed.
java.io.IOException: There was a problem while connecting to 10.64.120.18:22
    at com.trilead.ssh2.Connection.connect(Connection.java:834)
    at hudson.plugins.sshslaves.SSHLauncher.openConnection(SSHLauncher.java:1324)
    at hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:831)
    at hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:820)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Key exchange was not finished, connection is closed.
    at com.trilead.ssh2.transport.KexManager.getOrWaitForConnectionInfo(KexManager.java:95)
    at com.trilead.ssh2.transport.TransportManager.getConnectionInfo(TransportManager.java:237)
    at com.trilead.ssh2.Connection.connect(Connection.java:786)
    ... 7 more
Caused by: java.net.SocketException: Broken pipe (Write failed)
    at java.net.SocketOutputStream.socketWrite0(Native Method)
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
    at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
    at com.trilead.ssh2.crypto.cipher.CipherOutputStream.flush(CipherOutputStream.java:75)
    at com.trilead.ssh2.transport.TransportConnection.sendMessage(TransportConnection.java:193)
    at com.trilead.ssh2.transport.TransportConnection.sendMessage(TransportConnection.java:107)
    at com.trilead.ssh2.transport.TransportManager.sendKexMessage(TransportManager.java:587)
    at com.trilead.ssh2.transport.KexManager.handleMessage(KexManager.java:431)
    at com.trilead.ssh2.transport.TransportManager.receiveLoop(TransportManager.java:790)
    at com.trilead.ssh2.transport.TransportManager$1.run(TransportManager.java:502)
    ... 1 more
[09/04/18 06:27:54] Launch failed - cleaning up connection
[09/04/18 06:27:54] [SSH] Connection closed.
Launcher failed to bring the node online. Retrying ...
Launcher failed to bring the node online within timeout.

I've found a solution to a similar problem that applies to manually created agents. While probably not directly applicable to dynamically created agents, it gives a notion of what the problem could be.
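
For reference, a minimal sketch of the manual workaround, expressed against core Jenkins APIs (`Jenkins.get().getComputers()`, `Computer.connect()`): it simply re-triggers the launcher for agents left offline after a restart. The class name and wrapper are illustrative only, and the code is assumed to run inside the Jenkins JVM (e.g. adapted for the Script Console); it is not part of this plugin.

```java
import hudson.model.Computer;
import hudson.model.Slave;
import jenkins.model.Jenkins;

public class RelaunchOfflineAgents {
    public static void relaunchAll() {
        for (Computer c : Jenkins.get().getComputers()) {
            // Only touch real agents (not the controller) that are currently offline.
            if (c.getNode() instanceof Slave && c.isOffline()) {
                System.out.println("Relaunching " + c.getName());
                // forceReconnect = true, roughly what the "Relaunch agent" button does.
                c.connect(true);
            }
        }
    }
}
```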

olivergondza commented 6 years ago

Am I correct to understand that the issue manifests consistently (no node that has survived the restart is able to autolaunch) and that merely relaunching the agent brings it back online?

Also, the KB article is likely unrelated: it describes a situation where Jenkins refuses to talk to an untrusted host, while here it appears the socket gets closed for some reason.

balous commented 6 years ago

Yes, this happens consistently; however, I can't say it is 100% of cases, as I don't notice the cases where it works. And yes, a mere relaunch fixes it.

olivergondza commented 6 years ago

I noticed there is a 5-minute delay in the SSH connection logs around the exception. I suspect the connection keepalive timeout expires, closing the socket. The question is why the initial connection takes so long.

balous commented 6 years ago

I've tried a Jenkins instance with just two agents: one in OpenStack and one statically configured. A plain Jenkins service restart was sufficient to trigger the problem. But I noticed that both agents suffered from it, so I think we should close this issue, as it is probably not related to the Jenkins OpenStack plugin.

balous commented 6 years ago

Still fiddling with this. After the restart, both agents had problems connecting; the timeout and exception are the same. But I had never noticed the problem on the static agent because of one difference: the static agent's reconnection is attempted again and succeeds on the second try, while the OpenStack agent is just left disconnected.

olivergondza commented 6 years ago

Is the static node configured to reconnect? How many times, and with what delay?

balous commented 6 years ago

Connection Timeout in Seconds is empty, Maximum Number of Retries is 0, and Seconds To Wait Between Retries is 0. I don't see any other options that configure reconnection.

olivergondza commented 6 years ago

In the OpenStack cloud plugin, this is currently hardcoded to 5 retries, 15 seconds apart. We needed some finite numbers to prevent it from hanging forever in case it never succeeds, for instance due to a misconfigured slave. Chances are that is not enough for you, and your static agent, configured for 0 retries (retry forever), succeeds after trying for a longer time. So how long does your static slave have to retry before it succeeds? You should find that in the slave log(s). I speculate the OpenStack nodes might give up too soon for your setup, and in that case I would be interested in why it takes your master so long to make SSH connections...
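
To illustrate the difference, here is a rough sketch of the kind of bounded retry loop described above (the 5-retry / 15-second values are taken from this comment; the class, method, and `launchAttempt` callback are hypothetical, not the plugin's actual code). With the static agent's "0 retries = retry forever" semantics mentioned above, the loop would instead have no upper bound, which may be why it eventually succeeds.

```java
import java.util.concurrent.TimeUnit;

public class BoundedLaunchRetry {
    static final int MAX_RETRIES = 5;          // finite, so a misconfigured agent cannot hang forever
    static final int RETRY_WAIT_SECONDS = 15;  // pause between attempts

    static boolean launchWithRetries(Runnable launchAttempt) throws InterruptedException {
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            try {
                launchAttempt.run();           // e.g. open the SSH connection and start the agent
                return true;                   // node came online within the retry budget
            } catch (RuntimeException e) {
                System.err.println("Launch attempt " + attempt + " failed: " + e.getMessage());
                if (attempt < MAX_RETRIES) {
                    TimeUnit.SECONDS.sleep(RETRY_WAIT_SECONDS);
                }
            }
        }
        return false;                          // "Launcher failed to bring the node online within timeout."
    }
}
```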

balous commented 6 years ago

What slave logs do you mean? /var/log/auth.log?

olivergondza commented 6 years ago

@balous, sorry, I meant JENKINS_HOME/logs/slaves/SLAVE_NAME/slave.log* files.

balous commented 6 years ago

There is just one file, and it contains exactly the same content I've seen in the Jenkins GUI and posted in the opening comment.