Open balous opened 6 years ago
Am I correct to understand that the issue manifests consistently (no node that has survived the restart is able to autolaunch) and that merely relaunching the agent brings it back online?
Also, the KB article is likely unrelated - it describes the situation where Jenkins refuses to talk to an untrusted host, while here it appears the socket gets closed for some reason.
Yes, this happens consistently; however, I can't tell whether it is 100% of all cases, as I wouldn't notice the cases where it works. And yes, a mere relaunch fixes it.
I noticed there is a 5-minute delay in the SSH connection logs around the exception. I suspect the connection keepalive timeout expires, closing the socket. The question is why the initial connection takes so long.
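If the keepalive suspicion is right, one possible mitigation sketch is to raise the client-side keepalive and connect timeouts on the master. These are standard OpenSSH client options (in the SSH user's `~/.ssh/config`); the values below are guesses for illustration, not tested settings:

```
Host *
    # Send a keepalive probe every 60 s; tolerate up to 5 missed replies,
    # so an otherwise idle session survives roughly 5 minutes of silence.
    ServerAliveInterval 60
    ServerAliveCountMax 5
    # Allow a slow initial TCP/SSH handshake up to 300 s.
    ConnectTimeout 300
```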
Compare `/proc/sys/kernel/random/entropy_avail` content at the time of the incident with the normal state to confirm. Alternatively, try restarting Jenkins with a different number of OS slaves to autolaunch, verifying that it is their quantity that causes/contributes to the issue.

I've tried Jenkins with just two agents: one in OpenStack and one statically configured. A mere jenkins service restart was sufficient to trigger the problem. But I noticed that both agents suffered from it, so I think we should close this issue as it is probably not related to the Jenkins OS plugin.
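The entropy comparison above can be scripted. This is a minimal sketch, assuming a Linux master; the 1000 threshold is a rough heuristic for older kernels (newer kernels report a fixed value around 256, which is normal there), so treat the warning as a hint, not a verdict:

```shell
# Read the kernel's available-entropy estimate; SSH key exchange can stall
# when the master blocks waiting for entropy.
entropy=$(cat /proc/sys/kernel/random/entropy_avail)
echo "entropy_avail: $entropy"

# Rough heuristic threshold for older kernels (assumption, not a hard rule).
if [ "$entropy" -lt 1000 ]; then
  echo "entropy looks low; SSH key exchange on the master may block"
fi
```

Capturing this value at the moment of a failed autolaunch and again during normal operation would show whether entropy starvation correlates with the slow connections.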
Still fiddling with this. After the restart, both agents had problems connecting; the timeout and exception were the same. But I have never noticed the problem on the static agent, and there is one difference: the static agent is reconnected successfully on the second attempt, while the OpenStack agent is just left disconnected.
Is the static node configured to reconnect? How many times, and with what delay?
`Connection Timeout in Seconds` is empty, `Maximum Number of Retries` is 0 and `Seconds To Wait Between Retries` is 0. I don't see any other options that configure reconnect.
In the openstack cloud plugin, this is currently hardcoded to 5 retries, 15 seconds apart. We needed some finite numbers to prevent it from hanging forever in case it never succeeds, for instance due to a misconfigured slave. Chances are that is not enough for you, and your static agent, configured for 0 retries (retry forever), succeeds after trying for a longer time. How long does your static slave have to retry before it succeeds? You should find that in the slave log(s). I speculate the openstack nodes might give up too soon for your setup, and in that case I would be interested why it takes so long for your master to make SSH connections...
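The hardcoded policy described above amounts to something like the following sketch (`try_connect` is a placeholder for the actual launcher call, not plugin code); with 5 attempts and 15 seconds between them, the node gives up after at most about 75 seconds of retrying:

```shell
# Placeholder for the real SSH launch; always fails here, for illustration.
try_connect() { false; }

retries=5   # hardcoded in the plugin, per the comment above
delay=15    # seconds between attempts
attempt=1
connected=0
while [ "$attempt" -le "$retries" ]; do
  if try_connect; then
    connected=1
    echo "connected on attempt $attempt"
    break
  fi
  echo "attempt $attempt failed; would wait ${delay}s"
  # sleep "$delay"  # omitted in this sketch; real total wait is at most 5 * 15 = 75 s
  attempt=$((attempt + 1))
done
[ "$connected" -eq 1 ] || echo "giving up after $retries attempts; node left disconnected"
```

A static node with 0 retries (retry forever) has no such cap, which would explain why it eventually comes up while the OpenStack node stays down.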
What slave logs do you mean? /var/log/auth.log?
@balous, sorry, I meant the `JENKINS_HOME/logs/slaves/SLAVE_NAME/slave.log*` files.
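To see how long a node retried, the connection-related lines in those logs can be skimmed like this (a sketch; the default `JENKINS_HOME` path and the grep pattern are assumptions about a typical setup):

```shell
# Default install location is an assumption; override via the environment.
JENKINS_HOME=${JENKINS_HOME:-/var/lib/jenkins}

for log in "$JENKINS_HOME"/logs/slaves/*/slave.log; do
  if [ ! -f "$log" ]; then
    echo "no slave logs found under $JENKINS_HOME/logs/slaves"
    break
  fi
  echo "== $log =="
  # Show recent connection/retry/timeout lines; timestamps reveal retry duration.
  grep -iE 'connect|retry|timeout' "$log" | tail -n 20
done
```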
There is just one file, and it contains exactly the same content I've seen in the Jenkins GUI and posted in the opening comment.
Every time I restart Jenkins (e.g. for an update), existing agents are not relaunched; I need to delete them or launch them manually.
The agent log contains the following error:
I've found a solution to a similar problem that applies to manually created agents. While not directly applicable to dynamically created agents, I guess it gives a notion of what the problem could be.