timja / jenkins-gh-issues-poc-06-18

0 stars 0 forks source link

[JENKINS-45290] windows slaves stops being able to connect to master #9225

Open timja opened 7 years ago

timja commented 7 years ago

After a while (how long seems to be random, but between days to ~4 weeks after jenkins is started) windows slaves stop being able to connect to jenkins master.
After a restart of the jenkins service, all slaves can connect without issues. When this issue occur, all existing connections continue to work.
There are no issues marking nodes as temporarily offline. But if I reboot a windows slave it can't connect at all until the jenkins master is restarted.
We tried connecting the slave to another jenkins master (we copied all data from our production jenkins so we got an exact clone) and the slave has no problem connecting
to the staging jenkins master. We have seen this issue on that server as well.
The only way for us to resolve this is to restart the jenkins service on the jenkins master. As it starts all jenkins slaves connect automatically and work again.
This isn't a good solution as we have to abort all jobs in order to restart the service. And for a production server that leads to many (very) unhappy users.

jenkins-slave.err.log: From launching the java web start until the problem occur. It just hangs there, doing nothing.
jul 04, 2017 1:46:24 EM hudson.remoting.jnlp.Main createEngine
INFO: Setting up slave: jenkins_slave
jul 04, 2017 1:46:24 EM hudson.remoting.jnlp.Main$CuiListener
INFO: Jenkins agent is running in headless mode.
jul 04, 2017 1:46:24 EM hudson.remoting.jnlp.Main$CuiListener status
INFO: Locating server among http://jenkins_master.company.domain/, http://jenkins_master.company.domain:8080/
jul 04, 2017 1:46:25 EM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve
INFO: Remoting server accepts the following protocols: [JNLP4-connect, JNLP-connect, Ping, JNLP2-connect]
jul 04, 2017 1:46:25 EM hudson.remoting.jnlp.Main$CuiListener status
INFO: Agent discovery successful
Agent address: jenkins_master.company.domain
Agent port: 49187
Identity: 3d:69:2e:de:e9:84:8b:2b:fd:7b:ad:8c:00:ea:cb:32
jul 04, 2017 1:46:25 EM hudson.remoting.jnlp.Main$CuiListener status
INFO: Handshaking
jul 04, 2017 1:46:25 EM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connecting to jenkins_master.company.domain:49187
jul 04, 2017 1:46:25 EM hudson.remoting.jnlp.Main$CuiListener status
INFO: Trying protocol: JNLP4-connect

 

Output from "java -jar slave.jar -jnlpUrl http://jenkins_master.company.domain/computer/jenkins_slave/slave-agent.jnlp -secret very_secret_hex_string":
jul 04, 2017 2:10:18 EM hudson.remoting.jnlp.Main createEngine
INFO: Setting up slave: jenkins_slave
jul 04, 2017 2:10:18 EM hudson.remoting.jnlp.Main$CuiListener
INFO: Jenkins agent is running in headless mode.
jul 04, 2017 2:10:18 EM hudson.remoting.jnlp.Main$CuiListener status
INFO: Locating server among http://jenkins_master.company.domain/
jul 04, 2017 2:10:18 EM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve
INFO: Remoting server accepts the following protocols: [JNLP4-connect, JNLP-connect, Ping, JNLP2-connect]
jul 04, 2017 2:10:18 EM hudson.remoting.jnlp.Main$CuiListener status
INFO: Agent discovery successful
Agent address: jenkins_master.company.domain
Agent port: 49187
Identity: 3d:69:2e:de:e9:84:8b:2b:fd:7b:ad:8c:00:ea:cb:32
jul 04, 2017 2:10:18 EM hudson.remoting.jnlp.Main$CuiListener status
INFO: Handshaking
jul 04, 2017 2:10:18 EM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connecting to jenkins_master.company.domain:49187
jul 04, 2017 2:10:18 EM hudson.remoting.jnlp.Main$CuiListener status
INFO: Trying protocol: JNLP4-connect

tcpdump from master side over the course of 5 minutes:
root@jenkins_master:~# date; tcpdump -nnvvv src slave_IP and dst port 49187 -s0 -vv -X -c 1000; date
Tue Jul 4 13:51:00 CEST 2017
tcpdump: listening on ens5f0, link-type EN10MB (Ethernet), capture size 262144 bytes
13:51:26.751986 IP (tos 0x0, ttl 128, id 24431, offset 0, flags [DF], proto TCP (6), length 52)
slave_IP.58214 > master_IP.49187: Flags [S], cksum 0xec23 (correct), seq 4123494006, win 8192, options [mss 1460,nop,wscale 8,nop,nop,sackOK], length 0
0x0000: 4500 0034 5f6f 4000 8006 5831 0a21 14a6 E..4_o@...X1.!..
0x0010: 0a21 1a3c e366 c023 f5c7 8676 0000 0000 .!.<.f.#...v....
0x0020: 8002 2000 ec23 0000 0204 05b4 0103 0308 .....#..........
0x0030: 0101 0402 ....
13:51:26.752189 IP (tos 0x0, ttl 128, id 24432, offset 0, flags [DF], proto TCP (6), length 40)
slave_IP.58214 > master_IP.49187: Flags [.], cksum 0xa8dd (correct), seq 4123494007, ack 2362185277, win 256, length 0
0x0000: 4500 0028 5f70 4000 8006 583c 0a21 14a6 E..(_p@...X<.!..
0x0010: 0a21 1a3c e366 c023 f5c7 8677 8ccc 163d .!.<.f.#...w...=
0x0020: 5010 0100 a8dd 0000 0000 0000 0000 P.............
13:51:26.803856 IP (tos 0x0, ttl 128, id 24433, offset 0, flags [DF], proto TCP (6), length 64)
slave_IP.58214 > master_IP.49187: Flags [P.], cksum 0xc27a (correct), seq 0:24, ack 1, win 256, length 24
0x0000: 4500 0040 5f71 4000 8006 5823 0a21 14a6 E..@_q@...X#.!..
0x0010: 0a21 1a3c e366 c023 f5c7 8677 8ccc 163d .!.<.f.#...w...=
0x0020: 5018 0100 c27a 0000 0016 5072 6f74 6f63 P....z....Protoc
0x0030: 6f6c 3a4a 4e4c 5034 2d63 6f6e 6e65 6374 ol:JNLP4-connect
13:51:26.804301 IP (tos 0x0, ttl 128, id 24434, offset 0, flags [DF], proto TCP (6), length 45)
slave_IP.58214 > master_IP.49187: Flags [P.], cksum 0x1c72 (correct), seq 24:29, ack 1, win 256, length 5
0x0000: 4500 002d 5f72 4000 8006 5835 0a21 14a6 E..-_r@...X5.!..
0x0010: 0a21 1a3c e366 c023 f5c7 868f 8ccc 163d .!.<.f.#.......=
0x0020: 5018 0100 1c72 0000 0003 4143 4b00 P....r....ACK.
4 packets captured
Tue Jul 4 13:55:57 CEST 2017

So it seems that the slave can try to connect, but the master refuses to respond.

We have tried asking google for help, but without luck

We run the slave as a windows service, but we also see the same issue via Java Web Start, and the command line suggested by jenkins
(java -jar slave.jar -jnlpUrl http://jenkins_master.company.domain/computer/jenkins_slave/slave-agent.jnlp -secret very_secret_hex_string)

Since we see the same issue when starting the slave as a windows service, java web start and commandline we're happy to support you on the way that is easiest to troubleshoot.

 

We're happy to provide additional information if need be.


Originally reported by ekerhag, imported from: windows slaves stops being able to connect to master
  • status: Open
  • priority: Critical
  • resolution: Unresolved
  • imported: 2022/01/10
timja commented 7 years ago

oleg_nenashev:

I need stactraces from both Jenkins Master and the agent for the time when the connection hangs.

timja commented 7 years ago

ekerhag:

I have attached the log files.

timja commented 7 years ago

oleg_nenashev:

Have you or your IT applied any security configuration changes on the machine? "An established connection was aborted by the software in your host machine" from antivirus or Firewall. So this is a system issue.

E.g. see https://social.msdn.microsoft.com/Forums/en-US/212feb8f-cf96-4561-9953-40a21d21ea47/exception-an-established-connection-was-aborted-by-the-software-in-your-host-machine?forum=netfxnetcom

timja commented 7 years ago

ekerhag:

That is indeed interesting, but doesn't seem to overlap with the problems we see.

After the issue occur, none of the windows slaves can connect from being in a disconnected state. We have 50+ windows slaves on this jenkins master.

If we restart the jenkins master, the problem goes away on all slaves. It seems very improbable to me that this would be a system issue based on that information.

We can also direct the slaves to a different master and they have no trouble connecting to that one.

I pasted the TCPdump in the description of this issue, showing all traffic sent over 5 minutes. The master received 4 packets, but sent none.

timja commented 7 years ago

oleg_nenashev:

In such case the connection termination probably makes the connection receiver to go into a bogus state. Curious thing is that it happens for both JNLP2 and JNLP4 which are supposed to use different NIO hubs. Would it be possible to get logs before 3:06 PM, probably there is a root cause error there. I would also appreciate a Master's thread dump for the time the agents cannot connect.

 

I also see another minor issue in REST API, will create a separate defect for it:

java.lang.Thread.run(Thread.java:748) Caused by: org.kohsuke.stapler.export.NotExportableException: class java.nio.channels.ClosedChannelException doesn't have @ExportedBean so cannot write hudson.slaves.OfflineCause$ChannelTermination.cause at org.kohsuke.stapler.export.Model.(Model.java:79) at org.kohsuke.stapler.export.ModelBuilder.get(ModelBuilder.java:51) at

timja commented 7 years ago

ekerhag:

We have logs going back to January. I haven't tried, but I expect that the current log (from Jul 2, 2017 06.35 AM) is too large to just attach to this ticket. Uncompressed it is 24GB, compressed it should be between 2-3 GB.

How should I send this log file to you?

Can you please explain, or point me to documentation, on how to obtain the Master's thread dump?

timja commented 7 years ago

ekerhag:

We have been forced to restart jenkins. As I mentioned earlier, this works as a temporary workaround since all windows slaves are able to connect again.

We will try to set up monitoring for the next time this issue occur so we have less logs to dig through. Are there any logs from a troubleshooting perspective that would be interesting to follow where we can't look up the information afterwards?

timja commented 7 years ago

oleg_nenashev:

> Can you please explain, or point me to documentation, on how to obtain the Master's thread dump?

You can use low-level tools like jstack, but I would rather recommend generating a support bundle using https://wiki.jenkins.io/display/JENKINS/Support+Core+Plugin

timja commented 7 years ago

ekerhag:

I will install the support core plugin on Monday (July 10th).

timja commented 7 years ago

ekerhag:

Support Core Plugin is now installed

timja commented 7 years ago

oleg_nenashev:

Waiting for the bundles then, thanks!

timja commented 7 years ago

oleg_nenashev:

ping

timja commented 7 years ago

ekerhag:

Apologies for the late reply, I have been on vacation.

 

We are still waiting on the issue to reappear so we can generate the support bundles for you.

timja commented 7 years ago

ronen_tef:

Any update on this one? we are facing same issue here with Win Slaves.

timja commented 7 years ago

oleg_nenashev:

I cannot investigate the issue without master/agent threaddumps

ronen_tef if you are ready to provide them, it would help

timja commented 7 years ago

ekerhag:

I have not experienced the issue since I installed the support core plugin.

I'm waiting for the issue too occur again so I can provide the logs necessary.

timja commented 7 years ago

ronen_tef:

oleg_nenashev could you please let me know how to take those threaddumps?

timja commented 7 years ago

ronen_tef:

Okay, I'm attaching 2 outputs of the Master thread dump,

one is via Jenkins console and one using java jconsole directly from the unix command line.

As for the Win slave I can't take any as the proccess even not started, the only thing I could provide is an exeption output.

timja commented 7 years ago

ronen_tef:

Any progress/estimation for a fix?

timja commented 7 years ago

oleg_nenashev:

ronen_tef an agent log would be useful just to have a full snapshots. If this is "java.io.IOException: An established connection was aborted by the software in your host machine" again, likely it's a whatever issue in your system.

I cannot provide any ETA until the bug is triaged at least. But I would not expect the fix in October even if the issue is confirmed (unless it's a quick fix).

timja commented 7 years ago

ronen_tef:

oleg_nenashev as already commented, I can't provide any agent log as the agent is not even started therfore a log is not created.

"...again, likely it's a whatever issue in your system...." in this case, how could it be that the same is not occur with ver. 2.46.1 ? just once upgrading to ver. 2.78.

The behavier should be consistent I assume.

timja commented 7 years ago

oleg_nenashev:

ronen_tef Currently I cannot  confirm you have the same issue as ekerhag. If you are sure it's the same, you should be able to provide agent connection stdout/stderr at least (from either slave.jar STDOUT or wrapper logs).

 

timja commented 6 years ago

oleg_nenashev:

Unfortunately I have no capacity to work on Remoting in medium term, so I will unassign it and let others to take it. If somebody is interested to submit a pull request, I will be happy to help to get it reviewed and released.

timja commented 4 years ago

stevekipping:

The same issue has started happening for me between a Windows master and Windows slave. Nothing we do with the slave fixes it, but restarting the master service allows the connection to be made and it works again.

Jenkins master 2.190.1, Java  1.8.0_144