Closed · d3xt3r01 closed this issue 6 years ago
Tried setting tcp_keepalive: True on the master (I know it's true by default, but thought to give it a shot) with no luck. The minion sees the connection with keepalive; the master does not. I'm sure no one's messing with the packets. I applied the same settings on all the minions. It's this Azure box that's giving me headaches.
# on master
tcp 0 0 x.x.1.95:4505 x.x.196.12:51714 ESTABLISHED 28096/python2.7 off (0.00/0/0)
tcp 0 0 x.x.1.95:4505 x.x.196.12:51645 ESTABLISHED 28096/python2.7 off (0.00/0/0)
# on minion
tcp 0 0 x.x.0.5:51714 x.x.171.49:4505 ESTABLISHED 51787/python keepalive (12.51/0/2)
We hear all the time about this sort of problem inside Azure. It sounds like they have some networking equipment that might be causing these sorts of issues. It's hard to pin this down to an issue with Salt, given that it's only one side which thinks that the TCP connection is established.
Are there any workarounds of some sort? In my case, only one minion is in Azure.
Can the master have keepalives then? That should close useless connections...
Something like this on the master:
# in /usr/lib/python2.7/site-packages/salt
import zmq

context = zmq.Context(1)
context.setsockopt(zmq.TCP_KEEPALIVE, 1)  # this to be added
Of course, this should be configurable, just like on the minion.
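For reference, the OS-level equivalent of what's being asked for looks roughly like this with a plain socket (a sketch assuming Linux, where the `TCP_KEEPIDLE`/`TCP_KEEPINTVL`/`TCP_KEEPCNT` constants are available; pyzmq exposes analogous `zmq.TCP_KEEPALIVE*` options that map onto these):

```python
import socket

# Sketch: enabling and tuning TCP keepalive on a plain Linux socket.
# The numeric values here are illustrative, not Salt defaults.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)    # turn keepalive on
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)  # first probe after 60s idle
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10) # then probe every 10s
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)    # declare dead after 5 failures

print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))  # → 1
s.close()
```

With options like these set on the master's side, the kernel itself would eventually notice a dead peer and tear down the connection, which is exactly what the `keepalive (12.51/0/2)` column in the minion's netstat output above reflects.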
I'm not sure why keepalives would make this problem any better. If an intermediate device is holding open the connection, wouldn't a keepalive only make this problem worse?
At least the master would know to close the connection. Also, if it were customizable, we could lower the settings below the threshold at which Azure's NAT tables expire...
Still, the keepalive option should be available on the master too, just for consistency.
Also, I don't think an intermediate device is keeping the connection open. What I think happens is:

1. The minion sends its keepalive too late; somewhere, the NAT table entry has already expired.
2. The master never receives it, but since it has no keepalive on its side of the TCP connection, it doesn't care whether the minion died, rebooted, or no longer exists.
3. The minion sees that it can't ping the master and starts a new connection.
4. Two connections now appear on the master.
I tested this by running tcpdump on the master against a socket that was supposed to be dead (but that the master saw as alive). Nothing was going through there, so the master, without keepalive, had no way of knowing that the socket had died on the other end.
The cycle repeats, and tons of connections stay open on the master.
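The timing mismatch described above can be sketched with back-of-the-envelope numbers (both values are assumptions: the Linux default for `net.ipv4.tcp_keepalive_time`, and Azure's commonly documented ~4-minute default idle timeout for NAT/load-balancer flows — check your own deployment):

```python
# Hypothetical figures illustrating why default keepalives arrive "too late".
LINUX_DEFAULT_KEEPALIVE_IDLE = 7200   # seconds before the first probe (Linux default)
AZURE_NAT_IDLE_TIMEOUT = 4 * 60       # seconds before the NAT entry expires (assumed)

# With defaults, the first keepalive probe fires long after the NAT entry
# is gone, so the probe never reaches the other side.
print(LINUX_DEFAULT_KEEPALIVE_IDLE > AZURE_NAT_IDLE_TIMEOUT)  # → True

# Lowering the idle time well below the NAT timeout keeps the mapping alive:
tcp_keepalive_idle = 60
print(tcp_keepalive_idle < AZURE_NAT_IDLE_TIMEOUT)  # → True
```

This matches the fix that ends up working below: probes every 60 seconds stay comfortably inside the NAT's idle window.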
@d3xt3r01 it looks like there are keepalive settings as documented here that you could configure in this case. Does that help your use case?
This was fixed by doing:
tcp_keepalive: True
tcp_keepalive_idle: 60
master_alive_interval: 30
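For context, here is an annotated version of that config (the comments are my reading of the settings; these options belong in the minion config, as clarified below):

```yaml
tcp_keepalive: True        # enable TCP keepalive on the minion's connections
tcp_keepalive_idle: 60     # send the first keepalive probe after 60s of idle
master_alive_interval: 30  # check every 30s that the master is still reachable
```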
Can't TCP keepalive be added to the master so the defaults would work? Just for consistency.
@d3xt3r01 was the above set in the minion config or the master config?
@UtahDave Hey! It's on the minions. The master doesn't have/support tcp_keepalive! :-(
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.
There is just ONE out of 10 minions that has this issue. After a few days, hundreds of "ESTABLISHED" connections linger on the master, yet only one on the minion.