
Minions dropping connection to master :: 10.4 (strace and tcpdump inc) #2753

Closed: DaveQB closed this issue 11 years ago

DaveQB commented 11 years ago

This is the next evolution of my issue with minions dropping their connection to the master on an AWS setup (inside a VPC).

Issue #2404 was where I first reported this. There I had similar symptoms but a different cause: the minions had many orphaned salt-minion processes that blocked a connection back to the master.

Now, with the same symptoms, the minions are not in an orphaned state but are simply not trying to connect to the master at all. 7 out of 9 systems are not connecting after the environment has been left untouched for several days.

Here is a tcpdump on the minion system in question, on both ports 4505 and 4506:


root@prd-scripts01-ap-southeast-1a:~# tcpdump   port 4506 and port 4505   -s 0   -i any -vv
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes

^C
0 packets captured
0 packets received by filter
0 packets dropped by kernel
root@prd-scripts01-ap-southeast-1a:~# 

This was left for about 10 mins.
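
As an aside, a capture filter of "port 4506 and port 4505" will normally match nothing for this traffic, since a given packet is to or from one of those ports but not both; "port 4505 or port 4506" is the usual form, so the empty capture above may partly be a filter artifact. Independent of tcpdump, a plain TCP connect from the minion is a quick way to check whether the master's publish (4505) and return (4506) ports are reachable at all. A minimal sketch, with 'salt-master' as a placeholder for the real master address (not Salt code):

import socket

# Try to open a TCP connection to each Salt port on the master and report the result.
for port in (4505, 4506):
    try:
        conn = socket.create_connection(('salt-master', port), timeout=5)
        print('port %d reachable' % port)
        conn.close()
    except socket.error as exc:
        print('port %d unreachable: %s' % (port, exc))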

An strace shows very little activity as well:


root@prd-scripts01-ap-southeast-1a:~# ps aux|grep minion
root     24824  0.0  0.0   8104   920 pts/0    S+   13:34   0:00 grep --color=auto minion
root     31243  0.0  2.0 385300 35172 ?        Ssl  Nov14   0:19 /usr/bin/python /usr/bin/salt-minion
root@prd-scripts01-ap-southeast-1a:~# strace -f -p 31243
Process 31243 attached with 3 threads - interrupt to quit
[pid 31265] epoll_wait(11,  
[pid 31264] epoll_wait(8,  
[pid 31243] restart_syscall(<... resuming interrupted call ...>) = 0
[pid 31243] poll([{fd=22, events=POLLIN}], 1, 0) = 0 (Timeout)
[pid 31243] clock_gettime(CLOCK_MONOTONIC, {2201983, 601093596}) = 0
[pid 31243] select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
[pid 31243] stat("/var/cache/salt/module_refresh", 0x7fffe218dad0) = -1 ENOENT (No such file or directory)
[pid 31243] clock_gettime(CLOCK_MONOTONIC, {2201983, 652031253}) = 0
[pid 31243] poll([{fd=18, events=POLLIN}], 1, 0) = 0 (Timeout)
[pid 31243] poll([{fd=18, events=POLLIN}], 1, 0) = 0 (Timeout)
[pid 31243] clock_gettime(CLOCK_MONOTONIC, {2201983, 652639191}) = 0
[pid 31243] poll([{fd=18, events=POLLIN}], 1, 1) = 0 (Timeout)
[pid 31243] poll([{fd=18, events=POLLIN}], 1, 0) = 0 (Timeout)
[pid 31243] clock_gettime(CLOCK_MONOTONIC, {2201983, 654233943}) = 0
[pid 31243] clock_gettime(CLOCK_MONOTONIC, {2201983, 654545125}) = 0
[pid 31243] poll([{fd=22, events=POLLIN}], 1, 0) = 0 (Timeout)
[pid 31243] poll([{fd=22, events=POLLIN}], 1, 0) = 0 (Timeout)
[pid 31243] clock_gettime(CLOCK_MONOTONIC, {2201983, 655205322}) = 0
[pid 31243] poll([{fd=22, events=POLLIN}], 1, 60000

The line here...


[pid 31243] stat("/var/cache/salt/module_refresh", 0x7fffe218dad0) = -1 ENOENT (No such file or directory)

...is interesting and hopefully provides a quick solution to my issue.
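
For context, that stat() is most likely the minion's periodic check for a module-refresh flag file, and ENOENT simply means no refresh has been requested, which is the normal idle case. A rough sketch of the idea (not Salt's actual code):

import os

def check_module_refresh(cachedir='/var/cache/salt'):
    # The minion polls for this flag file each loop; if it exists, the minion
    # reloads its modules and removes the file. Absence of the file is normal.
    path = os.path.join(cachedir, 'module_refresh')
    if os.path.isfile(path):
        os.remove(path)
        # ... trigger a reload of the minion's modules here ...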

thatch45 commented 11 years ago

We ran into this as well. So you are running 0.10.4, with sub_timeout set to 0, correct? And pub_refresh set to False on the master? In the minion's minion.py file, on line 590, the zmq.IDENTITY is only set if sub_timeout is turned on; can you change it in your minions to have the identity always on? Also, you are on 2.2.0 for both zeromq and pyzmq, correct?

DaveQB commented 11 years ago

I'll be succinct:

Yes on 10.4 for all minions and the master.

On master pub_refresh: False

On minion: sub_timeout: 0

Made the below change on one box; I will copy it to the others.


--- minion.py   2012-11-30 15:21:04.295266297 +1100
+++ minion.py-orig      2012-11-30 15:20:44.423268818 +1100
@@ -586,8 +586,8 @@
         epoller = zmq.Poller()
         socket = context.socket(zmq.SUB)
         socket.setsockopt(zmq.SUBSCRIBE, '')
-        #if self.opts['sub_timeout']:
-        socket.setsockopt(zmq.IDENTITY, self.opts['id'])
+        if self.opts['sub_timeout']:
+            socket.setsockopt(zmq.IDENTITY, self.opts['id'])
         socket.connect(self.master_pub)
         poller.register(socket, zmq.POLLIN)
         epoller.register(epull_sock, zmq.POLLIN)
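
In effect, the change above always gives the SUB socket an identity before it connects, instead of only when sub_timeout is non-zero. A minimal standalone pyzmq sketch of that behaviour (the address and id are placeholder values, not Salt's actual code):

import zmq

context = zmq.Context()
socket = context.socket(zmq.SUB)
socket.setsockopt(zmq.SUBSCRIBE, b'')              # subscribe to all published jobs
socket.setsockopt(zmq.IDENTITY, b'my-minion-id')   # identity set unconditionally
socket.connect('tcp://salt-master:4505')           # master's publish port

poller = zmq.Poller()
poller.register(socket, zmq.POLLIN)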

||/ Name         Version                Description
+++-============-======================-=====================================================
ii  libzmq1      2.2.0-1chl1~precise1   ZeroMQ lightweight messaging kernel (shared library)
ii  python-zmq   2.1.11-1               Python bindings for 0MQ library

PS: Should I upgrade to 10.5 before testing this change in the minion.py file? I'll await your advice, Thomas.

Thanks again!

Regards David

thatch45 commented 11 years ago

Yes, I am wondering if 0.10.5 minions and master will hold.

DaveQB commented 11 years ago

Something has gone awry with my github markdown there.

Ok I will endeavour to upgrade all minions and masters.

Had a funny experience just now. I tried a salt-call on 2 of the minions that fail to respond to the master, and they executed commands fine; they even did a highstate. I'm not sure whether the minions keep a cache of the configs for when they can't contact the master. A salt-call test.ping returned "local: True" on said minions.

thatch45 commented 11 years ago

Right, the minions can still communicate back up to the master via the ret port; it is the publish connection that is failing
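
For readers following the distinction: jobs go out to minions over the master's publish socket on 4505, while results (and salt-call's returns) travel over a separate socket on 4506, so one channel can be healthy while the other is not. A rough pyzmq sketch of the two connections as seen from the minion side (addresses are placeholders, not Salt's actual code):

import zmq

context = zmq.Context()

# Publish channel (master -> minions): jobs are broadcast on 4505.
# This is the connection that is failing here.
sub = context.socket(zmq.SUB)
sub.setsockopt(zmq.SUBSCRIBE, b'')
sub.connect('tcp://salt-master:4505')

# Return channel (minion -> master): results and salt-call traffic use 4506.
# This one is still working, which is why salt-call succeeds.
req = context.socket(zmq.REQ)
req.connect('tcp://salt-master:4506')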

DaveQB commented 11 years ago

Oh right. Ok.

I'll get the environment up to 10.5 then and see how we go. Don't know why I am having such a drama with simply maintaining a connection. Looking forward to getting onto more exciting bug reporting :)

Thanks Thomas.

thatch45 commented 11 years ago

Thank you for your tireless work here!

DaveQB commented 11 years ago

Oh, no problem. I see a great future in this project and am so glad to be involved in a small way.

DaveQB commented 11 years ago

Just a note, I have all minions and the master up to version 10.5.

Let's see how this goes.


Still getting these, though. But this is for another ticket.


    State: - pkg
    Name:      ssmtp
    Function:  installed
        Result:    False
        Comment:   An exception occured in this state: Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.7/salt/state.py", line 884, in call
    *cdata['args'], **cdata['kwargs'])
  File "/usr/lib/pymodules/python2.7/salt/states/pkg.py", line 60, in installed
    cver = __salt__['pkg.version'](name)
  File "/usr/lib/pymodules/python2.7/salt/modules/apt.py", line 73, in version
    pkgs = list_pkgs(name)
  File "/usr/lib/pymodules/python2.7/salt/modules/apt.py", line 302, in list_pkgs
    ret[cols[3]] = cols[4]
IndexError: list index out of range
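
For what it's worth, that IndexError comes from the apt module splitting each line of dpkg output into columns and indexing cols[3] and cols[4]; a line with fewer columns would blow up. A defensive sketch of that parse, assuming each line looks roughly like "install ok installed <name> <version>" (this is an assumption, not the actual apt.py code):

def list_installed(dpkg_output):
    # Skip lines with fewer than 5 columns (e.g. packages that are known but
    # not installed) instead of raising IndexError on cols[3]/cols[4].
    ret = {}
    for line in dpkg_output.splitlines():
        cols = line.split()
        if len(cols) >= 5 and cols[2] == 'installed':
            ret[cols[3]] = cols[4]
    return ret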

thatch45 commented 11 years ago

Thanks for filing this one :)