We ran into this as well. So you are running 0.10.4 with sub_timeout set to 0, and pub_refresh set to False on the master, correct? In the minion's minion.py, around line 590, the zmq.IDENTITY is only set if sub_timeout is turned on; can you change your minions to set the identity unconditionally? Also, you are on 2.2.0 for both zeromq and pyzmq, correct?
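Roughly, the change I am asking for is to drop the gate around the setsockopt call, something like this (a sketch of the relevant lines in minion.py, not an exact patch):

```python
# Current behaviour: the SUB socket only gets an identity when sub_timeout is on
if self.opts['sub_timeout']:
    socket.setsockopt(zmq.IDENTITY, self.opts['id'])

# Requested behaviour: always set the identity on the SUB socket
socket.setsockopt(zmq.IDENTITY, self.opts['id'])
```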
I'll be succinct
Yes, 0.10.4 on all minions and the master.
On the master: pub_refresh: False
On the minion: sub_timeout: 0
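For completeness, here is roughly where those settings live (assuming the default config paths; a sketch, not my exact files):

```yaml
# /etc/salt/master  (assumed default path)
pub_refresh: False

# /etc/salt/minion  (assumed default path)
# 0 turns the subscriber timeout off, which is why the zmq.IDENTITY branch
# in minion.py gets skipped.
sub_timeout: 0
```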
I made the change below on one box and will copy it to the others.
--- minion.py 2012-11-30 15:21:04.295266297 +1100
+++ minion.py-orig 2012-11-30 15:20:44.423268818 +1100
@@ -586,8 +586,8 @@
         epoller = zmq.Poller()
         socket = context.socket(zmq.SUB)
         socket.setsockopt(zmq.SUBSCRIBE, '')
-        #if self.opts['sub_timeout']:
-        socket.setsockopt(zmq.IDENTITY, self.opts['id'])
+        if self.opts['sub_timeout']:
+            socket.setsockopt(zmq.IDENTITY, self.opts['id'])
         socket.connect(self.master_pub)
         poller.register(socket, zmq.POLLIN)
         epoller.register(epull_sock, zmq.POLLIN)
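Note the diff above was taken with the patched file as the old side (minion.py vs minion.py-orig), so the '-' lines are the new state. Applied, the subscriber setup in minion.py ends up roughly like this (a sketch of the patched region, not verbatim source):

```python
# Patched region of the minion's subscriber setup (sketch): the identity is
# now set unconditionally instead of only when sub_timeout is enabled.
socket = context.socket(zmq.SUB)
socket.setsockopt(zmq.SUBSCRIBE, '')
socket.setsockopt(zmq.IDENTITY, self.opts['id'])  # always, per the request above
socket.connect(self.master_pub)
poller.register(socket, zmq.POLLIN)
```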
||/ Name         Version                Description
+++-============-======================-======================================================
ii  libzmq1      2.2.0-1chl1~precise1   ZeroMQ lightweight messaging kernel (shared library)
ii  python-zmq   2.1.11-1               Python bindings for 0MQ library
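As a cross-check, the versions pyzmq actually reports can be read straight from Python (a quick sketch):

```python
# Print the libzmq and pyzmq versions the minion's Python will actually use.
import zmq

print "libzmq:", zmq.zmq_version()    # the C library, 2.2.0 per the dpkg output above
print "pyzmq: ", zmq.pyzmq_version()  # the Python binding, 2.1.11-1 per dpkg
```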
PS: Should I upgrade to 10.5 before testing this change to minion.py? I'll await your advice, Thomas.
Thanks again!
Regards David
Yes, I am wondering if 0.10.5 minions and master will hold the connection.
Something has gone awry with my github markdown there.
OK, I will endeavour to upgrade all minions and the master.
Had a funny experience just now. I tried salt-call on two of the minions that fail to respond to the master, and they executed commands fine; they even did a highstate. I'm not sure if the minions keep a cache of the configs for when they can't contact the master. A salt-call test.ping returned "local: True" on said minions.
Right, the minions can still communicate back up to the master via the ret port; it is the publish connection that is failing.
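To make the two channels concrete, here is a minimal pyzmq sketch of what the minion holds open (illustration only, not Salt's actual code; the master hostname is a placeholder and 4505/4506 are the default publish and ret ports):

```python
# Illustration of the minion's two connections to the master (not Salt source).
import zmq

context = zmq.Context()

# Publish channel (master -> minion job broadcasts) -- the side that is failing.
sub = context.socket(zmq.SUB)
sub.setsockopt(zmq.SUBSCRIBE, '')
sub.setsockopt(zmq.IDENTITY, 'my-minion-id')   # hypothetical minion id
sub.connect('tcp://saltmaster:4505')           # placeholder hostname, default pub port

# Ret channel (minion -> master returns) -- the path that is still working here,
# which is why results can still reach the master.
req = context.socket(zmq.REQ)
req.connect('tcp://saltmaster:4506')           # placeholder hostname, default ret port
```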
Oh right. Ok.
I'll get the environment up to 10.5 then and see how we go. I don't know why I am having such a drama with simply maintaining a connection. Looking forward to getting onto more exciting bug reporting :)
Thanks Thomas.
Thank you for your tireless work here!
Oh, no problem. I see a great future for this project and am so glad to be involved in a small way.
Just a note, I have all minions and the master up to version 10.5.
Let's see how this goes.
Still getting these though. But this is for another ticket.
State: - pkg
Name:      ssmtp
Function:  installed
    Result:    False
    Comment:   An exception occured in this state: Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.7/salt/state.py", line 884, in call
    *cdata['args'], **cdata['kwargs'])
  File "/usr/lib/pymodules/python2.7/salt/states/pkg.py", line 60, in installed
    cver = __salt__['pkg.version'](name)
  File "/usr/lib/pymodules/python2.7/salt/modules/apt.py", line 73, in version
    pkgs = list_pkgs(name)
  File "/usr/lib/pymodules/python2.7/salt/modules/apt.py", line 302, in list_pkgs
    ret[cols[3]] = cols[4]
IndexError: list index out of range
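The IndexError points at the column parsing in apt.py's list_pkgs: ret[cols[3]] = cols[4] assumes every line of the package listing splits into at least five fields, so any shorter line blows up. A defensive version of that kind of parse would look something like this (just a sketch of the idea, not Salt's actual fix):

```python
# Sketch: tolerate short lines when parsing columned package-list output.
# The column positions mirror the traceback above; the real output format
# of the dpkg/apt command being parsed is an assumption here.
def parse_pkg_lines(lines):
    ret = {}
    for line in lines:
        cols = line.split()
        if len(cols) < 5:
            # Header, separator, or wrapped line: skip it instead of
            # indexing past the end (the IndexError in the traceback).
            continue
        ret[cols[3]] = cols[4]
    return ret
```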
Thanks for filing this one :)
This is the next evolution in my issue with minions dropping their connection to the master on an AWS setup (inside a VPC).
Issue #2404 is where I first reported this. There I had the same symptoms but a different cause: the minions had many orphaned salt-minion processes that blocked the connection back to the master.
Now, with the same symptoms, the minions are not in an orphaned state but are simply not trying to connect to the master at all. I have 7 out of 9 systems not connecting after the environment was left untouched for several days.
Here is a tcpdump on the minion system in question, on both ports 4505 and 4506:
This was left for about 10 mins.
An strace shows very little activity as well:
The line here...
...is interesting and hopefully provides a quick solution to my issue.