saltstack / salt

Software to automate the management and configuration of any infrastructure or application at scale. Get access to the Salt software package repository here:
https://repo.saltproject.io/
Apache License 2.0

[bug] 0.9.8 intermittent minion-crash #1096

Closed - halberom closed this issue 12 years ago

halberom commented 12 years ago

CentOS release 6.2 (Final)
salt-minion-0.9.8-1.el6.noarch
salt-0.9.8-1.el6.noarch
zeromq-2.1.9-1.el6.x86_64

Issue: The salt-minion dies without any errors or stacktrace, leaving the pid file behind.

Output:
root 29911 1 0 04:14 ? 00:00:59 /usr/bin/python /usr/bin/salt-minion -d
root 29113 29911 0 12:59 ? 00:00:00 /usr/bin/python /usr/bin/salt-minion -d
root 29114 29113 0 12:59 ? 00:00:00 ps -efH
2012-04-17 12:59:05,069 [salt.minion ][INFO ] Returning information for job: 20120417125905007698
[root@myserver ~]# date
Tue Apr 17 15:36:36 GMT 2012
[root@myserver ~]# ps aux | grep salt
root 6066 0.0 0.0 105284 884 pts/0 S+ 15:36 0:00 grep salt
[root@lon3sqlnodedb02 ~]# cat /var/run/salt-minion.pid
29911[root@myserver ~]#

It appears to happen during idle periods. I run the occasional cmd.run and the even rarer state.highstate. I catch the issue via a Nagios check for the daemon (5-minute interval).

I've set the minion to log debug output, but I'm not hopeful; from previous reports by others, I believe it still doesn't catch anything.

Thanks

thatch45 commented 12 years ago

Thanks for the info; so the actual minion daemon is crashing. I wonder if this has anything to do with the long-wait healing that is in there. I will try to reproduce this.

thatch45 commented 12 years ago

OK, I found 2 minion traces with minionswarm. I am having a hard time killing minions now; that took a bunch of commits. I am still hunting, though; I have been able to get minions to randomly return None on test.ping.

thatch45 commented 12 years ago

So far no crashes. Still looking, but this might be completely fixed.

halberom commented 12 years ago

That sounds good. You mentioned you've found traces - are you referring to tracebacks? In my experience there have been none in any of these cases.

Here's the log from one today - times are in GMT

[root@myserver ~]# tail -50 /var/log/salt/minion
2012-04-19 03:08:02,534 [salt.minion    ][WARNING ] Starting the Salt Minion
2012-04-19 03:08:02,534 [salt.minion    ][DEBUG   ] Attempting to authenticate with the Salt Master
2012-04-19 03:08:02,535 [salt.crypt     ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion.pem
2012-04-19 03:08:02,577 [salt.crypt     ][DEBUG   ] Decrypting the current master AES key
2012-04-19 03:08:02,578 [salt.crypt     ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion.pem
2012-04-19 03:08:02,645 [salt.minion    ][INFO    ] Authentication with master successful!
2012-04-19 03:08:02,646 [salt.crypt     ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion.pem
2012-04-19 03:08:02,685 [salt.crypt     ][DEBUG   ] Decrypting the current master AES key
2012-04-19 03:08:02,685 [salt.crypt     ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion.pem

Our Nagios alert fired - note there's nothing in the log between 03:08 and 07:52.

[Thu Apr 19 07:52:05 2012] SERVICE ALERT: myserver;check_proc salt-minion;CRITICAL;SOFT;1;PROCS CRITICAL: 0 processes with command name 'salt-minion', UID = 0 (root)
[Thu Apr 19 07:53:05 2012] SERVICE ALERT: myserver;check_proc salt-minion;CRITICAL;SOFT;2;PROCS CRITICAL: 0 processes with command name 'salt-minion', UID = 0 (root)
[Thu Apr 19 07:54:05 2012] SERVICE ALERT: myserver;check_proc salt-minion;CRITICAL;SOFT;3;PROCS CRITICAL: 0 processes with command name 'salt-minion', UID = 0 (root)
[Thu Apr 19 07:55:07 2012] SERVICE ALERT: myserver;check_proc salt-minion;CRITICAL;SOFT;4;PROCS CRITICAL: 0 processes with command name 'salt-minion', UID = 0 (root)
[Thu Apr 19 07:56:06 2012] SERVICE ALERT: myserver;check_proc salt-minion;CRITICAL;HARD;5;PROCS CRITICAL: 0 processes with command name 'salt-minion', UID = 0 (root)

I thought maybe it could be our logrotate

[root@myserver ~]# cat /etc/logrotate.d/salt-minion
/var/log/salt/minion {
    missingok
    notifempty
    rotate 7
    daily
    sharedscripts
    delaycompress
    postrotate
        /sbin/service salt-minion condrestart
    endscript
}

but we have it set up to run at 10 AM GMT:

[root@myserver ~]# cat /etc/crontab
# This file is controlled through Salt, do not modify it locally
##############################################################################
SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=root
HOME=/

# run-parts
01 * * * * root run-parts /etc/cron.hourly
02 10 * * * root run-parts /etc/cron.daily
22 11 * * 0 root run-parts /etc/cron.weekly
42 12 1 * * root run-parts /etc/cron.monthly

thatch45 commented 12 years ago

I am still monitoring minions, but tracebacks don't always get logged and sometimes can only be seen in the foreground. I have also changed some of the zeromq settings, and zeromq can crash almost silently.

thatch45 commented 12 years ago

So far I cannot reproduce this anymore; I think we got it!

halberom commented 12 years ago

Sorry to be the bearer of bad news, but I think this may still be an issue.

I've been seeing minions die sporadically since 0.9.9 (rarer than before), and was hopeful the large bug fixes included in 0.10 would help further.

Unfortunately we're now on 0.10.1, and I've seen 2 minions die in the last two days - exactly the same symptoms.

[root@lon3collector02 ~]# tail /var/log/salt/minion
root     19417     1  0 03:34 ?        00:00:34   /usr/bin/python /usr/bin/salt-minion -d
root     31818 19417  0 11:10 ?        00:00:00     /usr/bin/python /usr/bin/salt-minion -d
root     31819 31818  0 11:10 ?        00:00:00       ps -efH
2012-06-26 11:10:07,042 [salt.minion    ][INFO    ] Returning information for job: 20120626111007010828
2012-06-26 16:05:11,244 [salt.crypt     ][WARNING ] Failed to authenticate message
2012-06-26 16:05:11,244 [salt.minion    ][DEBUG   ] Attempting to authenticate with the Salt Master
2012-06-26 16:05:11,246 [salt.crypt     ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion.pem
2012-06-26 16:05:11,258 [salt.crypt     ][DEBUG   ] Decrypting the current master AES key
2012-06-26 16:05:11,258 [salt.crypt     ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion.pem
2012-06-26 16:05:11,307 [salt.minion    ][INFO    ] Authentication with master successful!
[root@lon3collector02 ~]# ps aux | grep salt
root     31406  0.0  0.0 105284   888 pts/0    S+   16:49   0:00 grep salt
[root@lon3collector02 ~]# ls /var/run/salt-minion.pid 
/var/run/salt-minion.pid
[root@lon3collector02 ~]# 

[root@lon3collector02 ~]# cat /etc/redhat-release 
CentOS release 6.2 (Final)
[root@lon3collector02 ~]# rpm -qa | grep salt
salt-minion-0.10.1-1.el6.noarch
salt-0.10.1-1.el6.noarch

Let me know if there's anything I can do to give further info

halberom commented 12 years ago

Performing a "service salt-minion restart" shows no obvious issues in the log

[root@lon3collector02 ~]# tail -15 /var/log/salt/minion
2012-06-26 16:05:11,244 [salt.minion    ][DEBUG   ] Attempting to authenticate with the Salt Master
2012-06-26 16:05:11,246 [salt.crypt     ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion.pem
2012-06-26 16:05:11,258 [salt.crypt     ][DEBUG   ] Decrypting the current master AES key
2012-06-26 16:05:11,258 [salt.crypt     ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion.pem
2012-06-26 16:05:11,307 [salt.minion    ][INFO    ] Authentication with master successful!
2012-06-26 16:54:25,254 [salt.loader    ][DEBUG   ] Failed to import module rh_ip, this is most likely NOT a problem: No module named pkg_resources
2012-06-26 16:54:25,260 [salt.minion    ][WARNING ] Starting the Salt Minion
2012-06-26 16:54:25,261 [salt.minion    ][DEBUG   ] Attempting to authenticate with the Salt Master
2012-06-26 16:54:25,262 [salt.crypt     ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion.pem
2012-06-26 16:54:25,272 [salt.crypt     ][DEBUG   ] Decrypting the current master AES key
2012-06-26 16:54:25,273 [salt.crypt     ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion.pem
2012-06-26 16:54:25,328 [salt.minion    ][INFO    ] Authentication with master successful!
2012-06-26 16:54:25,330 [salt.crypt     ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion.pem
2012-06-26 16:54:25,346 [salt.crypt     ][DEBUG   ] Decrypting the current master AES key
2012-06-26 16:54:25,346 [salt.crypt     ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion.pem

[root@lon3collector02 ~]# ps aux | grep salt
root     31948  0.2  0.2 523984 26280 ?        Sl   16:54   0:00 /usr/bin/python /usr/bin/salt-minion -d
root     32242  0.0  0.0 105284   892 pts/0    S+   16:56   0:00 grep salt
[root@lon3collector02 ~]# ls /var/run/salt-minion.pid 
/var/run/salt-minion.pid

thatch45 commented 12 years ago

OK, I reopened this. I am wondering if there is a stack trace somewhere that we are missing.

halberom commented 12 years ago

Thank you

I don't believe there's anything wrong with our network, though I'm assuming there would be obvious errors if there were DNS problems or loss of connectivity to the master, etc.

Are there any internal stats I could poll for things like buffers or connection counts?

gordonm commented 12 years ago

@thatch45, I have some strace output from a reproducible minion crash on 0.10.1. Do you have a preferred place for me to upload it (less than 1 MB compressed; maybe email)?

thatch45 commented 12 years ago

That would be fantastic, feel free to email me: thatch45 at gmail.com

thatch45 commented 12 years ago

https://zeromq.jira.com/browse/LIBZMQ-356

That is the closest thing I have found so far. The only thing that looks out of place in this stack trace is the mailbox.cpp assertion. It looks like it happens when initializing a minion execution. I will make some changes to see if we can bypass it, and I will also look more closely into the zeromq assertion.

halberom commented 12 years ago

Thanks for continuing to investigate this. FWIW, most of our minion crashes seem to occur outside of times when any calls are being made to them.

Would something breaking within the initialisation sequence bypass existing stacktrace handling?

thatch45 commented 12 years ago

I don't think so; I suspect there is an issue with the SUB socket in zeroMQ. I don't remember - have you tried running your minions with the option sub_timeout: 0 in the minion config?

thatch45 commented 12 years ago

I added a try/except block around the minion loops with critical logging; hopefully this will pick up the problem.
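
For illustration only (this is not the actual commit, just the general shape of such a guard; get_payload and handle_payload are stand-ins for the real receive and handler calls, and the log message mirrors the CRITICAL line that shows up later in this thread):

import logging

log = logging.getLogger(__name__)

def tune_in(get_payload, handle_payload):
    '''
    Sketch of a guarded minion main loop.
    '''
    while True:
        try:
            handle_payload(get_payload())
        except Exception:
            # Log the full traceback at CRITICAL so a silent death at least
            # leaves evidence in /var/log/salt/minion.
            log.critical('A fault occured in the main minion loop',
                         exc_info=True)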

halberom commented 12 years ago

The only diffs in my minion configs (from default) are the master hostname, and logging = debug. I haven't played around with the sub_timeout setting or any others.

Crossing fingers that your minionswarm comes up with something useful; if the try/except blocks are an easy insert, I can also try them on our live servers.

thatch45 commented 12 years ago

6ec8c49238c8254429b09e6d3aabb0fb04301cf3 is the try/except commit. Don't take all of it, since those lines also include the new minion event manager, which I want to test more.

halberom commented 12 years ago

I've added the try/except statements, excluding the epoller.poll bits, to minion.py. I'll send you anything I get when a minion dies. Cheers!

thatch45 commented 12 years ago

OK, this assertion generally fires when two processes attempt to share a single zeroMQ infrastructure. In the case of Salt, that would mean the same socket object being used across a multiprocess.

The thing is, if you do that, the assertion happens every time. I have reviewed the code and I don't see anywhere that Salt could be doing this, so I am thinking it is still associated with the aforementioned zeromq bug. I will report this info to the zeromq guys and see if we can figure out the issue.
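
To illustrate the kind of misuse that assertion is meant to catch (a generic pyzmq sketch, not Salt code; the endpoint is arbitrary):

import multiprocessing
import zmq

# Anti-pattern: a context and socket created in the parent process...
ctx = zmq.Context()
sock = ctx.socket(zmq.PUB)
sock.connect('tcp://127.0.0.1:5555')

def child():
    # ...reused after a fork. ZeroMQ sockets are not fork-safe; this is the
    # sort of cross-process sharing that internal assertions such as the one
    # in mailbox.cpp guard against.
    sock.send(b'sent on a socket the parent created')

if __name__ == '__main__':
    multiprocessing.Process(target=child).start()
    # The safe pattern is for the child to build its own Context and socket
    # after the fork rather than inheriting the parent's.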

halberom commented 12 years ago

I had a minion die, and the logging showed nothing, which matches up with your reasoning. Here's hoping the zeromq issue can be identified. Thanks for all your time on this.

thatch45 commented 12 years ago

Ok, I am working with the ZeroMQ guys but they are also kind of stumped. So I am still looking and trying to reliably reproduce this

halberom commented 12 years ago

If nobody else is experiencing the issue, then it's probably not worth spending more time on it.

I'm still puzzled as to why there's no stacktrace dump; from what I understand of the code, pretty much everything has try/excepts around it.

gordonm commented 12 years ago

Still biting us here with 0.10.1 and zmq 2.2.0 on el6. I can reproduce the crash I sent @thatch45 last week, and we still get minions dropping offline on occasion (as seen using manage.down). I've not applied any patches to 0.10.1 however, so if there's something I can test please let me know.

halberom commented 12 years ago

The only way I can think of for there to be no stacktrace is if the parent thread is killed or dies. I've taken a look at things like lsof output, and aside from the occasional deleted file hanging around (e.g. /tmp/ffiz9iv7o) I can't see anything odd. I read up on daemonizing in Python, and some guides mention closing the parent's file descriptors when opening the child ones - no idea if that's at all applicable, but maybe if the parent opens connections (and fds) and those aren't closed off properly, it could cause issues later.

I'm probably just spewing garbage here; no real clue.
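
For reference, the classic Python daemonization pattern being described looks roughly like this (a generic sketch, not Salt's actual daemon code):

import os
import sys

def daemonize():
    # First fork: return control to the shell and detach the child.
    if os.fork() > 0:
        sys.exit(0)
    os.setsid()
    # Second fork: make sure the daemon can never reacquire a terminal.
    if os.fork() > 0:
        sys.exit(0)
    # Redirect the standard streams to /dev/null. Merely closing them would
    # leave any later write (a stray print, a flush) pointing at a dead file
    # descriptor (EBADF).
    devnull = open(os.devnull, 'r+')
    os.dup2(devnull.fileno(), sys.stdin.fileno())
    os.dup2(devnull.fileno(), sys.stdout.fileno())
    os.dup2(devnull.fileno(), sys.stderr.fileno())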

gordonm commented 12 years ago

@halberom, in the case of the crash I can reproduce, post-crash the original parent process is gone. Here's what the process tree looks like:

Pre-crash:

-salt-minion(5969)-+-{salt-minion}(5980)
                   `-{salt-minion}(5981)

Post-crash:

-salt-minion(6310)

This could just be a result of the code not cleaning up properly, but I can't tell for sure. In any case, the salt-minion process that gets left behind is unresponsive to any calls from the master.

thatch45 commented 12 years ago

Sorry guys, I have been swamped these last few days, but I am still hunting on this one. @gordonm, what you are seeing is that the main minion process dies and the process that hangs around is a Salt job. This job might actually hold the key, so I will look more closely here again.

thatch45 commented 12 years ago

@gordonm, I have been poring over the trace data and I think I have a good bearing on what is happening: there is a child process (something the master executes first) that sleeps for about 25 seconds; while it sleeps the other processes start, and once the sleep is over the first process attempts to connect to a multiprocess-shared zeromq socket. Can you tell me exactly what you are running? It looks like we have a zeromq race condition somewhere.

gordonm commented 12 years ago

@thatch45, I'll email you a copy of the module we are executing during that trace.

thatch45 commented 12 years ago

I have been looking very closely here. I think it is a conflict over access to occupied memory inside zeromq, but I am not sure yet; I will have some stuff for you to try out soon.

thatch45 commented 12 years ago

Thanks @gordonm. If I can reliably reproduce this I will have it killed very quickly.

thatch45 commented 12 years ago

FWIW, I have been running a few swarms in different situations for a few days now without any minion crashes. I am thinking that some of the recent changes and the fix to #1428 might have done the trick, but I am not counting my chickens yet.

thatch45 commented 12 years ago

Still unable to reproduce the crash on the latest code with a number of test cases, all involving 500 minions. I am not declaring victory yet, but it is looking good. Keep in mind that there have been a number of zeromq modifications. I am trying to get 0.10.2 out soon to see whether you all can still replicate the crashes.

mlsorensen commented 12 years ago

FWIW, I've been running salt 0.9.8 for a while (on about 60 boxes) with zeromq-2.1.9 and have had intermittent minion failures, 2 or 3 a day (Zabbix notifies me). I've had this since I started my eval (on 0.9.4?), and have just set cfengine to restart it when it fails, since I haven't had the time to troubleshoot and figured it would be fixed eventually :-) I currently only use salt for mass ad-hoc command execution. Just thought I'd mention it in case the impression is that nobody else is experiencing the issue.

thatch45 commented 12 years ago

Thanks @mlsorensen. This is a widely experienced issue, and we have been able to make it progressively better in recent releases.

marcinkuzminski commented 12 years ago

I had the same issue on 0.10.1, where a custom module crash caused the minion to become a zombie. After installing today's develop code, this issue is fixed for me. I had an always-reproducible example that is now fixed. Thanks!

thatch45 commented 12 years ago

W00t! hopefully 0.10.2 will be out soon

marcinkuzminski commented 12 years ago

During some more testing, I found another issue:

[19:45:26](/usr/lib64/python2.7/site-packages)
root@dell-xps~# salt '*' test.ping
192_168_1_38-a67d72825e114ff2b8fa38f276bbeef7: True
192_168_1_35-ed71a0d992d74099b81997f1d5dd5810: True

root@dell-xps~# ps aux | grep salt-minion | grep -v grep && cat /var/run/salt-minion.pid 
root     12443  1.0  0.2 475932 22772 ?        Sl   19:44   0:01 /usr/bin/python2.7 /usr/bin/salt-minion --daemon --log-level=warning
12443

## now i run my breaking task...
root@dell-xps~# salt '*' appzonaut.manage_keys "{}"
{'192_168_1_38-a67d72825e114ff2b8fa38f276bbeef7': '"appzonaut.manage_keys" is not available.'}
{'192_168_1_35-ed71a0d992d74099b81997f1d5dd5810': 'Traceback ... <myexception>
[19:47:18](/usr/lib64/python2.7/site-packages)

## in logs
19:47:18,884 [salt.minion          ][WARNING ] The minion function caused an exception: Traceback (most recent call last):
<myexception>

19:47:18,888 [salt.minion          ][CRITICAL] A fault occured in the main minion loop [Errno 9] Bad file descriptor

root@dell-xps~# ps aux | grep salt-minion | grep -v grep && cat /var/run/salt-minion.pid 
root     12443  0.8  0.2 475932 22772 ?        Sl   19:44   0:01 /usr/bin/python2.7 /usr/bin/salt-minion --daemon --log-level=warning
root     12658  0.8  0.3 496164 25144 ?        Sl   19:47   0:00 /usr/bin/python2.7 /usr/bin/salt-minion --daemon --log-level=warning <-- there's a new pid
12443 # 

root@dell-xps~# /etc/init.d/salt-minion restart
 * Stopping SALT minion daemon ...
 * start-stop-daemon: no matching processes found   # <-- missing process now?   [ ok ]
 * Starting SALT minion daemon ... 

root@dell-xps~# salt '*' test.ping
192_168_1_38-a67d72825e114ff2b8fa38f276bbeef7: True

# Now after the restart it doesn't see the minion, because a new process was created.

Let me know if there's anything else I can help with.

marcinkuzminski commented 12 years ago

While we're on this issue, I would really like to see some kind of daemon lock that prevents running multiple instances; if someone wants to run minions on different ports, then maybe a flag like --new-instance could be used (a rough sketch of such a lock is below).

I've stumbled across various issues when some of our devs accidentally ran salt-minion -d multiple times on a machine. You end up with multiple running minion instances, and the master gets confused.
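
A pidfile-plus-flock guard of the kind being suggested might look roughly like this (a generic sketch assuming the stock pid file path, not Salt code):

import fcntl
import os
import sys

def acquire_pidfile_lock(path='/var/run/salt-minion.pid'):
    '''
    Take an exclusive, non-blocking lock on the pid file and write our pid.
    A second instance fails to get the lock and exits instead of starting.
    '''
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except IOError:
        sys.stderr.write('salt-minion already running, refusing to start\n')
        sys.exit(1)
    os.ftruncate(fd, 0)
    os.write(fd, str(os.getpid()).encode())
    # Keep the fd open for the life of the daemon so the lock is held.
    return fd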

mlsorensen commented 12 years ago

So I think I got lost somewhere along the way, but is the fix going to require a zeromq update, or is it purely a fix on the salt side?

thatch45 commented 12 years ago

@marcinkuzminski, thanks for the data. Can you please file another issue for the multiple minions bug? The problem is really that there is an id conflict, but please make another issue so I remember to fix it

thatch45 commented 12 years ago

@mlsorensen, it is best to use the latest zeromq, but this fix is a salt update.

thatch45 commented 12 years ago

@marcinkuzminski Can you test out the latest push to see if you can still reproduce this?

marcinkuzminski commented 12 years ago

@thatch45 yep, it seems to be fixed when an exception happens inside a module! I found one odd issue: if you put a print statement inside a module function, a new process is created again (but the old one dies), which leads to the same issue.

thatch45 commented 12 years ago

@marcinkuzminski this sounds odd; can you send some code over that does this? Prints never give me any problems...

marcinkuzminski commented 12 years ago

Here is the module I upload to the minions and execute; it's a new file copied to /salt/modules:

'''
Module for running <module> specific commands
'''

import traceback
import logging

# Load in default options for the module
__opts__ = {

}
# Load the outputters for the module
__outputter__ = {
    'test': 'json',
    'manage_keys': 'text'
}

# Set up logging
log = logging.getLogger(__name__)

def test():
    '''
    Test appzonaut command

    CLI Example::

        salt '*' <module>.test
    '''
    return 'hello'

def manage_keys(param):
    '''
    Minimal reproducer - the bare print below is what kills the daemonized
    minion.
    '''
    import json
    log.debug('Got json data %r' % param)
    print 'now i die !'

thatch45 commented 12 years ago

thanks! I will try to reproduce this

thatch45 commented 12 years ago

Yes, print is clearly killing the minion, and only when it is running in daemon mode!

thatch45 commented 12 years ago

Found the stack trace: when we daemonize, there is an issue with multiprocessing flushing the sys.stdout object. I have a patch that catches the error and prevents the minion from failing, but the exception is still being thrown. I will push the catch up so you can double-check:

  File "/usr/lib/python2.7/site-packages/salt/minion.py", line 520, in tune_in
    self._handle_payload(payload)
  File "/usr/lib/python2.7/site-packages/salt/minion.py", line 174, in _handle_payload
    'clear': self._handle_clear}[payload['enc']](payload['load'])
  File "/usr/lib/python2.7/site-packages/salt/minion.py", line 211, in _handle_aes
    self._handle_decoded_payload(data)
  File "/usr/lib/python2.7/site-packages/salt/minion.py", line 241, in _handle_decoded_payload
    target=lambda: self._thread_return(data)
  File "/usr/lib/python2.7/multiprocessing/process.py", line 130, in start
    self._popen = Popen(self)
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 126, in __init__
    sys.stdout.flush()
IOError: [Errno 9] Bad file descriptor

thatch45 commented 12 years ago

OK, I updated with the code that kills the problem for me; please give it a shot.
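
For anyone hitting this before upgrading, a sketch of a guard in the spirit of that change (not the actual commit; start_job and its arguments are stand-ins for how the minion forks a job). As the traceback above shows, multiprocessing's forking code calls sys.stdout.flush() when starting the job process, and if stdout is no longer a valid descriptor after daemonizing, that raises IOError: [Errno 9] Bad file descriptor.

import logging
import multiprocessing

log = logging.getLogger(__name__)

def start_job(target, data):
    '''
    Start a job process, tolerating the stdout flush failure seen above.
    '''
    proc = multiprocessing.Process(target=target, args=(data,))
    try:
        proc.start()
    except IOError as exc:
        # Errno 9 (EBADF) comes from multiprocessing flushing a stdout that
        # went away when the minion daemonized; log it instead of letting
        # the whole minion die.
        log.critical('Failed to fire job process: %s', exc)
    return proc

An alternative (sketched earlier in this thread) is to point the standard streams at /dev/null when daemonizing, so the flush always has a valid descriptor to write to.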