saltstack / salt

Software to automate the management and configuration of any infrastructure or application at scale. Get access to the Salt software package repository here:
https://repo.saltproject.io/
Apache License 2.0
14k stars 5.47k forks source link

Timeouts (Minion did not return) since 2014.1.7 #14343

Closed Whissi closed 8 years ago

Whissi commented 9 years ago

Hi,

when there was no Salt activity for ~30-60 minutes, the next Salt command (for example salt '*' --show-timeout cmd.run 'uname -a' will timeout for most minions.

I need to run this command 2-3x times... after 1-2 minutes all minions are back and now I can run my desired command without any timeouts.

This didn't happen with 2014.1.5.

# salt --versions-report
           Salt: 2014.1.7
         Python: 2.7.7 (default, Jul  6 2014, 23:14:10)
         Jinja2: 2.7.3
       M2Crypto: 0.22
 msgpack-python: 0.4.2
   msgpack-pure: Not Installed
       pycrypto: 2.6.1
         PyYAML: 3.11
          PyZMQ: 14.3.1
            ZMQ: 4.0.4
basepi commented 9 years ago

Very strange! Thanks for the report, we'll investigate this.

basepi commented 9 years ago

When you run the commands, does the job show up in the minion log? Wondering if it's a loss of connection, or just a problem with the return data, or what...

sigwinch28 commented 9 years ago

Experiencing the same issue. CentOS 7.

# salt --versions-report
           Salt: 2014.1.7
         Python: 2.7.5 (default, Jun 17 2014, 18:11:42)
         Jinja2: 2.7.2
       M2Crypto: 0.21.1
 msgpack-python: 0.1.13
   msgpack-pure: Not Installed
       pycrypto: 2.6.1
         PyYAML: 3.10
          PyZMQ: 14.3.1
            ZMQ: 3.2.4

No indications in minion logs apart from:

SaltReqTimeoutError: Waited 60 seconds
basepi commented 9 years ago

@JoeHazzers Did you also just start seeing it on 2014.1.7, or have you seen it previously?

Whissi commented 9 years ago

It is not easy to reproduce the problem. Currently it is gone:

After posting this issue I enabled debug mode for master and 2 minions.

Also, we rebooted every server in our network over the weekend (kernel upgrade).

Since we rebooted every system we haven't seen this problem anymore. So it looks like a missing reboot or something like that. Or maybe running in debug mode doesn't trigger the bug?

basepi commented 9 years ago

I doubt debug log level affects this bug, but who knows? Keep us posted if you find a way to reproduce this issue, or if you get those logs that I mentioned.

sigwinch28 commented 9 years ago

@basepi I have only seen this issue on 2014.1.7 However, like @Whissi after leaving it for a day or two and a system reboot, it now works fine. I can no longer reproduce the issue.

basepi commented 9 years ago

Very strange. I'll keep this open for now, keep us posted!

Whissi commented 9 years ago

Now it happened again. No software was updated so I am 100% sure that there was no pending reboot.

But as said, debug log level was turned off. So nothing I can show you. I'll turn on logging again, maybe this will only happen when the master was running for some hours...

basepi commented 9 years ago

Curses. =) It happens. If you don't mind keeping debug logging on for a time, that would be helpful.

Whissi commented 9 years ago

OK, here we go.I was able to reproduce the issue in a reduced test network:

Seems to be an authentication problem (Failed to authenticate message) which will happen sometimes.

# salt '*' --show-timeout cmd.run 'date'
srv5.example.intern:
    Tue Jul 22 16:39:17 CEST 2014
srv4.example.intern:
    Tue Jul 22 16:39:17 CEST 2014
srv2.example.intern:
    Tue Jul 22 16:39:17 CEST 2014
srv7.example.intern:
    Tue Jul 22 16:39:17 CEST 2014
srv6.example.intern:
    Tue Jul 22 16:39:17 CEST 2014
srv3.example.intern:
    Minion did not return
srv8.example.intern:
    Minion did not return

master debug log:

2014-07-22 16:15:52,608 [salt.fileserver                             ][DEBUG   ] Updating fileserver cache
2014-07-22 16:15:52,611 [salt.fileserver                             ][DEBUG   ] diff_mtime_map: the maps are the same
2014-07-22 16:15:52,611 [salt.utils.verify                           ][DEBUG   ] This salt-master instance has accepted 7 minion keys.
2014-07-22 16:16:52,677 [salt.fileserver                             ][DEBUG   ] Updating fileserver cache
2014-07-22 16:16:52,680 [salt.fileserver                             ][DEBUG   ] diff_mtime_map: the maps are the same
2014-07-22 16:16:52,680 [salt.utils.verify                           ][DEBUG   ] This salt-master instance has accepted 7 minion keys.
2014-07-22 16:17:52,746 [salt.fileserver                             ][DEBUG   ] Updating fileserver cache
2014-07-22 16:17:52,749 [salt.fileserver                             ][DEBUG   ] diff_mtime_map: the maps are the same
2014-07-22 16:17:52,749 [salt.utils.verify                           ][DEBUG   ] This salt-master instance has accepted 7 minion keys.
2014-07-22 16:18:52,815 [salt.fileserver                             ][DEBUG   ] Updating fileserver cache
2014-07-22 16:18:52,817 [salt.fileserver                             ][DEBUG   ] diff_mtime_map: the maps are the same
2014-07-22 16:18:52,818 [salt.utils.verify                           ][DEBUG   ] This salt-master instance has accepted 7 minion keys.
2014-07-22 16:19:52,883 [salt.fileserver                             ][DEBUG   ] Updating fileserver cache
2014-07-22 16:19:52,885 [salt.fileserver                             ][DEBUG   ] diff_mtime_map: the maps are the same
2014-07-22 16:19:52,886 [salt.utils.verify                           ][DEBUG   ] This salt-master instance has accepted 7 minion keys.
2014-07-22 16:20:05,690 [salt.master                                 ][INFO    ] Clear payload received with command _auth
2014-07-22 16:20:05,690 [salt.master                                 ][INFO    ] Authentication request from srv4.example.intern
2014-07-22 16:20:05,690 [salt.master                                 ][INFO    ] Authentication accepted from srv4.example.intern
2014-07-22 16:20:05,907 [salt.master                                 ][INFO    ] Clear payload received with command _auth
2014-07-22 16:20:05,907 [salt.master                                 ][INFO    ] Authentication request from srv4.example.intern
2014-07-22 16:20:05,907 [salt.master                                 ][INFO    ] Authentication accepted from srv4.example.intern
2014-07-22 16:20:06,096 [salt.master                                 ][INFO    ] AES payload received with command _mine
2014-07-22 16:20:08,293 [salt.master                                 ][INFO    ] Clear payload received with command _auth
2014-07-22 16:20:08,293 [salt.master                                 ][INFO    ] Authentication request from srv7.example.intern
2014-07-22 16:20:08,293 [salt.master                                 ][INFO    ] Authentication accepted from srv7.example.intern
2014-07-22 16:20:08,617 [salt.master                                 ][INFO    ] Clear payload received with command _auth
2014-07-22 16:20:08,617 [salt.master                                 ][INFO    ] Authentication request from srv7.example.intern
2014-07-22 16:20:08,618 [salt.master                                 ][INFO    ] Authentication accepted from srv7.example.intern
2014-07-22 16:20:08,877 [salt.master                                 ][INFO    ] AES payload received with command _mine
2014-07-22 16:20:52,944 [salt.fileserver                             ][DEBUG   ] Updating fileserver cache
2014-07-22 16:20:52,947 [salt.fileserver                             ][DEBUG   ] diff_mtime_map: the maps are the same
2014-07-22 16:20:52,947 [salt.utils.verify                           ][DEBUG   ] This salt-master instance has accepted 7 minion keys.
2014-07-22 16:21:53,004 [salt.fileserver                             ][DEBUG   ] Updating fileserver cache
2014-07-22 16:21:53,007 [salt.fileserver                             ][DEBUG   ] diff_mtime_map: the maps are the same
2014-07-22 16:21:53,007 [salt.utils.verify                           ][DEBUG   ] This salt-master instance has accepted 7 minion keys.
2014-07-22 16:22:53,073 [salt.fileserver                             ][DEBUG   ] Updating fileserver cache
2014-07-22 16:22:53,075 [salt.fileserver                             ][DEBUG   ] diff_mtime_map: the maps are the same
2014-07-22 16:22:53,076 [salt.utils.verify                           ][DEBUG   ] This salt-master instance has accepted 7 minion keys.
2014-07-22 16:23:53,141 [salt.fileserver                             ][DEBUG   ] Updating fileserver cache
2014-07-22 16:23:53,144 [salt.fileserver                             ][DEBUG   ] diff_mtime_map: the maps are the same
2014-07-22 16:23:53,144 [salt.utils.verify                           ][DEBUG   ] This salt-master instance has accepted 7 minion keys.
2014-07-22 16:24:53,209 [salt.fileserver                             ][DEBUG   ] Updating fileserver cache
2014-07-22 16:24:53,212 [salt.fileserver                             ][DEBUG   ] diff_mtime_map: the maps are the same
2014-07-22 16:24:53,213 [salt.utils.verify                           ][DEBUG   ] This salt-master instance has accepted 7 minion keys.
2014-07-22 16:25:53,278 [salt.fileserver                             ][DEBUG   ] Updating fileserver cache
2014-07-22 16:25:53,281 [salt.fileserver                             ][DEBUG   ] diff_mtime_map: the maps are the same
2014-07-22 16:25:53,281 [salt.utils.verify                           ][DEBUG   ] This salt-master instance has accepted 7 minion keys.
2014-07-22 16:26:53,347 [salt.fileserver                             ][DEBUG   ] Updating fileserver cache
2014-07-22 16:26:53,350 [salt.fileserver                             ][DEBUG   ] diff_mtime_map: the maps are the same
2014-07-22 16:26:53,350 [salt.utils.verify                           ][DEBUG   ] This salt-master instance has accepted 7 minion keys.
2014-07-22 16:27:53,415 [salt.fileserver                             ][DEBUG   ] Updating fileserver cache
2014-07-22 16:27:53,418 [salt.fileserver                             ][DEBUG   ] diff_mtime_map: the maps are the same
2014-07-22 16:27:53,418 [salt.utils.verify                           ][DEBUG   ] This salt-master instance has accepted 7 minion keys.
2014-07-22 16:28:53,484 [salt.fileserver                             ][DEBUG   ] Updating fileserver cache
2014-07-22 16:28:53,487 [salt.fileserver                             ][DEBUG   ] diff_mtime_map: the maps are the same
2014-07-22 16:28:53,487 [salt.utils.verify                           ][DEBUG   ] This salt-master instance has accepted 7 minion keys.
2014-07-22 16:29:53,553 [salt.fileserver                             ][DEBUG   ] Updating fileserver cache
2014-07-22 16:29:53,556 [salt.fileserver                             ][DEBUG   ] diff_mtime_map: the maps are the same
2014-07-22 16:29:53,556 [salt.utils.verify                           ][DEBUG   ] This salt-master instance has accepted 7 minion keys.
2014-07-22 16:30:08,595 [salt.master                                 ][INFO    ] Clear payload received with command _auth
2014-07-22 16:30:08,595 [salt.master                                 ][INFO    ] Authentication request from srv6.example.intern
2014-07-22 16:30:08,596 [salt.master                                 ][INFO    ] Authentication accepted from srv6.example.intern
2014-07-22 16:30:08,812 [salt.master                                 ][INFO    ] Clear payload received with command _auth
2014-07-22 16:30:08,812 [salt.master                                 ][INFO    ] Authentication request from srv6.example.intern
2014-07-22 16:30:08,813 [salt.master                                 ][INFO    ] Authentication accepted from srv6.example.intern
2014-07-22 16:30:09,001 [salt.master                                 ][INFO    ] AES payload received with command _mine
2014-07-22 16:30:44,313 [salt.master                                 ][INFO    ] Clear payload received with command _auth
2014-07-22 16:30:44,313 [salt.master                                 ][INFO    ] Authentication request from srv5.example.intern
2014-07-22 16:30:44,313 [salt.master                                 ][INFO    ] Authentication accepted from srv5.example.intern
2014-07-22 16:30:44,529 [salt.master                                 ][INFO    ] Clear payload received with command _auth
2014-07-22 16:30:44,529 [salt.master                                 ][INFO    ] Authentication request from srv5.example.intern
2014-07-22 16:30:44,529 [salt.master                                 ][INFO    ] Authentication accepted from srv5.example.intern
2014-07-22 16:30:44,718 [salt.master                                 ][INFO    ] AES payload received with command _mine
2014-07-22 16:30:53,622 [salt.fileserver                             ][DEBUG   ] Updating fileserver cache
2014-07-22 16:30:53,625 [salt.fileserver                             ][DEBUG   ] diff_mtime_map: the maps are the same
2014-07-22 16:30:53,625 [salt.utils.verify                           ][DEBUG   ] This salt-master instance has accepted 7 minion keys.
2014-07-22 16:31:48,021 [salt.master                                 ][INFO    ] Clear payload received with command _auth
2014-07-22 16:31:48,021 [salt.master                                 ][INFO    ] Authentication request from srv3.example.intern
2014-07-22 16:31:48,022 [salt.master                                 ][INFO    ] Authentication accepted from srv3.example.intern
2014-07-22 16:31:48,333 [salt.master                                 ][INFO    ] Clear payload received with command _auth
2014-07-22 16:31:48,333 [salt.master                                 ][INFO    ] Authentication request from srv3.example.intern
2014-07-22 16:31:48,334 [salt.master                                 ][INFO    ] Authentication accepted from srv3.example.intern
2014-07-22 16:31:48,586 [salt.master                                 ][INFO    ] AES payload received with command _mine
2014-07-22 16:31:53,690 [salt.fileserver                             ][DEBUG   ] Updating fileserver cache
2014-07-22 16:31:53,693 [salt.fileserver                             ][DEBUG   ] diff_mtime_map: the maps are the same
2014-07-22 16:31:53,694 [salt.utils.verify                           ][DEBUG   ] This salt-master instance has accepted 7 minion keys.
2014-07-22 16:32:53,759 [salt.fileserver                             ][DEBUG   ] Updating fileserver cache
2014-07-22 16:32:53,762 [salt.fileserver                             ][DEBUG   ] diff_mtime_map: the maps are the same
2014-07-22 16:32:53,762 [salt.utils.verify                           ][DEBUG   ] This salt-master instance has accepted 7 minion keys.
2014-07-22 16:33:53,828 [salt.fileserver                             ][DEBUG   ] Updating fileserver cache
2014-07-22 16:33:53,830 [salt.fileserver                             ][DEBUG   ] diff_mtime_map: the maps are the same
2014-07-22 16:33:53,831 [salt.utils.verify                           ][DEBUG   ] This salt-master instance has accepted 7 minion keys.
2014-07-22 16:34:53,866 [salt.fileserver                             ][DEBUG   ] Updating fileserver cache
2014-07-22 16:34:53,869 [salt.fileserver                             ][DEBUG   ] diff_mtime_map: the maps are the same
2014-07-22 16:34:53,869 [salt.utils.verify                           ][DEBUG   ] This salt-master instance has accepted 7 minion keys.
2014-07-22 16:34:57,158 [salt.master                                 ][INFO    ] Clear payload received with command _auth
2014-07-22 16:34:57,158 [salt.master                                 ][INFO    ] Authentication request from srv2.example.intern
2014-07-22 16:34:57,159 [salt.master                                 ][INFO    ] Authentication accepted from srv2.example.intern
2014-07-22 16:34:57,374 [salt.master                                 ][INFO    ] Clear payload received with command _auth
2014-07-22 16:34:57,375 [salt.master                                 ][INFO    ] Authentication request from srv2.example.intern
2014-07-22 16:34:57,375 [salt.master                                 ][INFO    ] Authentication accepted from srv2.example.intern
2014-07-22 16:34:57,564 [salt.master                                 ][INFO    ] AES payload received with command _mine
2014-07-22 16:35:53,934 [salt.fileserver                             ][DEBUG   ] Updating fileserver cache
2014-07-22 16:35:53,937 [salt.fileserver                             ][DEBUG   ] diff_mtime_map: the maps are the same
2014-07-22 16:35:53,938 [salt.utils.verify                           ][DEBUG   ] This salt-master instance has accepted 7 minion keys.
2014-07-22 16:36:54,003 [salt.fileserver                             ][DEBUG   ] Updating fileserver cache
2014-07-22 16:36:54,006 [salt.fileserver                             ][DEBUG   ] diff_mtime_map: the maps are the same
2014-07-22 16:36:54,006 [salt.utils.verify                           ][DEBUG   ] This salt-master instance has accepted 7 minion keys.
2014-07-22 16:37:54,072 [salt.fileserver                             ][DEBUG   ] Updating fileserver cache
2014-07-22 16:37:54,074 [salt.fileserver                             ][DEBUG   ] diff_mtime_map: the maps are the same
2014-07-22 16:37:54,075 [salt.utils.verify                           ][DEBUG   ] This salt-master instance has accepted 7 minion keys.
2014-07-22 16:38:54,138 [salt.fileserver                             ][DEBUG   ] Updating fileserver cache
2014-07-22 16:38:54,141 [salt.fileserver                             ][DEBUG   ] diff_mtime_map: the maps are the same
2014-07-22 16:38:54,142 [salt.utils.verify                           ][DEBUG   ] This salt-master instance has accepted 7 minion keys.
2014-07-22 16:39:17,125 [salt.master                                 ][INFO    ] Clear payload received with command publish
2014-07-22 16:39:17,126 [salt.master                                 ][INFO    ] User root Published command cmd.run with jid 20140722163917125243
2014-07-22 16:39:17,126 [salt.master                                 ][DEBUG   ] Published command details {'tgt_type': 'glob', 'jid': '20140722163917125243', 'tgt': '*', 'ret': '', 'user': 'root', 'arg': ['date'], 'fun': 'cmd.run'}
2014-07-22 16:39:17,143 [salt.master                                 ][INFO    ] AES payload received with command _return
2014-07-22 16:39:17,144 [salt.master                                 ][INFO    ] Got return from srv5.example.intern for job 20140722163917125243
2014-07-22 16:39:17,145 [salt.master                                 ][INFO    ] AES payload received with command _return
2014-07-22 16:39:17,145 [salt.master                                 ][INFO    ] Got return from srv4.example.intern for job 20140722163917125243
2014-07-22 16:39:17,146 [salt.master                                 ][INFO    ] AES payload received with command _return
2014-07-22 16:39:17,146 [salt.master                                 ][INFO    ] Got return from srv2.example.intern for job 20140722163917125243
2014-07-22 16:39:17,169 [salt.master                                 ][INFO    ] AES payload received with command _return
2014-07-22 16:39:17,169 [salt.master                                 ][INFO    ] Got return from srv7.example.intern for job 20140722163917125243
2014-07-22 16:39:17,170 [salt.master                                 ][INFO    ] AES payload received with command _return
2014-07-22 16:39:17,170 [salt.master                                 ][INFO    ] Got return from srv6.example.intern for job 20140722163917125243
2014-07-22 16:39:22,135 [salt.master                                 ][INFO    ] Clear payload received with command publish
2014-07-22 16:39:22,136 [salt.master                                 ][INFO    ] User root Published command saltutil.find_job with jid 20140722163922135841
2014-07-22 16:39:22,137 [salt.master                                 ][DEBUG   ] Published command details {'tgt_type': 'glob', 'jid': '20140722163922135841', 'tgt': '*', 'ret': '', 'user': 'root', 'arg': ['20140722163917125243'], 'fun': 'saltutil.find_job'}
2014-07-22 16:39:22,144 [salt.master                                 ][INFO    ] AES payload received with command _return
2014-07-22 16:39:22,144 [salt.master                                 ][INFO    ] AES payload received with command _return
2014-07-22 16:39:22,145 [salt.master                                 ][INFO    ] Got return from srv2.example.intern for job 20140722163922135841
2014-07-22 16:39:22,145 [salt.master                                 ][INFO    ] Got return from srv5.example.intern for job 20140722163922135841
2014-07-22 16:39:22,145 [salt.master                                 ][INFO    ] AES payload received with command _return
2014-07-22 16:39:22,145 [salt.master                                 ][INFO    ] Got return from srv4.example.intern for job 20140722163922135841
2014-07-22 16:39:22,145 [salt.master                                 ][INFO    ] AES payload received with command _return
2014-07-22 16:39:22,146 [salt.master                                 ][INFO    ] Got return from srv6.example.intern for job 20140722163922135841
2014-07-22 16:39:22,162 [salt.master                                 ][INFO    ] AES payload received with command _return
2014-07-22 16:39:22,162 [salt.master                                 ][INFO    ] Got return from srv7.example.intern for job 20140722163922135841
2014-07-22 16:39:27,200 [salt.master                                 ][INFO    ] Clear payload received with command _auth
2014-07-22 16:39:27,200 [salt.master                                 ][INFO    ] Authentication request from srv3.example.intern
2014-07-22 16:39:27,201 [salt.master                                 ][INFO    ] Authentication accepted from srv3.example.intern
2014-07-22 16:39:27,481 [salt.master                                 ][INFO    ] AES payload received with command _return
2014-07-22 16:39:27,481 [salt.master                                 ][INFO    ] Got return from srv3.example.intern for job 20140722163917125243
2014-07-22 16:39:27,483 [salt.master                                 ][INFO    ] AES payload received with command _return
2014-07-22 16:39:27,483 [salt.master                                 ][INFO    ] Got return from srv3.example.intern for job 20140722163922135841
2014-07-22 16:39:54,179 [salt.fileserver                             ][DEBUG   ] Updating fileserver cache
2014-07-22 16:39:54,182 [salt.fileserver                             ][DEBUG   ] diff_mtime_map: the maps are the same
2014-07-22 16:39:54,182 [salt.utils.verify                           ][DEBUG   ] This salt-master instance has accepted 7 minion keys.
2014-07-22 16:40:02,253 [salt.master                                 ][INFO    ] Clear payload received with command _auth
2014-07-22 16:40:02,253 [salt.master                                 ][INFO    ] Authentication request from srv8.example.intern
2014-07-22 16:40:02,254 [salt.master                                 ][INFO    ] Authentication accepted from srv8.example.intern
2014-07-22 16:40:02,580 [salt.master                                 ][INFO    ] AES payload received with command _return
2014-07-22 16:40:02,580 [salt.master                                 ][INFO    ] Got return from srv8.example.intern for job 20140722163917125243
2014-07-22 16:40:02,584 [salt.master                                 ][INFO    ] AES payload received with command _return
2014-07-22 16:40:02,585 [salt.master                                 ][INFO    ] Got return from srv8.example.intern for job 20140722163922135841
2014-07-22 16:40:54,248 [salt.fileserver                             ][DEBUG   ] Updating fileserver cache
2014-07-22 16:40:54,251 [salt.fileserver                             ][DEBUG   ] diff_mtime_map: the maps are the same
2014-07-22 16:40:54,251 [salt.utils.verify                           ][DEBUG   ] This salt-master instance has accepted 7 minion keys.
2014-07-22 16:41:54,317 [salt.fileserver                             ][DEBUG   ] Updating fileserver cache
2014-07-22 16:41:54,320 [salt.fileserver                             ][DEBUG   ] diff_mtime_map: the maps are the same
2014-07-22 16:41:54,320 [salt.utils.verify                           ][DEBUG   ] This salt-master instance has accepted 7 minion keys.
2014-07-22 16:42:54,386 [salt.fileserver                             ][DEBUG   ] Updating fileserver cache
2014-07-22 16:42:54,389 [salt.fileserver                             ][DEBUG   ] diff_mtime_map: the maps are the same
2014-07-22 16:42:54,389 [salt.utils.verify                           ][DEBUG   ] This salt-master instance has accepted 7 minion keys.
2014-07-22 16:43:54,454 [salt.fileserver                             ][DEBUG   ] Updating fileserver cache
2014-07-22 16:43:54,457 [salt.fileserver                             ][DEBUG   ] diff_mtime_map: the maps are the same
2014-07-22 16:43:54,458 [salt.utils.verify                           ][DEBUG   ] This salt-master instance has accepted 7 minion keys.

srv3 minion debug log:

2014-07-22 14:31:42,426 [salt             ][INFO    ] Setting up the Salt Minion "srv3.example.intern"
2014-07-22 14:31:42,431 [salt.utils.process][DEBUG   ] Created pidfile: /run/salt-minion.pid
2014-07-22 14:31:42,432 [salt.config      ][DEBUG   ] Reading configuration from /etc/salt/minion
2014-07-22 14:31:42,509 [salt.config      ][DEBUG   ] Including configuration from '/etc/salt/minion.d/myMinion.conf'
2014-07-22 14:31:42,509 [salt.config      ][DEBUG   ] Reading configuration from /etc/salt/minion.d/myMinion.conf
2014-07-22 14:31:42,783 [salt.minion                              ][DEBUG   ] Attempting to authenticate with the Salt Master at 10.82.0.1
2014-07-22 14:31:42,792 [salt.crypt                               ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-22 14:31:42,851 [salt.crypt                               ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-22 14:31:42,994 [salt.crypt                               ][DEBUG   ] Decrypting the current master AES key
2014-07-22 14:31:42,994 [salt.crypt                               ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-22 14:31:43,109 [salt.minion                              ][INFO    ] Authentication with master successful!
2014-07-22 14:31:43,110 [salt.crypt                               ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-22 14:31:43,249 [salt.crypt                               ][DEBUG   ] Decrypting the current master AES key
2014-07-22 14:31:43,249 [salt.crypt                               ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-22 14:31:43,384 [salt.crypt                               ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-22 14:31:43,443 [salt.config                              ][DEBUG   ] Reading configuration from /etc/salt/minion
2014-07-22 14:31:43,521 [salt.config                              ][DEBUG   ] Including configuration from '/etc/salt/minion.d/myMinion.conf'
2014-07-22 14:31:43,521 [salt.config                              ][DEBUG   ] Reading configuration from /etc/salt/minion.d/myMinion.conf
2014-07-22 14:31:47,494 [salt.loader                              ][DEBUG   ] Loaded localemod as virtual locale
2014-07-22 14:31:47,495 [salt.loader                              ][DEBUG   ] Loaded groupadd as virtual group
2014-07-22 14:31:47,497 [salt.loader                              ][DEBUG   ] Loaded ebuild as virtual pkg
2014-07-22 14:31:47,501 [salt.loader                              ][DEBUG   ] Loaded parted as virtual partition
2014-07-22 14:31:47,503 [salt.loader                              ][DEBUG   ] Loaded linux_sysctl as virtual sysctl
2014-07-22 14:31:47,523 [salt.loader                              ][DEBUG   ] Loaded sysmod as virtual sys
2014-07-22 14:31:47,530 [salt.loader                              ][DEBUG   ] Loaded zcbuildout as virtual buildout
2014-07-22 14:31:47,532 [salt.loader                              ][DEBUG   ] Loaded djangomod as virtual django
2014-07-22 14:31:47,533 [salt.loader                              ][DEBUG   ] Loaded useradd as virtual user
2014-07-22 14:31:47,535 [salt.loader                              ][DEBUG   ] Loaded gentoo_service as virtual service
2014-07-22 14:31:47,548 [salt.loader                              ][DEBUG   ] Loaded virtualenv_mod as virtual virtualenv
2014-07-22 14:31:47,550 [salt.loader                              ][DEBUG   ] Loaded linux_acl as virtual acl
2014-07-22 14:31:47,551 [salt.loader                              ][DEBUG   ] Loaded cmdmod as virtual cmd
2014-07-22 14:31:47,562 [salt.loader                              ][DEBUG   ] Loaded gentoolkitmod as virtual gentoolkit
2014-07-22 14:31:47,645 [salt.loader                                 ][DEBUG   ] Loaded syslog_return as virtual syslog
2014-07-22 14:31:47,646 [salt.loader                                 ][DEBUG   ] Loaded couchdb_return as virtual couchdb
2014-07-22 14:31:47,646 [salt.loader                                 ][DEBUG   ] Loaded carbon_return as virtual carbon
2014-07-22 14:31:47,647 [salt.minion                                 ][DEBUG   ] I am srv3.example.intern and I am not supposed to start any proxies.
2014-07-22 14:31:47,647 [salt.minion                                 ][INFO    ] Minion is starting as user 'root'
2014-07-22 14:31:47,647 [salt.minion                                 ][DEBUG   ] Minion 'srv3.example.intern' trying to tune in
2014-07-22 14:31:47,648 [salt.minion                                 ][DEBUG   ] Minion PUB socket URI: ipc:///var/run/salt/minion/minion_event_a9a4e31ca4566be0615108ef7b1891be_pub.ipc
2014-07-22 14:31:47,648 [salt.minion                                 ][DEBUG   ] Minion PULL socket URI: ipc:///var/run/salt/minion/minion_event_a9a4e31ca4566be0615108ef7b1891be_pull.ipc
2014-07-22 14:31:47,649 [salt.minion                                 ][DEBUG   ] Generated random reconnect delay between '1000ms' and '11000ms' (7372)
2014-07-22 14:31:47,649 [salt.minion                                 ][DEBUG   ] Setting zmq_reconnect_ivl to '7372ms'
2014-07-22 14:31:47,649 [salt.minion                                 ][DEBUG   ] Setting zmq_reconnect_ivl_max to '11000ms'
2014-07-22 14:31:48,157 [salt.utils.schedule                         ][DEBUG   ] Running scheduled job: __mine_interval
2014-07-22 14:31:48,157 [salt.utils.schedule                         ][DEBUG   ] schedule: This job was scheduled with jid_include, adding to cache (jid_include defaults to True)
2014-07-22 14:31:48,158 [salt.utils.schedule                         ][DEBUG   ] schedule: This job was scheduled with a max number of 2
2014-07-22 14:31:48,175 [salt.utils.schedule                         ][DEBUG   ] schedule.handle_func: adding this job to the jobcache with data {'fun': 'mine.update', 'jid': '20140722143148168090', 'pid': 4603, 'id': 'srv3.example.intern'}
2014-07-22 14:31:48,178 [salt.crypt                                  ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-22 14:31:48,318 [salt.crypt                                  ][DEBUG   ] Decrypting the current master AES key
2014-07-22 14:31:48,319 [salt.crypt                                  ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-22 14:31:48,435 [salt.crypt                                  ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-22 14:31:48,493 [salt.crypt                                  ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-22 14:31:48,634 [salt.crypt                                  ][DEBUG   ] Decrypting the current master AES key
2014-07-22 14:31:48,634 [salt.crypt                                  ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-22 14:33:36,976 [salt.minion                                 ][INFO    ] User root Executing command cmd.run with jid 20140722143336972835
2014-07-22 14:33:36,976 [salt.minion                                 ][DEBUG   ] Command details {'tgt_type': 'glob', 'jid': '20140722143336972835', 'tgt': '*', 'ret': '', 'user': 'root', 'arg': ['grep mgadrmfb /var/log/dmesg'], 'fun': 'cmd.run'}
2014-07-22 14:33:36,996 [salt.loaded.int.module.cmdmod               ][INFO    ] Executing command 'grep mgadrmfb /var/log/dmesg' in directory '/'
2014-07-22 14:33:37,007 [salt.loaded.int.module.cmdmod               ][INFO    ] output: [    4.376982] fbcon: mgadrmfb (fb0) is primary device
[    4.695993] mgag200 0000:03:00.0: fb0: mgadrmfb frame buffer device
2014-07-22 14:33:37,007 [salt.minion                                 ][INFO    ] Returning information for job: 20140722143336972835
2014-07-22 15:31:48,932 [salt.utils.schedule                         ][DEBUG   ] Running scheduled job: __mine_interval
2014-07-22 15:31:48,932 [salt.utils.schedule                         ][DEBUG   ] schedule: This job was scheduled with jid_include, adding to cache (jid_include defaults to True)
2014-07-22 15:31:48,933 [salt.utils.schedule                         ][DEBUG   ] schedule: This job was scheduled with a max number of 2
2014-07-22 15:31:48,945 [salt.utils.schedule                         ][DEBUG   ] schedule.handle_func: adding this job to the jobcache with data {'fun': 'mine.update', 'jid': '20140722153148937826', 'pid': 5110, 'id': 'srv3.example.intern'}
2014-07-22 15:31:48,948 [salt.crypt                                  ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-22 15:31:49,107 [salt.crypt                                  ][DEBUG   ] Decrypting the current master AES key
2014-07-22 15:31:49,107 [salt.crypt                                  ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-22 15:31:49,223 [salt.crypt                                  ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-22 15:31:49,282 [salt.crypt                                  ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-22 15:31:49,420 [salt.crypt                                  ][DEBUG   ] Decrypting the current master AES key
2014-07-22 15:31:49,420 [salt.crypt                                  ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-22 16:31:48,002 [salt.utils.schedule                         ][DEBUG   ] Running scheduled job: __mine_interval
2014-07-22 16:31:48,003 [salt.utils.schedule                         ][DEBUG   ] schedule: This job was scheduled with jid_include, adding to cache (jid_include defaults to True)
2014-07-22 16:31:48,003 [salt.utils.schedule                         ][DEBUG   ] schedule: This job was scheduled with a max number of 2
2014-07-22 16:31:48,015 [salt.utils.schedule                         ][DEBUG   ] schedule.handle_func: adding this job to the jobcache with data {'fun': 'mine.update', 'jid': '20140722163148008141', 'pid': 5382, 'id': 'srv3.example.intern'}
2014-07-22 16:31:48,018 [salt.crypt                                  ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-22 16:31:48,157 [salt.crypt                                  ][DEBUG   ] Decrypting the current master AES key
2014-07-22 16:31:48,158 [salt.crypt                                  ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-22 16:31:48,273 [salt.crypt                                  ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-22 16:31:48,332 [salt.crypt                                  ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-22 16:31:48,469 [salt.crypt                                  ][DEBUG   ] Decrypting the current master AES key
2014-07-22 16:31:48,470 [salt.crypt                                  ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-22 16:39:17,128 [salt.crypt                                  ][DEBUG   ] Failed to authenticate message
2014-07-22 16:39:17,129 [salt.minion                                 ][DEBUG   ] Waiting 10 seconds to re-authenticate
2014-07-22 16:39:27,139 [salt.minion                                 ][DEBUG   ] Attempting to authenticate with the Salt Master at 10.82.0.1
2014-07-22 16:39:27,140 [salt.crypt                                  ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-22 16:39:27,199 [salt.crypt                                  ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-22 16:39:27,337 [salt.crypt                                  ][DEBUG   ] Decrypting the current master AES key
2014-07-22 16:39:27,337 [salt.crypt                                  ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-22 16:39:27,453 [salt.minion                                 ][INFO    ] Authentication with master successful!
2014-07-22 16:39:27,453 [salt.minion                                 ][INFO    ] User root Executing command cmd.run with jid 20140722163917125243
2014-07-22 16:39:27,453 [salt.minion                                 ][DEBUG   ] Command details {'tgt_type': 'glob', 'jid': '20140722163917125243', 'tgt': '*', 'ret': '', 'user': 'root', 'arg': ['date'], 'fun': 'cmd.run'}
2014-07-22 16:39:27,465 [salt.minion                                 ][INFO    ] User root Executing command saltutil.find_job with jid 20140722163922135841
2014-07-22 16:39:27,465 [salt.minion                                 ][DEBUG   ] Command details {'tgt_type': 'glob', 'jid': '20140722163922135841', 'tgt': '*', 'ret': '', 'user': 'root', 'arg': ['20140722163917125243'], 'fun': 'saltutil.find_job'}
2014-07-22 16:39:27,468 [salt.loaded.int.module.cmdmod               ][INFO    ] Executing command 'date' in directory '/'
2014-07-22 16:39:27,478 [salt.loaded.int.module.cmdmod               ][INFO    ] output: Tue Jul 22 16:39:27 CEST 2014
2014-07-22 16:39:27,479 [salt.minion                                 ][INFO    ] Returning information for job: 20140722163917125243
2014-07-22 16:39:27,481 [salt.minion                                 ][INFO    ] Returning information for job: 20140722163922135841

srv8 minion debug log:

2014-07-22 14:45:45,407 [salt             ][INFO    ] Setting up the Salt Minion "srv8.example.intern"
2014-07-22 14:45:45,420 [salt.utils.process][DEBUG   ] Created pidfile: /run/salt-minion.pid
2014-07-22 14:45:45,421 [salt.config      ][DEBUG   ] Reading configuration from /etc/salt/minion
2014-07-22 14:45:45,484 [salt.config      ][DEBUG   ] Including configuration from '/etc/salt/minion.d/myMinion.conf'
2014-07-22 14:45:45,484 [salt.config      ][DEBUG   ] Reading configuration from /etc/salt/minion.d/myMinion.conf
2014-07-22 14:45:45,702 [salt.minion                              ][DEBUG   ] Attempting to authenticate with the Salt Master at 10.82.0.1
2014-07-22 14:45:45,716 [salt.crypt                               ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-22 14:45:45,790 [salt.crypt                               ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-22 14:45:45,944 [salt.crypt                               ][DEBUG   ] Decrypting the current master AES key
2014-07-22 14:45:45,944 [salt.crypt                               ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-22 14:45:46,082 [salt.minion                              ][INFO    ] Authentication with master successful!
2014-07-22 14:45:46,083 [salt.crypt                               ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-22 14:45:46,224 [salt.crypt                               ][DEBUG   ] Decrypting the current master AES key
2014-07-22 14:45:46,224 [salt.crypt                               ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-22 14:45:46,384 [salt.crypt                               ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-22 14:45:46,454 [salt.config                              ][DEBUG   ] Reading configuration from /etc/salt/minion
2014-07-22 14:45:46,516 [salt.config                              ][DEBUG   ] Including configuration from '/etc/salt/minion.d/myMinion.conf'
2014-07-22 14:45:46,517 [salt.config                              ][DEBUG   ] Reading configuration from /etc/salt/minion.d/myMinion.conf
2014-07-22 14:45:49,874 [salt.loader                              ][DEBUG   ] Loaded localemod as virtual locale
2014-07-22 14:45:49,875 [salt.loader                              ][DEBUG   ] Loaded groupadd as virtual group
2014-07-22 14:45:49,877 [salt.loader                              ][DEBUG   ] Loaded ebuild as virtual pkg
2014-07-22 14:45:49,881 [salt.loader                              ][DEBUG   ] Loaded linux_sysctl as virtual sysctl
2014-07-22 14:45:49,887 [salt.loader                              ][DEBUG   ] Loaded parted as virtual partition
2014-07-22 14:45:49,894 [salt.loader                              ][DEBUG   ] Loaded linux_acl as virtual acl
2014-07-22 14:45:49,903 [salt.loader                              ][DEBUG   ] Loaded sysmod as virtual sys
2014-07-22 14:45:49,910 [salt.loader                              ][DEBUG   ] Loaded zcbuildout as virtual buildout
2014-07-22 14:45:49,912 [salt.loader                              ][DEBUG   ] Loaded useradd as virtual user
2014-07-22 14:45:49,913 [salt.loader                              ][DEBUG   ] Loaded gentoo_service as virtual service
2014-07-22 14:45:49,925 [salt.loader                              ][DEBUG   ] Loaded virtualenv_mod as virtual virtualenv
2014-07-22 14:45:49,925 [salt.loader                              ][DEBUG   ] Loaded djangomod as virtual django
2014-07-22 14:45:49,930 [salt.loader                              ][DEBUG   ] Loaded cmdmod as virtual cmd
2014-07-22 14:45:49,935 [salt.loader                              ][DEBUG   ] Loaded gentoolkitmod as virtual gentoolkit
2014-07-22 14:45:49,936 [salt.loader                              ][DEBUG   ] Loaded linux_lvm as virtual lvm
2014-07-22 14:45:49,992 [salt.loader                                 ][DEBUG   ] Loaded syslog_return as virtual syslog
2014-07-22 14:45:49,992 [salt.loader                                 ][DEBUG   ] Loaded couchdb_return as virtual couchdb
2014-07-22 14:45:49,993 [salt.loader                                 ][DEBUG   ] Loaded carbon_return as virtual carbon
2014-07-22 14:45:49,993 [salt.minion                                 ][DEBUG   ] I am srv8.example.intern and I am not supposed to start any proxies.
2014-07-22 14:45:49,993 [salt.minion                                 ][INFO    ] Minion is starting as user 'root'
2014-07-22 14:45:49,993 [salt.minion                                 ][DEBUG   ] Minion 'srv8.example.intern' trying to tune in
2014-07-22 14:45:49,994 [salt.minion                                 ][DEBUG   ] Minion PUB socket URI: ipc:///var/run/salt/minion/minion_event_a71ed27d3670f6381b102a69f27141a8_pub.ipc
2014-07-22 14:45:49,994 [salt.minion                                 ][DEBUG   ] Minion PULL socket URI: ipc:///var/run/salt/minion/minion_event_a71ed27d3670f6381b102a69f27141a8_pull.ipc
2014-07-22 14:45:49,994 [salt.minion                                 ][DEBUG   ] Generated random reconnect delay between '1000ms' and '11000ms' (8725)
2014-07-22 14:45:49,994 [salt.minion                                 ][DEBUG   ] Setting zmq_reconnect_ivl to '8725ms'
2014-07-22 14:45:49,995 [salt.minion                                 ][DEBUG   ] Setting zmq_reconnect_ivl_max to '11000ms'
2014-07-22 14:45:50,507 [salt.utils.schedule                         ][DEBUG   ] Running scheduled job: __mine_interval
2014-07-22 14:45:50,508 [salt.utils.schedule                         ][DEBUG   ] schedule: This job was scheduled with jid_include, adding to cache (jid_include defaults to True)
2014-07-22 14:45:50,517 [salt.utils.schedule                         ][DEBUG   ] schedule: This job was scheduled with a max number of 2
2014-07-22 14:45:50,545 [salt.utils.schedule                         ][DEBUG   ] schedule.handle_func: adding this job to the jobcache with data {'fun': 'mine.update', 'jid': '20140722144550533394', 'pid': 4272, 'id': 'srv8.example.intern'}
2014-07-22 14:45:50,548 [salt.crypt                                  ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-22 14:45:50,693 [salt.crypt                                  ][DEBUG   ] Decrypting the current master AES key
2014-07-22 14:45:50,694 [salt.crypt                                  ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-22 14:45:50,832 [salt.crypt                                  ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-22 14:45:50,903 [salt.crypt                                  ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-22 14:45:51,047 [salt.crypt                                  ][DEBUG   ] Decrypting the current master AES key
2014-07-22 14:45:51,048 [salt.crypt                                  ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-22 15:45:50,443 [salt.utils.schedule                         ][DEBUG   ] Running scheduled job: __mine_interval
2014-07-22 15:45:50,443 [salt.utils.schedule                         ][DEBUG   ] schedule: This job was scheduled with jid_include, adding to cache (jid_include defaults to True)
2014-07-22 15:45:50,443 [salt.utils.schedule                         ][DEBUG   ] schedule: This job was scheduled with a max number of 2
2014-07-22 15:45:50,461 [salt.utils.schedule                         ][DEBUG   ] schedule.handle_func: adding this job to the jobcache with data {'fun': 'mine.update', 'jid': '20140722154550449998', 'pid': 4503, 'id': 'srv8.example.intern'}
2014-07-22 15:45:50,464 [salt.crypt                                  ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-22 15:45:50,614 [salt.crypt                                  ][DEBUG   ] Decrypting the current master AES key
2014-07-22 15:45:50,614 [salt.crypt                                  ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-22 15:45:50,753 [salt.crypt                                  ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-22 15:45:50,824 [salt.crypt                                  ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-22 15:45:50,965 [salt.crypt                                  ][DEBUG   ] Decrypting the current master AES key
2014-07-22 15:45:50,965 [salt.crypt                                  ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-22 16:39:17,126 [salt.crypt                                  ][DEBUG   ] Failed to authenticate message
2014-07-22 16:39:17,126 [salt.minion                                 ][DEBUG   ] Waiting 45 seconds to re-authenticate
2014-07-22 16:40:02,172 [salt.minion                                 ][DEBUG   ] Attempting to authenticate with the Salt Master at 10.82.0.1
2014-07-22 16:40:02,172 [salt.crypt                                  ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-22 16:40:02,243 [salt.crypt                                  ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-22 16:40:02,387 [salt.crypt                                  ][DEBUG   ] Decrypting the current master AES key
2014-07-22 16:40:02,388 [salt.crypt                                  ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-22 16:40:02,526 [salt.minion                                 ][INFO    ] Authentication with master successful!
2014-07-22 16:40:02,527 [salt.minion                                 ][INFO    ] User root Executing command cmd.run with jid 20140722163917125243
2014-07-22 16:40:02,527 [salt.minion                                 ][DEBUG   ] Command details {'tgt_type': 'glob', 'jid': '20140722163917125243', 'tgt': '*', 'ret': '', 'user': 'root', 'arg': ['date'], 'fun': 'cmd.run'}
2014-07-22 16:40:02,545 [salt.minion                                 ][INFO    ] User root Executing command saltutil.find_job with jid 20140722163922135841
2014-07-22 16:40:02,546 [salt.minion                                 ][DEBUG   ] Command details {'tgt_type': 'glob', 'jid': '20140722163922135841', 'tgt': '*', 'ret': '', 'user': 'root', 'arg': ['20140722163917125243'], 'fun': 'saltutil.find_job'}
2014-07-22 16:40:02,553 [salt.loaded.int.module.cmdmod               ][INFO    ] Executing command 'date' in directory '/'
2014-07-22 16:40:02,570 [salt.loaded.int.module.cmdmod               ][INFO    ] output: Tue Jul 22 16:40:02 CEST 2014
2014-07-22 16:40:02,575 [salt.minion                                 ][INFO    ] Returning information for job: 20140722163917125243
2014-07-22 16:40:02,580 [salt.minion                                 ][INFO    ] Returning information for job: 20140722163922135841

Because the re-authentication time is different per client I think that's why it feels like it takes 1-2 minutes until all minions are back like stated in https://github.com/saltstack/salt/issues/14343#issue-38215626

Now the question is why authentication will fail sometimes. I am wondering that the master doesn't log any details regarding the failed authentication.

basepi commented 9 years ago

Well, interestingly enough, in both cases the failed authentication was before the actual command was received by master, so it doesn't seem like that's actually the issue. The command is received and returned quickly.

Whissi commented 9 years ago

Here's another minion log:

2014-07-23 20:27:44,410 [salt.minion                                 ][INFO    ] Returning information for job: 20140723202605636921
2014-07-23 20:44:48,515 [salt.crypt                                  ][DEBUG   ] Failed to authenticate message
2014-07-23 20:44:48,515 [salt.minion                                 ][DEBUG   ] Waiting 12 seconds to re-authenticate
2014-07-23 20:45:00,527 [salt.minion                                 ][DEBUG   ] Attempting to authenticate with the Salt Master at 10.82.0.1
2014-07-23 20:45:00,528 [salt.crypt                                  ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-23 20:45:00,598 [salt.crypt                                  ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-23 20:45:00,747 [salt.crypt                                  ][DEBUG   ] Decrypting the current master AES key
2014-07-23 20:45:00,747 [salt.crypt                                  ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-23 20:45:00,886 [salt.minion                                 ][INFO    ] Authentication with master successful!
2014-07-23 20:45:00,886 [salt.minion                                 ][INFO    ] User root Executing command cmd.run with jid 20140723204448511760
2014-07-23 20:45:00,886 [salt.minion                                 ][DEBUG   ] Command details {'tgt_type': 'glob', 'jid': '20140723204448511760', 'tgt': '*', 'ret': '', 'user': 'root', 'arg': ['date'], 'fun': 'cmd.run'}
2014-07-23 20:45:00,903 [salt.minion                                 ][INFO    ] User root Executing command saltutil.find_job with jid 20140723204453520459
2014-07-23 20:45:00,903 [salt.minion                                 ][DEBUG   ] Command details {'tgt_type': 'glob', 'jid': '20140723204453520459', 'tgt': '*', 'ret': '', 'user': 'root', 'arg': ['20140723204448511760'], 'fun': 'saltutil.find_job'}
2014-07-23 20:45:00,914 [salt.loaded.int.module.cmdmod               ][INFO    ] Executing command 'date' in directory '/'
2014-07-23 20:45:00,933 [salt.loaded.int.module.cmdmod               ][INFO    ] output: Wed Jul 23 20:45:00 CEST 2014
2014-07-23 20:45:00,933 [salt.minion                                 ][INFO    ] Returning information for job: 20140723204448511760
2014-07-23 20:45:00,934 [salt.minion                                 ][INFO    ] Returning information for job: 20140723204453520459
basepi commented 9 years ago

Weird! For now I would recommend setting up a cron job with just a basic test.ping call every few minutes to keep it from timing out. This one's definitely strange, but we haven't been able to reproduce it yet!

steverweber commented 9 years ago

FYI the current develop version has a keepalive patch that might help keep the minion online.

the minion will ping the master every 10min... if the master don't respond the minion will restart. ping_interval: 10

The above might get you by for now but the root issue should still be investigated.

cachedout commented 9 years ago

I tend to wonder if this might be related to the ZeroMQ connection caching that was added in that time frame.

(cc: @jacksontj for his thoughts.)

pwaller commented 9 years ago

+1 I'm also getting this since 2014.7.1. It makes salt almost impossible to use.

jacksontj commented 9 years ago

Actually looking at the logs it doesn't seem to be related to my zmq caching since it happens so close to when the event happens (and before the command is run)-- and the channel re-use stuff is on rep/recv (the return side of things). The failure that you are seeing looks like the master AES key has rotated in that time. If you look at srv8 minion's log you can see the authentication failure message in the log at the exact time that the job was published. The only 2 places that print that exact message are in crypt.py (https://github.com/saltstack/salt/blob/develop/salt/crypt.py#L671) which both mean that either the AES key was bad or the pub job was malformed-- i'd bet on the AES key rotating. With that you are kicking in the acceptance_wait_time back-off (https://github.com/saltstack/salt/blob/develop/salt/minion.py#L874) which makes the re-auth much less aggressive. This was changed back to higher numbers (50s+) in this release, although I know for a fact that it has been set significantly lower in previous releases. This would cause your returns to take significantly longer to come back because the minion is taking significantly longer to get them (time of pub + time to re-auth + job time).

Now, the reason that this effects some minions and not all is that the re-auth delay is randomized, so it makes sense that "some" minions "don't" return, becuase they got a longer re-auth timer.

If your environment is relatively small you can tune down the acceptance_wait_time to have the minions re-auth faster, an maybe disable AES key rotation if you have it enabled. The ping every N seconds will help because it will force the minion to re-auth sooner (instead of waiting for a job) but it doesn't solve the problem. Basically its a tradeoff of minion availability and master load (for the re-auth).

If you want to play around with it more you can probably re-produce this by doing a salt-key -d on some specific minion key as this forces an AES key rotation, or otherwise forcing an AES key rotation

pwaller commented 9 years ago

@jacksontj, is what you just suggested consistent with running test.ping on all minions repeatedly and having it fail after ~12 seconds on the same set of minions in succession many times? > 5?

steverweber commented 9 years ago

please ckeck your system logs on the minon to be sure nothing strange correlates with the minion down time.

/var/log/kern.log
/var/log/syslog

this might sound a little silly, but could you log pings to the master...

crontab -e
@reboot ping masterhostname | perl -nle 'BEGIN {$|++} print scalar(localtime), " ", $_' > /var/log/pingtest

im not sure if the above command will work... if not dont worry about it :)

jacksontj commented 9 years ago

@pwaller Yes, the re-auth is a blocking call within the minion's main process. So, until it is able to re-auth no new jobs will be pulled in. Depending on the HWMs it may or may not process those jobs after the auth is finished.

drawsmcgraw commented 9 years ago

Just adding a +1 for me. Version 2014.1.7. Seems to be "several minutes" (2-10) before the minions are responsive again.

JustinCarmony commented 9 years ago

+1 for me as well, we've added a cronjob to test.ping every 5 minutes, but a fix in salt for this would be awesome. :)

UtahDave commented 9 years ago

Try adding random_reauth_delay: 5 recon_max: 1000

To your minion config

There are some really great optimizations coming soon, too On Aug 21, 2014 10:55 AM, "Justin Carmony" notifications@github.com wrote:

+1 for me as well, we've added a cronjob to test.ping every 5 minutes, but a fix in salt for this would be awesome. :)

— Reply to this email directly or view it on GitHub https://github.com/saltstack/salt/issues/14343#issuecomment-52957232.

Rucknar commented 9 years ago

@UtahDave I believe we've run into this since going to 2014.1.7, you mentioned some optimizations, should we upgrade to 2014.1.10 or are you talking about helium?

basepi commented 9 years ago

@superted666 I think most of the optimizations will be in 2014.7, not 2014.1.

Rucknar commented 9 years ago

Thanks @basepi I'll await the full release of 2014.7.0

drawsmcgraw commented 9 years ago

@superted666 As a data point, I tested the config changes @UtahDave mentioned and they really take the sting out of the bug. The minions don't come back immediately from a test.ping but they come back much faster than before the config changes. I'm not sitting there, waiting for a watch salt-run manage.status loop to tell me the minions are available again.

Whissi commented 9 years ago

What I don't understand: It sounds like you are focusing on the re-authentication. I.e. when the authentication is lost, you are working on improving the re-authentication (all the _random_reauthdelay and _reconmax stuff).

But isn't the root problem that sometimes an authenticated minion becomes invalid? https://github.com/saltstack/salt/issues/14343#issuecomment-51358443 indicates that it has something to do with the key rotation. So shouldn't we focus on fixing that? From my understanding key rotation should never cause a problem. If that's the root cause.. that's the bug.

jacksontj commented 9 years ago

We could work on the key rotation, its just not as simple as you'd think. Since the aes key is global when it rotates all minions need it on the next publish. Since all publishes go everywhere (unless you use zmq_filtering) all minions I'll immediately need to re-auth after the rotation. This usually causes the master to get overloaded and the minions to start their re-auth back off. It would be preferable if the minions didn't reconnect, but rather waited longer only because zmq has no way of dropping the message (since there is no way to know that the minion disconnected). On Sep 3, 2014 8:11 AM, "Thomas D." notifications@github.com wrote:

What I don't understand: It sounds like you are focusing on the re-authentication. I.e. when the authentication is lost, you are working on improving the re-authentication (all the _random_reauthdelay and _reconmax stuff).

But isn't the root problem that sometimes an authenticated minion becomes invalid? #14343 (comment) https://github.com/saltstack/salt/issues/14343#issuecomment-51358443 indicates that it has something to do with the key rotation. So shouldn't we focus on fixing that? From my understanding key rotation should never cause a problem. If that's the root cause.. that's the bug.

— Reply to this email directly or view it on GitHub https://github.com/saltstack/salt/issues/14343#issuecomment-54311814.

basepi commented 9 years ago

Right, architecturally there's no way to solve the issue of minions having to reauth without just disabling key rotation, which is less secure (and allows deleted minions to listen in on the encrypted channel).

The solution is probably to just force a reauth of all the minions immediately after key rotation. Maybe just send a test.ping to '*' so that this delay can be avoided altogether, unless you're running commands immediately after rotation.

jacksontj commented 9 years ago

In a lot of ways the reauth_delay stuff is what is actually killing us, since backoff (the sleep) just creates more auth requests. Maybe what we should do is change that backoff time to actually wait longer for the response instead of re-creating it.

On Mon, Sep 8, 2014 at 11:55 AM, Colton Myers notifications@github.com wrote:

Right, architecturally there's no way to solve the issue of minions having to reauth without just disabling key rotation, which is less secure (and allows deleted minions to listen in on the encrypted channel).

The solution is probably to just force a reauth of all the minions immediately after key rotation. Maybe just send a test.ping to '*' so that this delay can be avoided altogether, unless you're running commands immediately after rotation.

— Reply to this email directly or view it on GitHub https://github.com/saltstack/salt/issues/14343#issuecomment-54869202.

Whissi commented 9 years ago

test.ping is the current workaround, yes.

Is there a way to make key rotation visible (add the key rotation event to the debug log)? I am not yet convinced that this is the root cause. In other words: If it is the key rotation which kills the connection, why don't we see a time out when using the test.ping command every 5 minutes? Or are keys only rotating when not in use for x minutes? Why is test.ping a working workaround?

jacksontj commented 9 years ago

I've just submitted a pull req to log the aes key rotation on the master side (https://github.com/saltstack/salt/pull/15603). You can either pull that commit in and spin a new build or watch the directory for the file that is made (the .dfn file).

Key rotation in salt happens for a few different reasons:

Key rotation doesn't really "kill" the connection, it just means that the next pub job the minion gets will be encrypted using a different key-- meaning the minions needs to re-auth. The test.ping workaround will just make you hit that same re-auth sooner so that when you run a command its less likely that you have minions stuck in the re-auth state.

On Mon, Sep 8, 2014 at 12:29 PM, Thomas D. notifications@github.com wrote:

test.ping is the current workaround, yes.

Is there a way to make key rotation visible (add the key rotation event to the debug log)? I am not yet convinced that this is the root cause. In other words: If it is the key rotation which kills the connection, why don't we see a time out when using the test.ping command every 5 minutes? Or are keys only rotating when not in use for x minutes? Why is test.ping a working workaround?

— Reply to this email directly or view it on GitHub https://github.com/saltstack/salt/issues/14343#issuecomment-54873900.

Rucknar commented 9 years ago

Thanks for the info @jacksontj. I've not heard of publish_session before this ticket. Just had a look and by default it's set to 'publish_session': 86400,. I presume this is seconds? I wonder if were falsely triggering the rotation unintentionally, of interest, we also use a reauth_delay to avoid overloading our master.

jacksontj commented 9 years ago

Yea, that is in seconds-- so that would mean every 24h. You will also trigger a key rotation on every salt-key delete (-d or -D), not only every "publish_session" seconds.

We also use the reauth_delay stuff in our infrastructure, but the more we've been using it (at least in its current implementation) the more problems it seems to be causing ;)

On Tue, Sep 9, 2014 at 12:05 AM, superted666 notifications@github.com wrote:

Thanks for the info @jacksontj https://github.com/jacksontj. I've not heard of publish_session before this ticket. Just had a look and by default it's set to 'publish_session': 86400,. I presume this is seconds? I wonder if were falsely triggering the rotation unintentionally, of interest, we also use a reauth_delay to avoid overloading our master.

— Reply to this email directly or view it on GitHub https://github.com/saltstack/salt/issues/14343#issuecomment-54932256.

Whissi commented 9 years ago

OK, I applied https://github.com/saltstack/salt/pull/15603 to salt-2014.1.10 and deactivated the 5min test.ping cronjob used as workaround.

For a short moment I thought the problem is gone because everything was working for ~20 hours. No timeout.

But then it happens again. And thanks to the new log output I saw the reason:

2014-09-10 01:02:24,099 [salt.master                                 ][DEBUG   ] New master AES key found by pid 3448
2014-09-10 01:02:24,453 [salt.master                                 ][DEBUG   ] New master AES key found by pid 3441
2014-09-10 01:02:24,730 [salt.master                                 ][DEBUG   ] New master AES key found by pid 3447
2014-09-10 01:11:44,017 [salt.master                                 ][DEBUG   ] New master AES key found by pid 3444
2014-09-10 01:11:44,233 [salt.master                                 ][DEBUG   ] New master AES key found by pid 3440

The first salt command after this was logged run into a timeout. All other following requests succeeded until

2014-09-11 01:02:24,848 [salt.master                                 ][DEBUG   ] New master AES key found by pid 3444
2014-09-11 01:02:25,200 [salt.master                                 ][DEBUG   ] New master AES key found by pid 3440
2014-09-11 01:02:25,477 [salt.master                                 ][DEBUG   ] New master AES key found by pid 3448
2014-09-11 01:11:44,474 [salt.master                                 ][DEBUG   ] New master AES key found by pid 3441
2014-09-11 01:11:44,690 [salt.master                                 ][DEBUG   ] New master AES key found by pid 3447

To be clear:

  1. Keys will rotate at 01:02:00 (not sure why exactly at this time, but that doesn't matter at the moment).
  2. The next command after the master has rotated its key, will fail. It doesn't matter if you issue the next command at 01:15:00 or 23:55:00... the next command will always fail.

From my understanding (please correct me if I am wrong) this is happening because the master doesn't notify the minions "Hey, I have rotated my keys... check your authentication...". So when the first command with the new key gets published all the subscribed minions will notice "Mhh... cannot decrypt... let me skip this message and re-authenticate!".

This could be improved if

jacksontj commented 9 years ago

The notification of minions should be relatively easy (i'd think the reactor would be a great place for it). As for keeping the old key around until the new ones-- thats kinda hard for a couple reasons.

  1. The reason that the key is rotation (usually) is that we've revoked access to a minion.So, if we allow the old one, then we can't verify that the revoked minion won't have access to the data
  2. Publications are encrypted using the same symmetric key, so we end up with a variant of problem #1

As for the minion cannot decrypt thing, it actually does exactly what you describe (re-auth, then start processing where it left off). The issue is 2 part:

  1. Minion backoff timers start out REALLY high-- which is maybe good for large installs, but just slow for small ones
  2. Minion timeouts don't drop the auth request from the master, so by trying, disconnecting, and reconnecting your minion now has 2 auth requests in the master queue-- the only reason this ends up helping is due to the increased backoff-- which is more latency!

I've looked into solving this issue a bit, but zeromq makes solving a few of these problems difficult (primarily the backoff problem). We should probably change the minion backoff thing to just sleep instead of disconnecting, sleeping, and reconnecting. That would help master load in the auth storming, but still doesn't help with the latency of the minion re-connecting (which is hard to work around).

On Wed, Sep 10, 2014 at 6:07 PM, Thomas D. notifications@github.com wrote:

OK, I applied #15603 https://github.com/saltstack/salt/pull/15603 to salt-2014.1.10 and deactivated the 5min test.ping cronjob used as workaround.

For a short moment I thought the problem is gone because everything was working for ~20 hours. No timeout.

But then it happens again. And thanks to the new log output I saw the reason:

2014-09-10 01:02:24,099 [salt.master ][DEBUG ] New master AES key found by pid 3448 2014-09-10 01:02:24,453 [salt.master ][DEBUG ] New master AES key found by pid 3441 2014-09-10 01:02:24,730 [salt.master ][DEBUG ] New master AES key found by pid 3447 2014-09-10 01:11:44,017 [salt.master ][DEBUG ] New master AES key found by pid 3444 2014-09-10 01:11:44,233 [salt.master ][DEBUG ] New master AES key found by pid 3440

The first salt command after this was logged run into a timeout. All other following requests succeeded until

2014-09-11 01:02:24,848 [salt.master ][DEBUG ] New master AES key found by pid 3444 2014-09-11 01:02:25,200 [salt.master ][DEBUG ] New master AES key found by pid 3440 2014-09-11 01:02:25,477 [salt.master ][DEBUG ] New master AES key found by pid 3448 2014-09-11 01:11:44,474 [salt.master ][DEBUG ] New master AES key found by pid 3441 2014-09-11 01:11:44,690 [salt.master ][DEBUG ] New master AES key found by pid 3447

To be clear:

  1. Keys will rotate at 01:02:00 (not sure why exactly at this time, but that doesn't matter at the moment).
  2. The next command after the master has rotated its key, will fail. It doesn't matter if you issue the next command at 01:15:00 or 23:55:00... the next command will always fail.

From my understanding (please correct me if I am wrong) this is happening because the master doesn't notify the minions "Hey, I have rotated my keys... check your authentication...". So when the first command with the new key gets published all the subscribed minions will notice "Mhh... cannot decrypt... let me skip this message and re-authenticate!".

This could be improved if

-

The master would notify all minions about a changed key. Because the master knows all the minions which should react on test.ping this shouldn't be a problem. Also, the master could send this notification in batches to keep the load under control (the thing why we introduced the dynamic re-authentication delay).

  • Maybe the master must keep the old key until all minions got notified... but this shouldn't be a problem if the master will only use the old key for such notifications. Also, when the master knows that all minions were notified, the master could remove the old key.
  • When the minion will notice that it cannot decrypt the current message, the minion could keep the message in queue, start re-authentication and restart processing the queue.

— Reply to this email directly or view it on GitHub https://github.com/saltstack/salt/issues/14343#issuecomment-55206556.

cachedout commented 9 years ago

@jacksontj Regarding AES key rotation, there's some work over in #15722.

Whissi commented 9 years ago

As for the minion cannot decrypt thing, it actually does exactly what you describe (re-auth, then start processing where it left off).

Are you sure that this is happening? I mean, if this is actual happening, shouldn't I see the following work:

  1. Wait for a key change (in my case, wait for 01:15)
  2. Run salt '*' cmd.run 'touch /i-was-here'

The command will timeout - no minion will return (because that's the first command after the key was rotated).

But if you are right and the minion will detect a key change, re-authenticate and resume work, than /i-was-here should have a last-modified time after 01:15+backoff_timer_value... I am not sure about the current backofftimer value, but if I will check each minion at 02:15 (60 minutes after I run the touch command) I should see recently touched file /i-was-here_, not?

If this is not working (my feeling) /i-was-here should not exist/shouldn't be modified on any minion... because the job (touching /i-was-here) was lost.

I'll test that in 3 hours... :)

jacksontj commented 9 years ago

Guess I should be clear there too, for some reason there are different amounts of retries allowed at varying points of salt. Usually its at least 2 tries that are allowed.

And yes, for your test from what I expect to happen is something like

If the back-off time is relatively long its very likely that return will be after the CLI finished waiting (esp since the master CLI timeout is 5s by default).

On Fri, Sep 12, 2014 at 1:14 PM, Thomas D. notifications@github.com wrote:

As for the minion cannot decrypt thing, it actually does exactly what you describe (re-auth, then start processing where it left off).

Are you sure that this is happening? I mean, if this is actual happening, shouldn't I see the following work:

  1. Wait for a key change (in my case, wait for 01:15)
  2. Run salt '*' cmd.run 'touch /i-was-here'

The command will timeout - no minion will return (because that's the first command after the key was rotated).

But if you are right and the minion will detect a key change, re-authenticate and resume work, than /i-was-here should have a last-modified time after 01:15+backoff_timer_value... I am not sure about the current backofftimer value, but if I will check each minion at 02:15 (60 minutes after I run the touch command) I should see recently touched file /i-was-here_, not?

If this is not working (my feeling) /i-was-here should not exist/shouldn't be modified on any minion... because the job (touching /i-was-here) was lost.

I'll test that in 3 hours... :)

— Reply to this email directly or view it on GitHub https://github.com/saltstack/salt/issues/14343#issuecomment-55454959.

Whissi commented 9 years ago

@jacksontj OK, what you said is correct:

master ~ # grep -i found /var/log/salt/master
2014-09-14 01:02:25,236 [salt.master                                 ][DEBUG   ] New master AES key found by pid 3440
2014-09-14 01:02:25,588 [salt.master                                 ][DEBUG   ] New master AES key found by pid 3448
2014-09-14 01:02:25,865 [salt.master                                 ][DEBUG   ] New master AES key found by pid 3441
2014-09-14 01:11:44,367 [salt.master                                 ][DEBUG   ] New master AES key found by pid 3447
2014-09-14 01:11:44,590 [salt.master                                 ][DEBUG   ] New master AES key found by pid 3444

master ~ # date
Sun Sep 14 01:22:14 CEST 2014
master ~ # salt '*' --show-timeout cmd.run 'date > /salt.test && echo "Done!" || echo "Failed! :("'
srv2.example.intern:
    Minion did not return
srv3.example.intern:
    Minion did not return
srv4.example.intern:
    Minion did not return
srv5.example.intern:
    Minion did not return
srv6.example.intern:
    Minion did not return
srv7.example.intern:
    Minion did not return
srv8.example.intern:
    Minion did not return

master ~ # salt-run jobs.print_job 20140914012218141211
'20140914012218141211':
  Arguments:
  - date > /salt.test && echo "Done!" || echo "Failed! :("
  Function: cmd.run
  Result:
    srv2.example.intern: Done!
    srv3.example.intern: Done!
    srv4.example.intern: Done!
    srv5.example.intern: Done!
    srv6.example.intern: Done!
    srv7.example.intern: Done!
    srv8.example.intern: Done!
  StartTime: 2014, Sep 14 01:22:18.141211
  Target: '*'
  Target-type: glob
  User: root

master ~ # salt '*' --show-timeout cmd.run 'cat /salt.test'
srv2.example.intern:
    Sun Sep 14 01:22:50 CEST 2014
srv5.example.intern:
    Sun Sep 14 01:22:49 CEST 2014
srv6.example.intern:
    Sun Sep 14 01:23:08 CEST 2014
srv4.example.intern:
    Sun Sep 14 01:23:13 CEST 2014
srv3.example.intern:
    Sun Sep 14 01:23:04 CEST 2014
srv7.example.intern:
    Sun Sep 14 01:22:44 CEST 2014
srv8.example.intern:
    Sun Sep 14 01:22:35 CEST 2014

I don't understand the find_job return value salt will fire after gather_job_timeout. In this case:

# salt-run jobs.print_job 20140914012223152825
'20140914012223152825':
  Arguments:
  - '20140914012218141211'
  Function: saltutil.find_job
  Result:
    srv2.example.intern: {}
    srv3.example.intern: {}
    srv4.example.intern: {}
    srv5.example.intern: {}
    srv6.example.intern:
      arg:
      - date > /salt.test && echo "Done!" || echo "Failed! :("
      fun: cmd.run
      jid: '20140914012218141211'
      pid: 36977
      ret: ''
      tgt: '*'
      tgt_type: glob
      user: root
    srv7.example.intern: {}
    srv8.example.intern:
      arg:
      - date > /salt.test && echo "Done!" || echo "Failed! :("
      fun: cmd.run
      jid: '20140914012218141211'
      pid: 19054
      ret: ''
      tgt: '*'
      tgt_type: glob
      user: root
  StartTime: 2014, Sep 14 01:22:23.152825
  Target: '*'
  Target-type: glob
  User: root

From my understanding minion srv2, srv3, srv4, srv5 and srv7 are saying

No, I have no information regarding jid 20140914012218141211

but srv6 and srv8 are running this job. So I don't understand why these two minions are also listed with Minion did not return status.

Using the reactor system sounds nice. But we need a way to tell the CLI that the job isn't lost. Well, like shown, the job isn't lost, but when you are using the remote execution stuff and get such a response you will probably re-send the command. But this could be a problem if the command shouldn't/cannot run twice for example.

donatello commented 9 years ago

:+1: I am also affected by this problem. I have about a 100 minions distributed across many DCs. The problem happens quite often, as our servers scale up and scale down un-attended. The salt master loses connectivity to many minions and ends up disrupting other orchestrations when the AES key is rotated due to a minion being deleted.

Are there any stop gap measures that I can use, while this issue is being resolved?

basepi commented 9 years ago

On 2014.7.0, if you set rotate_aes_key: False in your master config and restart your master, it will stop rotating the aes key on key deletion/rejection.

This does allow minions with those keys to continue to eavesdrop on publishes, but they can't interact with the master anymore.

kennycoder commented 9 years ago

Same issues here... running Debian 7. I did setup a basic cron job to run every 5 minutes but it still doesn't help. Makes it quite unusable.

kelchm commented 9 years ago

I've been dealing with this issue for the past few weeks without much success in finding a solution. Running a test.ping cron every few minutes seemed to help, but really didn't fix the issue at all. Similarly, I saw no improvement after configuring rotate_aes_key: False.

Last night out of desperation I tried completely disabling the salt reactor configuration I had in place on my master. I've had zero issues with minions dropping or not returning since making this change.

basepi commented 9 years ago

@kelchm Just to verify, are you on 2014.7.0? Actually, the output of salt --versions-report would be helpful. This is potentially good information if we can find a pattern.

basepi commented 9 years ago

@kelchm We would also be interested to see your reactor config and sls files.

jacksontj commented 9 years ago

@kelchm AFAIK there should be nothing inherent in the reactor that would cause AES key rotation (and hence your problem). My best guess would be that one of your reactor events causes a key rotation (deleting a salt-minion key maybe?) But the reactor config and sls would help us nail that down.

frogunder commented 8 years ago

@Whissi - Is this still a problem for you?