Possibly relevant? https://github.com/saltstack/salt/issues/14499
I tried adding this to the minion.conf:
ping_interval: 1
auth_safemode: True
restart_on_error: True
No recovery after 15 pings:
2014-12-09 14:54:06,179 [salt.minion ][DEBUG ] Ping master
2014-12-09 14:54:06,295 [salt.crypt ][DEBUG ] Decrypting the current master AES key
2014-12-09 14:54:06,296 [salt.crypt ][DEBUG ] Loaded minion key: /etc/salt/pki/minion/minion.pem
The state that causes the reload looks like this:
# vim: sts=2 ts=2 sw=2 et ai
{% from 'pepper/map.jinja' import defaults with context %}
{% from 'salt/map.jinja' import salt_map, minion_config with context %}
minion_conf_clean_up:
  file.absent:
    - names:
      - {{ defaults['config-path'] }}/salt/minion.d/99-master-address.conf

salt-minion:
  pkg.installed:
    - name: {{ salt_map['salt-minion'] }}

salt-minion-conf:
  file.managed:
    - name: {{ defaults['config-path'] }}/salt/minion.d/minion.conf
    - template: jinja
    - makedirs: True
    - source: salt://salt/templates/minion.jinja
    - config: {{ minion_config }}

salt-minion-running:
  service.running:
    - enable: True
    - name: {{ salt_map['minion-service'] }}
    - watch:
      - pkg: {{ salt_map['salt-minion'] }}
      - file: minion_conf_clean_up
      - file: salt-minion-conf
It spins pretty aggressively here: https://github.com/saltstack/salt/blob/2014.7/salt/fileclient.py#L998
Is there a best practice or workaround for salting a minion?
@kev009 can you clarify your final question? The best practice for bootstrapping a salt-minion is the salt-bootstrap script.
On some service systems, salt does have a little bit of trouble restarting itself. The most consistently successful method I've seen is using the at utility, as shown in this comment: https://github.com/saltstack/salt/issues/7997#issuecomment-30978123
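For reference, a typical salt-bootstrap invocation of that era looks roughly like this (the master address is a placeholder, and flags may differ between versions):

curl -o bootstrap-salt.sh -L https://bootstrap.saltstack.com
sudo sh bootstrap-salt.sh -A <master-address>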
@basepi yes, I am bootstrapping with salt-bootstrap. I'm calling 'salt * state.highstate' manually, with only a master and a minion running on the master, to apply the initial minion conf on that master. The result is the hangs chronicled here.
The system is Ubuntu trusty, running the stable Helium repos/deps.
Can you try using a cmd.run with the at command I mentioned above and an onchanges requisite to emulate the watch behavior you're using now? Wondering if that will help for now.
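A minimal sketch of that suggestion, reusing the state IDs and salt_map from the SLS above (the one-minute delay and the plain 'service' command are assumptions):

restart-minion-via-at:
  cmd.run:
    - name: echo "service {{ salt_map['minion-service'] }} restart" | at now + 1 minute
    - python_shell: True
    - onchanges:
      - file: salt-minion-conf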
@basepi sure, I will give this a shot for now.
This will hamper bootstrapping and testing. For instance, I want to test highstate on a fresh master in a CI environment, and some of the minion config changes require a reload to reach full convergence of state there.
Yep, we definitely need to get minion and master restarts solid. Thanks for the feedback.
It looks like https://github.com/saltstack/salt/issues/5721 is the logged bug for minions not being able to restart themselves. It sounds like using 'at' or some external service to manage the minion restart is the recommended method until the key renegotiation issues on a restart can be resolved?
@basepi I'm not super pleased with this but it's a sledgehammer in an idempotent bootstrap. Empirically, it appears to DTRT.
Running 'salt master state.highstate' for the first run unfortunately gives no job feedback (fixed with the soft timeout in the next comment). The second run will finish up and confirm convergence.
The service stop is used to quiesce any new jobs, and the pkill then prevents races with long-running states or hangs with stuck states (the 'at' method is particularly vulnerable to these problems).
salt-minion-reload:
  llcmd.wait:
    - name: |
        nohup echo "sleep 10 && salt-call --local service.stop {{ salt_map['minion-service'] }}; pkill -f 'python.*{{ salt_map['minion-process'] }}'; salt-call --local service.restart {{ salt_map['minion-service'] }}" | sh -s
    - python_shell: True
    - timeout: 2
    - order: last
    - watch:
      - pkg: salt-minion-pkg
      - file: minion_conf_clean_up
      - file: salt-minion-conf
I had to monkey-patch the cmd.watch state module to get this all to work. I'll open a PR for soft timeouts in the cmd execution module.
# -*- coding: utf-8 -*-
'''
Wrappers for cmd.run and cmd.wait that ignore timeouts (i.e. for nohup)
'''
import salt.loader


def run(name, **kwargs):
    '''
    Facade for cmd.run
    '''
    # https://github.com/saltstack/salt/issues/3513
    opts = __opts__.copy()
    opts['grains'] = __grains__
    __states__ = salt.loader.states(opts, __salt__)
    trigger = 'Timed out after'
    ret = __states__['cmd.run'](name, **kwargs)
    # If the command failed only because it hit the soft timeout,
    # rewrite the result as a success.
    if 'retcode' in ret['changes'] and 'stdout' in ret['changes'] \
            and ret['changes']['retcode'] == 1 \
            and trigger in ret['changes']['stdout']:
        ret['changes']['retcode'] = 0
        ret['result'] = True
    return ret


def wait(name, **kwargs):
    '''
    Facade for cmd.wait
    '''
    # No-op. The state system will call the mod_watch function instead.
    return {'name': name, 'changes': {}, 'result': True, 'comment': ''}


mod_watch = run
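Assuming the module above is dropped into the master's file_roots as _states/llcmd.py (so that llcmd.wait in the SLS resolves to it), it has to be synced out before the first run:

salt '*' saltutil.sync_states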
@kev009 We're making rounds on some old issues. It looks like you said you had a fix for this, but I don't see a linked pull request that fixed the issue here. Did you ever submit one? Is this still an issue on newer versions of salt? (We're currently on 2015.5.1.)
@rallytime the workaround is the nohup restart with a cmd.run/wait that does not timeout. Should I submit those changes to the cmdmod?
Please do :)
@kev009 Thanks for submitting that pull request. I've just merged it in. Is this issue good to close now?
@rallytime this doesn't actually solve the bug report. That will need somebody who can dig into an entire salt run life cycle and somehow trigger renegotiation with the master when this happens.
I think at minimum we need to document the workaround somewhere, using cmd with ignore_timeout (the short timeout keeps the state from blocking on the restart, and ignore_timeout keeps the timeout from being reported as a failure):
salt-minion-reload:
  cmd.wait:
    - name: |
        nohup echo "sleep 10 && salt-call --local service.stop {{ salt_map['minion-service'] }}; pkill -f 'python.*{{ salt_map['minion-process'] }}'; salt-call --local service.restart {{ salt_map['minion-service'] }}" | sh -s
    - ignore_timeout: True
    - python_shell: True
    - timeout: 2
    - order: last
    - watch:
      - pkg: salt-minion-pkg
      - file: salt-minion-conf
Ok thanks for clarifying.
@jacobhammons when you get a moment can you add something like this to the documentation (mentioned in the above comment) until this bug can be fully resolved?
@rallytime I guess it is already covered in the FAQ: https://docs.saltstack.com/en/latest/faq.html#what-is-the-best-way-to-restart-a-salt-daemon-using-salt
Good find @vutny. I'll remove the "documentation" label.
I have a minion running directly on the salt-master and it reliably triggers a race during bootstrap when the minion is restarting itself. There's a watch on the minion.conf that triggers a reload. After it reloads as part of the initial highstate, I see several of these per second on the master:
2014-12-09 14:10:10,799 [salt.crypt ][DEBUG ] Failed to authenticate message
2014-12-09 14:10:10,821 [salt.crypt ][DEBUG ] Failed to authenticate message
2014-12-09 14:10:10,843 [salt.crypt ][DEBUG ] Failed to authenticate message
2014-12-09 14:10:10,866 [salt.crypt ][DEBUG ] Failed to authenticate message
2014-12-09 14:10:10,888 [salt.crypt ][DEBUG ] Failed to authenticate message
2014-12-09 14:10:10,910 [salt.crypt ][DEBUG ] Failed to authenticate message
There are some other bugs with similar messages, but those seem to involve a key deletion during the run. This happens for me with an auto-accepted key and during a normal highstate run.
If I attach a debugger to the stuck process, I see that it's simply waiting for file data (i.e. salt://stuff/files/nsswitch.conf).
It seems to me that the session is not re-authenticating after the minion restart?
If I kill the job and restart the minion, things proceed correctly.