saltstack / salt

Software to automate the management and configuration of any infrastructure or application at scale. Install Salt from the Salt package repositories here:
https://docs.saltproject.io/salt/install-guide/en/latest/
Apache License 2.0

"Failed to authenticate message" during minion restart while running highstate with 2014.7.0 #18835

Closed: kev009 closed this issue 6 years ago

kev009 commented 9 years ago

I have a minion running directly on the salt-master and it reliably triggers a race during bootstrap when the minion is restarting itself. There's a watch on the minion.conf that triggers a reload. After it reloads as part of the initial highstate, I see several of these per second on the master:

2014-12-09 14:10:10,799 [salt.crypt ][DEBUG ] Failed to authenticate message
2014-12-09 14:10:10,821 [salt.crypt ][DEBUG ] Failed to authenticate message
2014-12-09 14:10:10,843 [salt.crypt ][DEBUG ] Failed to authenticate message
2014-12-09 14:10:10,866 [salt.crypt ][DEBUG ] Failed to authenticate message
2014-12-09 14:10:10,888 [salt.crypt ][DEBUG ] Failed to authenticate message
2014-12-09 14:10:10,910 [salt.crypt ][DEBUG ] Failed to authenticate message

There are some other bugs with similar messages, but those seem to involve a key deletion during the run. This happens for me with an auto-accepted key during a normal highstate run.

If I attach a debugger to the stuck process, I see that it's simply waiting for file data (i.e. salt://stuff/files/nsswitch.conf).

It seems to me that the session is not re-authenticating after the minion restart?

If I kill the job and restart the minion, things proceed correctly.

kev009 commented 9 years ago

Possibly relevant? https://github.com/saltstack/salt/issues/14499

kev009 commented 9 years ago

I tried adding this to the minion.conf:

ping_interval: 1
auth_safemode: True
restart_on_error: True
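
The intent of that combination: ping_interval makes the minion ping the master every N minutes, and with auth_safemode and restart_on_error set, an authentication failure during that ping should cause the minion to restart itself and re-key.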

No recovery after 15 pings:

2014-12-09 14:54:06,179 [salt.minion      ][DEBUG   ] Ping master
2014-12-09 14:54:06,295 [salt.crypt       ][DEBUG   ] Decrypting the current master AES key
2014-12-09 14:54:06,296 [salt.crypt       ][DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem

kev009 commented 9 years ago

The state that causes the reload looks like this:

# vim: sts=2 ts=2 sw=2 et ai

{% from 'pepper/map.jinja' import defaults with context %}
{% from 'salt/map.jinja' import salt_map, minion_config with context %}

minion_conf_clean_up:
  file.absent:
    - names:
      - {{ defaults['config-path'] }}/salt/minion.d/99-master-address.conf

salt-minion:
  pkg.installed:
    - name: {{ salt_map['salt-minion'] }}

salt-minion-conf:
  file.managed:
    - name: {{ defaults['config-path'] }}/salt/minion.d/minion.conf
    - template: jinja
    - makedirs: True
    - source: salt://salt/templates/minion.jinja
    - config: {{ minion_config }}

salt-minion-running:
  service.running:
    - enable: True
    - name: {{ salt_map['minion-service'] }}
    - watch:
      - pkg: {{ salt_map['salt-minion'] }}
      - file: minion_conf_clean_up
      - file: salt-minion-conf

kev009 commented 9 years ago

It spins pretty aggressively here: https://github.com/saltstack/salt/blob/2014.7/salt/fileclient.py#L998

kev009 commented 9 years ago

Is there a best practice or workaround for salting a minion?

basepi commented 9 years ago

@kev009 can you clarify your final question? The best practice for bootstrapping a salt-minion is the salt-bootstrap script.

On some service systems, salt does have a little bit of trouble restarting itself. The most consistently successful method I've seen is using the at utility, as shown in this comment: https://github.com/saltstack/salt/issues/7997#issuecomment-30978123
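
(The gist of the at method: hand the restart to atd so it runs outside the minion process, along the lines of echo 'service salt-minion restart' | at now + 1 minute. The exact command is illustrative; see the linked comment for the original.)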

kev009 commented 9 years ago

@basepi yes, I am bootstrapping with salt-bootstrap. I'm calling 'salt * state.highstate' manually, with only a master and a minion running on the master, to apply the initial minion conf on that master. The result is the hangs chronicled here.

The system is Ubuntu trusty, running the stable Helium repos/deps.

basepi commented 9 years ago

Can you try using a cmd.run with the at command I mentioned above and an onchanges requisite to emulate the watch behavior you're using now? Wondering if that will help for now.
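
A minimal sketch of that approach (the salt-minion-conf state ID is borrowed from the SLS above; the restart command and timing are illustrative only):

# Sketch only: schedule the restart through at(1) so it runs outside
# the minion process that is applying the state.
salt-minion-restart:
  cmd.run:
    - name: "echo 'service salt-minion restart' | at now + 1 minute"
    - python_shell: True
    - onchanges:
      - file: salt-minion-conf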

kev009 commented 9 years ago

@basepi sure, I will give this a shot for now.

This will hamper bootstrapping and testing, though. For instance, I want to test highstate on a fresh master in a CI environment, and some of the minion config changes need a reload to reach full convergence of the state there.

basepi commented 9 years ago

Yep, we definitely need to get minion and master restarts solid. Thanks for the feedback.

JaseFace commented 9 years ago

It looks like https://github.com/saltstack/salt/issues/5721 is the logged bug for minions not being able to restart themselves. It sounds like using 'at' or some external service to manage the minion restart is the recommended method until the key renegotiation issues on a restart can be resolved?

kev009 commented 9 years ago

@basepi I'm not super pleased with this, but it's a sledgehammer that works in an idempotent bootstrap. Empirically, it appears to DTRT (do the right thing).

The first 'salt master state.highstate' run unfortunately gives no job feedback (fixed with the soft timeout in the next comment). The second run will finish up and confirm convergence.

The service stop is used to quiesce any new jobs, then the pkill prevents races with long-running states or hangs with stuck states (the 'at' method is particularly vulnerable to these problems).

salt-minion-reload:
  llcmd.wait:
    - name: |
        nohup echo "sleep 10 && salt-call --local service.stop {{ salt_map['minion-service'] }}; pkill -f 'python.*{{ salt_map['minion-process'] }}'; salt-call --local service.restart {{ salt_map['minion-service'] }}" | sh -s
    - python_shell: True
    - timeout: 2
    - order: last
    - watch:
      - pkg: salt-minion-pkg
      - file: minion_conf_clean_up
      - file: salt-minion-conf
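
(Note: llcmd here is the custom wrapper state module from the next comment; with the stock cmd.wait, the timeout: 2 would mark this state failed even though the restart is intentionally left running in the background.)
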
kev009 commented 9 years ago

I had to monkey patch the cmd.watch state module to get this all to work. Will open a PR for soft timeouts in the cmd execution module.

# -*- coding: utf-8 -*-

'''
Wrappers for cmd.run and cmd.wait that ignore timeouts (e.g. for nohup'd restarts)
'''

import salt.loader

def run(name, **kwargs):
    '''
    Facade for cmd.run
    '''
    # Build a states loader so the real cmd.run state can be called from
    # inside this wrapper: https://github.com/saltstack/salt/issues/3513
    opts = __opts__.copy()
    opts['grains'] = __grains__
    states = salt.loader.states(opts, __salt__)

    # Substring of the message cmdmod produces when a command times out
    trigger = 'Timed out after'

    ret = states['cmd.run'](name, **kwargs)
    # If the command "failed" only because it hit its timeout, flip the
    # state to success; the backgrounded restart is still in flight.
    if 'retcode' in ret['changes'] and 'stdout' in ret['changes'] \
            and ret['changes']['retcode'] == 1 \
            and trigger in ret['changes']['stdout']:
        ret['changes']['retcode'] = 0
        ret['result'] = True
    return ret

def wait(name, **kwargs):
    '''
    Facade for cmd.wait
    '''
    # Noop. The state system will call the mod_watch function instead.
    return {'name': name, 'changes': {}, 'result': True, 'comment': ''}

mod_watch = run
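
(To use this wrapper, drop it into your file_roots as _states/llcmd.py and sync it to minions, e.g. with saltutil.sync_states, so the llcmd.wait state above resolves.)
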
rallytime commented 9 years ago

@kev009 We're making rounds on some old issues. It looks like you said you had a fix for this, but I don't see a linked pull request that fixed the issue here. Did you ever submit one? Is this still an issue on newer versions of salt? (We're currently on 2015.5.1.)

kev009 commented 9 years ago

@rallytime the workaround is the nohup restart with a cmd.run/wait that does not timeout. Should I submit those changes to the cmdmod?

thatch45 commented 9 years ago

Please do :)

rallytime commented 9 years ago

@kev009 Thanks for submitting that pull request. I've just merged it in. Is this issue good to close now?

kev009 commented 9 years ago

@rallytime this doesn't actually solve the bug. That will need somebody who can dig into the entire salt run life cycle and somehow trigger renegotiation with the master when this happens.

I think at minimum we need to document the workaround using cmd with ignore_timeout somewhere:

salt-minion-reload:
  cmd.wait:
    - name: |
        nohup echo "sleep 10 && salt-call --local service.stop {{ salt_map['minion-service'] }}; pkill -f 'python.*{{ salt_map['minion-process'] }}'; salt-call --local service.restart {{ salt_map['minion-service'] }}" | sh -s
    - ignore_timeout: True
    - python_shell: True
    - timeout: 2
    - order: last
    - watch:
      - pkg: salt-minion-pkg
      - file: salt-minion-conf

rallytime commented 9 years ago

Ok thanks for clarifying.

@jacobhammons when you get a moment can you add something like this to the documentation (mentioned in the above comment) until this bug can be fully resolved?

vutny commented 8 years ago

@rallytime I guess it is already covered in the FAQ: https://docs.saltstack.com/en/latest/faq.html#what-is-the-best-way-to-restart-a-salt-daemon-using-salt
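
For reference, the pattern that FAQ entry describes is to restart the minion from a backgrounded process so the in-flight state run is not cut off. A rough sketch (state IDs are assumed here, and bg requires a release that supports it, 2015.8 or later):

restart-salt-minion:
  cmd.run:
    - name: 'salt-call --local service.restart salt-minion'
    - bg: True
    - onchanges:
      - file: salt-minion-conf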

rallytime commented 8 years ago

Good find @vutny. I'll remove the "documentation" label.

stale[bot] commented 6 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.