saltstack / salt

Software to automate the management and configuration of any infrastructure or application at scale. Install Salt from the Salt package repositories here:
https://docs.saltproject.io/salt/install-guide/en/latest/
Apache License 2.0

[BUG] Minion returns randomly missing in batch mode #58502

Open avgoor opened 4 years ago

avgoor commented 4 years ago

Description

Sometimes some minion returns are missing in the output of a batch mode command. For a command like salt -C 'I@salt:minion' --out=json -b'70%' cmd.run "salt-call cmd.run 'echo OK'" the output might look like:

....<<skip>>....
{
    "jid": "20200919150026255687",
    "retcode": 0,
    "node-11": "[INFO    ] Executing command 'echo OK' in directory '/root'\nlocal:\n    OK"
}
{
    "jid": "20200919150027708040",
    "node-12": "[INFO    ] Executing command 'echo OK' in directory '/root'\nlocal:\n    OK",
    "retcode": 0
}
{
    "node-6": {}
}
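A missing return like the `"node-6": {}` entry above can be spotted automatically. The following is a minimal sketch, assuming the captured stdout contains only the concatenated JSON documents shown above; the helper names `iter_json_documents` and `empty_returns` are hypothetical and not part of Salt.

```python
import json


def iter_json_documents(text):
    """Yield each JSON object from a stream of concatenated documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def empty_returns(text, meta_keys=("jid", "retcode")):
    """Return the minion IDs whose return value is empty ({}, "" or None)."""
    missing = []
    for doc in iter_json_documents(text):
        for key, value in doc.items():
            # "jid" and "retcode" are metadata keys, not minion IDs.
            if key in meta_keys:
                continue
            if value in ({}, "", None):
                missing.append(key)
    return missing
```

Feeding it the saved output of the batch command, for example `empty_returns(open("batch.json").read())` with a hypothetical file name, would list the affected minions for that run.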

Although all minions are alive, in some cases some outputs are missing. The minions are not overloaded and there are no networking issues. Oftentimes, in subsequent runs the same minion outputs are missing. When the issue occurs, a "find_job" job with the target of the missing minion appears in jobs.list_jobs:

20200919162151938541:
    ----------
    Arguments:
        - 20200919162051760278
    Function:
        saltutil.find_job
    StartTime:
        2020, Sep 19 16:21:51.938541
    Target:
        - node-6
    Target-type:
        list
    User:
        salt-user
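To check which minions were polled this way, the job list can be filtered for `saltutil.find_job` entries. This is only a sketch: it assumes the job list has already been loaded into a dict keyed by JID with the same field names as the listing above (for instance from the JSON output of the jobs.list_jobs runner), and `find_job_targets` is a hypothetical helper, not a Salt function.

```python
def find_job_targets(jobs):
    """Map each minion ID to the JIDs that saltutil.find_job polled for it."""
    polled = {}
    for job in jobs.values():
        if job.get("Function") != "saltutil.find_job":
            continue
        targets = job.get("Target") or []
        # A non-list target (e.g. a glob or compound expression) is a string.
        if isinstance(targets, str):
            targets = [targets]
        for minion in targets:
            # The find_job arguments are the JIDs being looked up.
            polled.setdefault(minion, []).extend(job.get("Arguments", []))
    return polled
```

A minion that shows up here and also has an empty return in the batch output for the polled JID is a candidate for this bug.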

Setup

This particular setup has 23 minions, but the configuration is tuned for much bigger environments. master.conf:

sock_pool_size: 15
zmq_backlog: 3000
gather_job_timeout: 30
worker_threads: 20
timeout: 60
max_open_files: 15000
state_output: changes
pillar_opts: False
pillar_safe_render_error: True
auto_accept: True
max_event_size: 100000000

minion.conf:

master: 10.0.0.1
id: node-6
acceptance_wait_time: 10
acceptance_wait_time_max: 60
auth_timeout: 180
master_tries: -1
max_event_size: 100000000
random_reauth_delay: 270
recon_default: 1000
recon_max: 60000
recon_randomize: True

Steps to Reproduce the behavior

Running salt -C 'I@salt:minion' --out=json -b'70%' cmd.run "salt-call cmd.run 'echo OK'" several times is enough. On my setup with 23 minions, the reproduction rate is almost 50%.

Expected behavior

All returns from all minions are present in the output.

Versions Report

Initially caught on the 2017.7.x series; also reproduced on 3000.3.

salt --versions-report

(Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)

```
Salt Version:
    Salt: 3000.3

Dependency Versions:
    cffi: 1.9.1
    cherrypy: 3.5.0
    dateutil: 2.4.2
    docker-py: 1.9.0
    gitdb: Not Installed
    gitpython: Not Installed
    Jinja2: 2.8
    libgit2: Not Installed
    M2Crypto: 0.21.1
    Mako: Not Installed
    msgpack-pure: Not Installed
    msgpack-python: 0.6.2
    mysql-python: Not Installed
    pycparser: 2.14
    pycrypto: 2.6.1
    pycryptodome: Not Installed
    pygit2: Not Installed
    Python: 2.7.12 (default, Nov 12 2018, 14:36:49)
    python-gnupg: 0.3.8
    PyYAML: 3.11
    PyZMQ: 15.2.0
    smmap: Not Installed
    timelib: Not Installed
    Tornado: 4.5.3
    ZMQ: 4.1.4

System Versions:
    dist: Ubuntu 16.04 xenial
    locale: UTF-8
    machine: x86_64
    release: 4.15.0-43-generic
    system: Linux
    version: Ubuntu 16.04 xenial
```

Additional context

Only the batch client is affected; the issue does not reproduce with any other client.

Ch3LL commented 4 years ago

Thanks for the PR and welcome :)

dmurphy18 commented 2 years ago

@avgoor Closing this issue since, as you said above, it is fixed by https://github.com/openSUSE/salt/pull/360, which is merged. This should have been closed automatically, but the format in the Pull Request looks a little off from what the bot would have expected. If the problem is still occurring, can you open a new issue for it, since this one references Python 2.7, which is no longer supported due to its EOL?

Ch3LL commented 2 years ago

Re-opening. That PR is not in the salt repo. The PR for this issue is https://github.com/saltstack/salt/pull/58503, and it is pending review.