saltstack / salt

Software to automate the management and configuration of any infrastructure or application at scale. Install Salt from the Salt package repositories here:
https://docs.saltproject.io/salt/install-guide/en/latest/
Apache License 2.0

[BUG] Cannot clean up corrupted cache when pillar_cache_backend=disk #62527

Closed rgeoghegan closed 1 year ago

rgeoghegan commented 2 years ago

Description

If I use the pillar_cache_backend: "disk" config option and the on-disk msgpack file for a minion gets corrupted, the pillar comes back blank, and any attempt to run pillar.clear_pillar_cache crashes, even after restarting the salt-master.

Setup

I am using salt 3004.2 from the yum repo:

[root@saltmaster /]# yum info salt-master
Loaded plugins: fastestmirror, ovl
Loading mirror speeds from cached hostfile
 * base: mirror.its.dal.ca
 * extras: centos.les.net
 * updates: mirror.its.dal.ca
Installed Packages
Name        : salt-master
Arch        : noarch
Version     : 3004.2
Release     : 1.el7
Size        : 3.2 M
Repo        : installed
From repo   : salt-3004-repo
Summary     : Management component for salt, a parallel remote execution system
URL         : http://saltstack.org/
License     : ASL 2.0
Description : The Salt master is the central server to which all minions connect.
            : Supports Python 3.

I set up a system with a master and a minion with one pillar file:

my_pillar.sls

my_pillar:
  salt_rules: "rules"

Steps to Reproduce the behavior

I start with my pillar working as expected:

[root@saltmaster /]# salt \* pillar.items
saltminion:
    ----------
    my_pillar:
        ----------
        salt_rules:
            rules

I append junk to the minion's pillar cache file so that it is no longer a valid msgpack file:

[root@saltmaster /]# echo fff >> /var/cache/salt/master/pillar_cache/saltminion

Now my pillar is reported as empty:

[root@saltmaster /]# salt \* pillar.items
saltminion:
    ----------

And the master log has an exception:

[INFO    ] 17:13:09 User root Published command pillar.items with jid 20220824171309872817
[ERROR   ] 17:13:10 Error in function _pillar:
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/salt/master.py", line 1917, in run_func
    ret = getattr(self, func)(load)
  File "/usr/lib/python3.6/site-packages/salt/master.py", line 1611, in _pillar
    extra_minion_data=load.get("extra_minion_data"),
  File "/usr/lib/python3.6/site-packages/salt/pillar/__init__.py", line 81, in get_pillar
    pillarenv=pillarenv,
  File "/usr/lib/python3.6/site-packages/salt/pillar/__init__.py", line 408, in __init__
    minion_cache_path=self._minion_cache_path(minion_id),
  File "/usr/lib/python3.6/site-packages/salt/utils/cache.py", line 34, in factory
    return CacheDisk(ttl, kwargs["minion_cache_path"], *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/salt/utils/cache.py", line 89, in __init__
    self._read()
  File "/usr/lib/python3.6/site-packages/salt/utils/cache.py", line 147, in _read
    salt.utils.msgpack.load(fp_, encoding=__salt_system_encoding__)
  File "/usr/lib/python3.6/site-packages/salt/utils/msgpack.py", line 145, in unpack
    return msgpack.unpack(stream, **_sanitize_msgpack_unpack_kwargs(kwargs))
  File "/usr/lib64/python3.6/site-packages/msgpack/__init__.py", line 57, in unpack
    return unpackb(data, **kwargs)
  File "msgpack/_unpacker.pyx", line 209, in msgpack._cmsgpack.unpackb
msgpack.exceptions.ExtraData: unpack(b) received extra data.
[INFO    ] 17:13:10 Got return from saltminion for job 20220824171309872817

Running clear_pillar_cache does not work:

[root@saltmaster /]# salt-run pillar.clear_pillar_cache
Exception occurred in runner pillar.clear_pillar_cache: Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/salt/client/mixins.py", line 390, in low
    data["return"] = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/salt/loader/lazy.py", line 149, in __call__
    return self.loader.run(run_func, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/salt/loader/lazy.py", line 1201, in run
    return self._last_context.run(self._run_as, _func_or_method, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/contextvars/__init__.py", line 38, in run
    return callable(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/salt/loader/lazy.py", line 1216, in _run_as
    return _func_or_method(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/salt/runners/pillar.py", line 140, in clear_pillar_cache
    __opts__, grains, id_, saltenv, pillarenv=pillarenv
  File "/usr/lib/python3.6/site-packages/salt/pillar/__init__.py", line 408, in __init__
    minion_cache_path=self._minion_cache_path(minion_id),
  File "/usr/lib/python3.6/site-packages/salt/utils/cache.py", line 34, in factory
    return CacheDisk(ttl, kwargs["minion_cache_path"], *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/salt/utils/cache.py", line 89, in __init__
    self._read()
  File "/usr/lib/python3.6/site-packages/salt/utils/cache.py", line 147, in _read
    salt.utils.msgpack.load(fp_, encoding=__salt_system_encoding__)
  File "/usr/lib/python3.6/site-packages/salt/utils/msgpack.py", line 145, in unpack
    return msgpack.unpack(stream, **_sanitize_msgpack_unpack_kwargs(kwargs))
  File "/usr/lib64/python3.6/site-packages/msgpack/__init__.py", line 57, in unpack
    return unpackb(data, **kwargs)
  File "msgpack/_unpacker.pyx", line 209, in msgpack._cmsgpack.unpackb
msgpack.exceptions.ExtraData: unpack(b) received extra data.
[root@saltmaster /]# echo $?
0
[root@saltmaster /]# salt \* pillar.items
saltminion:
    ----------
[root@saltmaster /]#

And all this behaviour persists even if the salt-master is restarted.

If I delete the cache file, everything returns to normal:

[root@saltmaster /]# rm -f /var/cache/salt/master/pillar_cache/saltminion
[root@saltmaster /]# salt \* pillar.items
saltminion:
    ----------
    my_pillar:
        ----------
        salt_rules:
            rules

Expected behavior

IMHO, an unreadable cache file should be treated as a missing cache, and just cause the pillar to be rebuilt.

Versions Report

salt --versions-report

```yaml
[root@saltmaster /]# salt --versions-report
Salt Version:
    Salt: 3004.2

Dependency Versions:
    cffi: Not Installed
    cherrypy: Not Installed
    dateutil: Not Installed
    docker-py: Not Installed
    gitdb: Not Installed
    gitpython: Not Installed
    Jinja2: 2.11.1
    libgit2: Not Installed
    M2Crypto: 0.35.2
    Mako: Not Installed
    msgpack: 0.6.2
    msgpack-pure: Not Installed
    mysql-python: Not Installed
    pycparser: Not Installed
    pycrypto: Not Installed
    pycryptodome: Not Installed
    pygit2: Not Installed
    Python: 3.6.8 (default, Nov 16 2020, 16:55:22)
    python-gnupg: Not Installed
    PyYAML: 3.13
    PyZMQ: 17.0.0
    smmap: Not Installed
    timelib: Not Installed
    Tornado: 4.5.3
    ZMQ: 4.1.4

System Versions:
    dist: centos 7 Core
    locale: UTF-8
    machine: x86_64
    release: 5.10.104-linuxkit
    system: Linux
    version: CentOS Linux 7 Core
```

rgeoghegan commented 2 years ago

@Ch3LL Hi! I was on the salt community call last week, and I promised to file the bug I was trying to describe.

What I could also do is submit a patch that wraps the msgpack read in a try/except and treats any relevant exception (msgpack errors, file not found, etc.) the same as "the file is missing", which should cause the cache to be rebuilt and saved properly.
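A minimal sketch of the idea described in the comment above, not the actual Salt patch. It assumes a cache that holds one serialized dict per file and uses the standard-library `json` module as a stand-in for msgpack so that it runs without extra dependencies. The names `read_cache`, `write_cache`, and `get_or_rebuild` are hypothetical and do not exist in Salt.

```python
import json
import os
import tempfile


def read_cache(path):
    """Return the cached dict, or None if the file is missing or unreadable."""
    try:
        with open(path, "rb") as fp_:
            data = json.loads(fp_.read())
    except (OSError, ValueError):
        # OSError covers a missing or unreadable file. ValueError covers
        # json.JSONDecodeError, including trailing garbage ("Extra data").
        # Assumption: a msgpack-based reader would list its own exception
        # types here instead.
        return None
    # A decodable file that holds the wrong type is treated as missing too.
    return data if isinstance(data, dict) else None


def write_cache(path, data):
    """Write the cache through a temp file so readers never see a partial file."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as fp_:
        fp_.write(json.dumps(data).encode("utf-8"))
    os.replace(tmp_path, path)  # atomic rename over the old (possibly corrupt) file


def get_or_rebuild(path, rebuild):
    """Use the cache when it is readable; otherwise rebuild it and save it."""
    cached = read_cache(path)
    if cached is not None:
        return cached
    fresh = rebuild()  # hypothetical callable that compiles the pillar
    write_cache(path, fresh)
    return fresh
```

With this shape, a corrupted file costs one extra rebuild instead of leaving the minion with an empty pillar, and the next write replaces the bad file.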

Ch3LL commented 2 years ago

Looks like I'm able to replicate this. If you submit a PR, I will be more than willing to review and test it. I haven't gone into the code yet, but I will when you submit the PR and make sure it's the correct fix.

rgeoghegan commented 2 years ago

FYI it took a while because I was out on vacation, but I just put up the PR.

dwoz commented 2 years ago

@rgeoghegan Do we know the reason this file is getting corrupted in the first place?

rgeoghegan commented 2 years ago

@dwoz Nothing is specifically corrupting the file, but I was playing with clearing the pillar cache file by just deleting it, and noticed a race condition in the code (along with this bug), and saw that if the file is corrupted, there is no way to recover other than manually deleting the disk cache file.
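Until a fix is in place, the manual recovery described above could be scripted along these lines. This is a rough sketch under stated assumptions, not a Salt tool: `find_corrupt_caches` and `remove_corrupt_caches` are hypothetical names, and the default decoder is `json.loads` as a stand-in so the example runs with the standard library only. For real pillar cache files a msgpack decoder would have to be passed as `decode`, and any exception types it raises that are not `ValueError` subclasses would need to be added to the `except` clause.

```python
import json
import os


def find_corrupt_caches(cache_dir, decode=json.loads):
    """Return the paths of files in cache_dir that fail to decode."""
    corrupt = []
    for entry in sorted(os.listdir(cache_dir)):
        path = os.path.join(cache_dir, entry)
        if not os.path.isfile(path):
            continue
        try:
            with open(path, "rb") as fp_:
                decode(fp_.read())
        except FileNotFoundError:
            # The file vanished between listdir() and open(); skip it.
            continue
        except ValueError:
            corrupt.append(path)
    return corrupt


def remove_corrupt_caches(cache_dir, decode=json.loads):
    """Delete undecodable cache files and return the paths that were removed."""
    removed = []
    for path in find_corrupt_caches(cache_dir, decode):
        try:
            os.remove(path)
        except FileNotFoundError:
            # Another process already deleted it; nothing left to do.
            continue
        removed.append(path)
    return removed
```

Running `find_corrupt_caches` first gives a dry-run list to review before anything is deleted.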

Ch3LL commented 1 year ago

Closed by https://github.com/saltstack/salt/pull/62760