saltstack / salt

Software to automate the management and configuration of any infrastructure or application at scale. Get access to the Salt software package repository here:
https://repo.saltproject.io/
Apache License 2.0

[BUG] salt minion data cache (on master) does not reconnect to MySQL if the connection bounces. #61417

Open tkwilliams opened 2 years ago

tkwilliams commented 2 years ago

AWS frequently auto-applies bug fixes and/or minor version updates during maintenance windows, generally causing a short outage of the affected RDS instance(s). Each time this happens, I have to manually log in to the salt masters and run service salt-master restart, because the masters never retry the connection after the timeout.

I would expect that for such simple cases, the master would have the option of (or simply the default behaviour of) retrying the DB backend until it becomes available again. Traceback from master log:

2022-01-04 18:24:16,692 [salt.master      :1884][ERROR   ][3636] Error in function _pillar:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 756, in _write_bytes
    self._sock.sendall(data)
TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.7/site-packages/salt/cache/mysql_cache.py", line 108, in run_query
    out = cur.execute(query, args)
  File "/usr/local/lib/python3.7/site-packages/pymysql/cursors.py", line 148, in execute
    result = self._query(query)
  File "/usr/local/lib/python3.7/site-packages/pymysql/cursors.py", line 310, in _query
    conn.query(q)
  File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 547, in query
    self._execute_command(COMMAND.COM_QUERY, sql)
  File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 814, in _execute_command
    self._write_bytes(packet)
  File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 760, in _write_bytes
    CR.CR_SERVER_GONE_ERROR, "MySQL server has gone away (%r)" % (e,)
pymysql.err.OperationalError: (2006, "MySQL server has gone away (TimeoutError(110, 'Connection timed out'))")

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.7/site-packages/salt/master.py", line 1878, in run_func
    ret = getattr(self, func)(load)
  File "/usr/lib/python3.7/site-packages/salt/master.py", line 1580, in _pillar
    {"grains": load["grains"], "pillar": data},
  File "/usr/lib/python3.7/site-packages/salt/cache/__init__.py", line 145, in store
    return self.modules[fun](bank, key, data, **self._kwargs)
  File "/usr/lib/python3.7/site-packages/salt/loader/lazy.py", line 149, in __call__
    return self.loader.run(run_func, *args, **kwargs)
  File "/usr/lib/python3.7/site-packages/salt/loader/lazy.py", line 1201, in run
    return self._last_context.run(self._run_as, _func_or_method, *args, **kwargs)
  File "/usr/lib/python3.7/site-packages/salt/loader/lazy.py", line 1216, in _run_as
    return _func_or_method(*args, **kwargs)
  File "/usr/lib/python3.7/site-packages/salt/cache/mysql_cache.py", line 208, in store
    cur, cnt = run_query(__context__.get("mysql_client"), query, args)
  File "/usr/lib/python3.7/site-packages/salt/cache/mysql_cache.py", line 125, in run_query
    raise SaltCacheError("Error running {}: {}".format(query, e))
salt.exceptions.SaltCacheError: Error running REPLACE INTO cache (bank, etcd_key, data) values(%s,%s,%s): (2006, "MySQL server has gone away (TimeoutError(110, 'Connection timed out'))")
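
The innermost database error above carries MySQL client error code 2006 ("server has gone away"), which points at a dropped connection rather than a bad query. As a rough illustration only, here is a minimal sketch of how code could tell the two cases apart. It assumes the exception exposes the numeric code as its first argument, as the pymysql error in the traceback does. The helper name is hypothetical and is not part of Salt or pymysql.

```python
# MySQL client error codes that mean the connection itself was lost.
CR_SERVER_GONE_ERROR = 2006  # "MySQL server has gone away"
CR_SERVER_LOST = 2013        # "Lost connection to MySQL server during query"

CONNECTION_LOST_CODES = (CR_SERVER_GONE_ERROR, CR_SERVER_LOST)


def is_connection_lost(exc):
    """Return True if exc looks like a dropped-connection error.

    Hypothetical helper: assumes the numeric error code is exc.args[0].
    """
    return bool(exc.args) and exc.args[0] in CONNECTION_LOST_CODES
```

An error classified this way is a candidate for reconnecting and retrying; anything else (a syntax error, a missing table) should still be raised immediately.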

Master deets:

Salt Version:
          Salt: 3004

Dependency Versions:
          cffi: 1.14.6
      cherrypy: 5.6.0
      dateutil: Not Installed
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 2.10
       libgit2: 1.1.0
      M2Crypto: Not Installed
          Mako: 1.1.4
       msgpack: 0.5.6
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     pycparser: 2.20
      pycrypto: Not Installed
  pycryptodome: 3.6.1
        pygit2: 1.6.1
        Python: 3.7.10 (default, Jun  3 2021, 00:02:01)
  python-gnupg: Not Installed
        PyYAML: 4.2
         PyZMQ: 17.0.0
         smmap: Not Installed
       timelib: Not Installed
       Tornado: 4.5.3
           ZMQ: 4.2.3

System Versions:
          dist: amzn 2
        locale: UTF-8
       machine: x86_64
       release: 4.14.232-177.418.amzn2.x86_64
        system: Linux
       version: Amazon Linux 2

Thanks! t.

OrangeDog commented 2 years ago

What command triggers the exception? Does repeating it after the DB has come back not work?

tkwilliams commented 2 years ago

The exception occurs with any attempt to marshal pillar (salt-call pillar.items in this case) and of course any state functions which implicitly call into pillar. The above was actually run well after the DB was already back -- an automatic update had been performed the night before during the scheduled maintenance window. It's happened often enough now that I do a pillar.items first thing whenever I touch a node, just to see if the master has lost its connection to the DB again :)

That's the reason for this ticket: once the master has lost its connection to the DB, it will not reconnect until the salt-master service is restarted.
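
For illustration, the behaviour being asked for could look roughly like the sketch below: when a query fails because the connection is dead, throw away the stale handle, open a new one, and try again a bounded number of times. This is a sketch under stated assumptions, not Salt's mysql_cache code and not the linked patch. The function name, its parameters, and the ConnectionLost class are all hypothetical; with pymysql, the error to catch would presumably be the pymysql.err.OperationalError seen in the traceback.

```python
import time


class ConnectionLost(Exception):
    """Hypothetical stand-in for a driver's connection-lost error."""


def run_query_with_reconnect(connect, query, args=(), conn=None,
                             retries=3, delay=1.0,
                             lost_errors=(ConnectionLost,)):
    """Run a query, reconnecting if the cached connection has died.

    connect: zero-argument callable returning a DB-API style connection.
    conn: an existing (possibly stale) connection to try first.
    Returns (conn, cursor) so the caller can cache the live connection.
    """
    for attempt in range(retries + 1):
        try:
            if conn is None:
                conn = connect()  # may itself fail while the DB is down
            cur = conn.cursor()
            cur.execute(query, args)
            return conn, cur
        except lost_errors:
            if attempt == retries:
                raise  # out of attempts; surface the original error
            conn = None  # drop the dead handle, reconnect on the next pass
            time.sleep(delay)
```

The caller would store the returned connection in place of the old one, so that later queries reuse the fresh handle instead of failing until the service is restarted.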

tkwilliams commented 2 years ago

Partial fix at https://github.com/saltstack/salt/pull/61906

tkwilliams commented 2 years ago

OK, current v3004.1 is completely broken WRT minion cache, unrelated to this fix. This patch works on my older v3004, but v3004.1 is borked, with or without this patch.

Closing this one - will submit a larger patch to resolve 3004.1 breakage along with this smaller fix.

tkwilliams commented 2 years ago

Oook, I meant to close the PR, not the original issue. Re-opening....