Open tkwilliams opened 2 years ago
What command triggers the exception? Does repeating it after the DB has come back not work?
The exception occurs with any attempt to marshal pillar (salt-call pillar.items in this case) and of course any state functions which implicitly call into pillar. The above was actually run well after the DB was already back -- an automatic update had been performed the night before during the scheduled maintenance window. It's happened often enough now that I do a pillar.items first thing whenever I touch a node, just to see if the master has lost connex to the DB again :)
That's the reason for this ticket - if the master has once lost its binding to the DB, it will not reconnect until the salt-master service is restarted.
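For illustration only, here is a minimal sketch of the reconnect-and-retry behaviour described above. It is not Salt's code and not the approach taken in the linked PR; `ReconnectingClient`, `connect`, and `operation` are hypothetical names, and the choice of `ConnectionError` as the retryable error is an assumption (a real DB driver raises its own exception types). The idea is simply to drop a cached connection when a call fails and rebuild it with exponential backoff instead of holding on to a dead handle.

```python
import time


class ReconnectingClient:
    """Hypothetical wrapper: caches one connection and rebuilds it on failure."""

    def __init__(self, connect, retries=5, base_delay=0.5, max_delay=30.0,
                 errors=(ConnectionError,), sleep=time.sleep):
        self._connect = connect        # zero-argument factory returning a connection
        self._retries = retries        # number of retries after the first attempt
        self._base_delay = base_delay  # first backoff delay, in seconds
        self._max_delay = max_delay    # upper bound on any single delay
        self._errors = errors          # exception types treated as "connection lost"
        self._sleep = sleep            # injectable so tests need not actually wait
        self._conn = None

    def run(self, operation):
        """Call operation(conn), reconnecting and retrying on connection errors."""
        last_error = None
        for attempt in range(self._retries + 1):
            try:
                if self._conn is None:
                    self._conn = self._connect()
                return operation(self._conn)
            except self._errors as exc:
                last_error = exc
                self._conn = None  # drop the stale handle so the next try reconnects
                if attempt < self._retries:
                    delay = min(self._base_delay * 2 ** attempt, self._max_delay)
                    self._sleep(delay)
        raise last_error
```

A caller would wrap each query, e.g. `client.run(lambda conn: conn.query())`, where `query` stands in for whatever the real driver exposes. With something along these lines, a short outage during a maintenance window would cost a few delayed lookups rather than requiring a service restart.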
Partial fix at https://github.com/saltstack/salt/pull/61906
OK, current v3004.1 is completely broken WRT minion cache, unrelated to this fix. This patch works on my older v3004, but v3004.1 is borked, with or without this patch.
Closing this one - will submit a larger patch to resolve 3004.1 breakage along with this smaller fix.
Oook, I meant to close the PR, not the original issue. Re-opening....
AWS frequently auto-applies bug-fixes and/or minor version updates during maintenance windows, generally causing a short outage of the RDS instance(s) affected. I've found that each time this happens, I'm forced to manually log in to the salt masters and run service salt-master restart, because the masters do not retry after the timeout. I would expect that for such simple cases, the master would have the option of (or simply the default behaviour of) retrying the DB backend until it becomes available again.
Traceback from master log:
Master deets:
Thanks! t.