We received two incidents over the last weeks about failing ACM validator pods. Root cause for both incidents were:
ACM module was not installed on the clustaer
But the ACM validator and gateway were deployed
Solution was to delete these orphan deployments. To avoid further incidents, we should check together with SREs the existence of further orphan deployments on all remaining productive clusters.
AC;
[ ] Implement and Run a script, with support of SREs, which checks on all productive SKR clusters :
Is the ACM module installed? If Yes: ensure the ACM validator and gateway deployment does not exist
[ ] Summarize the results of this maintenance step and inform the team about the outcomes
Reasons
Avoid further incidents caused by orphan ACM validator and gateway deployments.
Description
We received two incidents over the last weeks about failing ACM validator pods. Root cause for both incidents were:
Solution was to delete these orphan deployments. To avoid further incidents, we should check together with SREs the existence of further orphan deployments on all remaining productive clusters.
AC;
Reasons
Avoid further incidents caused by orphan ACM validator and gateway deployments.
Attachments