Improve Resilience of CoreDNS Manager Operator for Disaster Scenarios

Is your feature request related to a problem? Please describe.

During redundancy tests (described here), I found that overall failover works, but problems appear in disaster scenarios. The CoreDNS Manager Operator is not resilient enough: in rare situations CoreDNS stops serving zones, and service only recovers after a restart of CoreDNS or the operator. The operator should monitor CoreDNS events or perform resolve monitoring and, if a problem occurs, trigger a reload of CoreDNS. In my failover tests, name resolution was also sometimes disrupted due to load balancer behavior; more failover and scale tests are needed to investigate this. All tests were done using k3s.

Describe the solution you'd like

I would like the operator to watch CoreDNS events and perform resolve monitoring. If an issue is detected, it should automatically trigger a reload of CoreDNS so that it continues serving zones properly.

Describe alternatives you've considered

Manually reloading CoreDNS when an issue is detected.
Using external monitoring tools to watch CoreDNS and trigger reloads.

Additional context

These issues were found during extensive failover tests on a k3s cluster. Improving the resilience of the CoreDNS Manager Operator should make DNS service more reliable in air-gapped environments.

monkale-io / coredns-manager-operator