Closed sbueringer closed 2 years ago
@fabriziopandini @vincepri @randomvariable Opinions?
/assign @Karthik-K-N
/milestone v1.1
Based on the discussion on https://github.com/kubernetes-sigs/cluster-api/pull/5966 we don't want to recover panics, so that users can't miss the panic because the controller crashloops. To improve the logging of panics in cases where the panic should not be recovered I've opened an issue in controller-runtime: https://github.com/kubernetes-sigs/controller-runtime/issues/1793
/retitle Log panics with current log context during reconcile
User Story
As an operator, it would be great if ClusterAPI automatically recovered from panics in reconciles and logged the panic instead of failing entirely.
Detailed Description
Currently, if there is a panic in a CAPI controller the whole controller fails / shuts down. This means that if there is a problem with the reconciliation of an individual resource the whole controller fails and becomes useless. It would be great if we instead recover that panic and just return an error in that specific Reconcile call.
Furthermore, it's hard to debug panics because there is currently no way to correlate a panic to a specific resource.
Cases where recover would have helped (just from yesterday):
Anything else you would like to add:
I would propose to add the following statement at the top of all our Reconcile funcs:
This leads to the following log message and the controller does not fail:
That looks hard to read, but with a proper logging UI, it's easy to extract the following information (which would have been hard to infer otherwise):
This is potentially something which should be addressed in controller-runtime instead.
/kind feature
/area health