feat: add console manager supervisor logic w/ restart option

floreks commented 4 days ago

Added a supervised controller start that can restart a controller when its last poll/reconcile time indicates that it might have died.
Updated the Reconciler interface and the Controller implementation so that a controller can be restarted.
Replaced the 3 different logger implementations used across the codebase with a single klog logger.
The poll/refresh/jitter interval args are now correctly used by the controllers.
Cleaned up the console Reconcilers; some struct fields were not used anywhere.
Refactored the gate cache queue into a standalone cache that can now be safely reused by multiple goroutines.
Refactored queue usage across the console reconcilers to use a getter instead of a reference to a variable.
Refactored the PollUntilContextCancel usage in the console controller manager so that it no longer relies on our internal method implementation to decide when to stop polling. The internal method now only returns an error, which can be logged, while the poll function always returns false, nil (it never stops).
Added a controller restart counter metric to track the number of restarts per controller (if any).
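
The PollUntilContextCancel change above can be illustrated with a minimal, stdlib-only sketch. It is not the code from this PR: `conditionFunc` is a local stand-in with the same shape as the condition passed to `wait.PollUntilContextCancel`, and `neverStop`, `runFailingStep`, and the `logf` hook are hypothetical names used only for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// conditionFunc is a stand-in for a poll condition: returning done=true or a
// non-nil error would make the polling loop stop.
type conditionFunc func(ctx context.Context) (done bool, err error)

// neverStop wraps a poll step so that its error is only logged. The returned
// condition always reports (false, nil), so only context cancellation can end
// the polling loop. logf is a hypothetical logging hook.
func neverStop(step func(ctx context.Context) error, logf func(format string, args ...any)) conditionFunc {
	return func(ctx context.Context) (bool, error) {
		if err := step(ctx); err != nil {
			logf("poll step failed: %v", err)
		}
		return false, nil
	}
}

// runFailingStep runs the wrapped condition once with a step that always
// fails and reports what the poll loop would see, plus the number of log calls.
func runFailingStep() (done bool, logged int, err error) {
	failing := func(context.Context) error { return errors.New("console unreachable") }
	logf := func(format string, args ...any) {
		logged++
		fmt.Printf(format+"\n", args...)
	}
	done, err = neverStop(failing, logf)(context.Background())
	return done, logged, err
}

func main() {
	done, logged, err := runFailingStep()
	fmt.Println(done, logged, err)
}
```

With this shape, a failing step shows up in the logs but cannot end the loop, so the decision to stop stays with the context owner.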

linear[bot] commented 4 days ago

PROD-2611 deployment operator service reconcilers died

floreks commented 3 days ago

What might be problematic with this approach is detecting whether the controller is still running. The heartbeat in this case is the last poll time. Since we know how often polling should be executed, we can compare the time elapsed since the last poll with that interval to see if the controller could be dead.

Recovering from a panic technically does not help us much: if the controller panics, the app should crash and the pod will be restarted anyway.

We should try to avoid a situation where there is no panic but the controller has, for some unknown reason, stopped polling/reconciling.
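
The heartbeat check described in this comment could look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's implementation: `deadAfter` and `looksDead` are hypothetical helpers, and the factor of 2 applied to the poll interval plus jitter is an assumed safety margin that the PR does not specify.

```go
package main

import (
	"fmt"
	"time"
)

// deadAfter returns how long a controller may go without polling before it is
// treated as possibly dead. The factor of 2 is an assumed safety margin.
func deadAfter(pollInterval, jitter time.Duration) time.Duration {
	return 2 * (pollInterval + jitter)
}

// looksDead reports whether the time elapsed since the last poll exceeds the
// allowed gap for the given poll interval and jitter.
func looksDead(sinceLastPoll, pollInterval, jitter time.Duration) bool {
	return sinceLastPoll > deadAfter(pollInterval, jitter)
}

func main() {
	pollInterval, jitter := 30*time.Second, 5*time.Second

	// Pretend the controller last polled two minutes ago.
	lastPoll := time.Now().Add(-2 * time.Minute)

	if looksDead(time.Since(lastPoll), pollInterval, jitter) {
		fmt.Println("controller looks dead, a supervisor would restart it here")
	}
}
```

A margin above one poll interval keeps a single slow reconcile from triggering a restart; how large it should be depends on how long a reconcile can legitimately take.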

maciaszczykm commented 3 days ago

I reviewed it as well, and then we talked about it with @floreks and @zreigz. It looks good to me; pollers getting stuck for any reason should not happen anymore. One thing that could be added is validation of the args, to avoid situations such as the poll interval or jitter being too short.
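
The suggested args validation might be sketched as follows. Everything here is an assumption for illustration: the PR does not define minimum values, so `minPollInterval`, `minJitter`, and `validateIntervals` are hypothetical names with placeholder bounds.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Assumed lower bounds, chosen only for this example.
const (
	minPollInterval = 5 * time.Second
	minJitter       = 100 * time.Millisecond
)

// validateIntervals returns an error describing every interval that is below
// its assumed minimum, or nil when both values are acceptable.
func validateIntervals(pollInterval, jitter time.Duration) error {
	var errs []error
	if pollInterval < minPollInterval {
		errs = append(errs, fmt.Errorf("poll interval %s is below the minimum %s", pollInterval, minPollInterval))
	}
	if jitter < minJitter {
		errs = append(errs, fmt.Errorf("jitter %s is below the minimum %s", jitter, minJitter))
	}
	// errors.Join returns nil when errs is empty.
	return errors.Join(errs...)
}

func main() {
	if err := validateIntervals(time.Second, time.Millisecond); err != nil {
		fmt.Println("invalid args:")
		fmt.Println(err)
	}
}
```

Running such a check at startup would reject a misconfigured interval before any controller is started, rather than leaving it to surface later as overly aggressive polling.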

pluralsh / deployment-operator

feat: add console manager supervisor logic w/ restart option #270