pluralsh / deployment-operator

thin kubernetes agent to execute deployments of plural services
1 stars 1 forks source link

feat: add console manager supervisor logic w/ restart option #270

Open floreks opened 4 days ago

floreks commented 4 days ago
linear[bot] commented 4 days ago

PROD-2611 deployment operator service reconcilers died

floreks commented 3 days ago

What might be problematic with this approach is detecting if the controller is still running or not. Heartbeat in this case is the last poll time. Since we have information about how often polling should be executed, we can calculate the time difference between last poll time and current time to see if controller could be dead.

Recovering from panic technically does not help us much since if it will panic the app should crash and pod will be restarted anyway.

We should try to avoid a situation where there is no panic but controller for some unknown reason stopped polling/reconciling.

maciaszczykm commented 3 days ago

I reviewed as well, then we talked about it with @floreks and @zreigz. It looks good to me, issues with pollers being stuck for any reasons should not happen anymore. One thing that can be added is validation for args to avoid situations like poll interval or jitter being too short.