orange-cloudfoundry / k3s-wrapper-boshrelease

k3s wrapper scripts bosh release
Apache License 2.0

Add crashloop back off for k3s-server release #63

Open gberche-orange opened 1 month ago

gberche-orange commented 1 month ago

Expected behavior

As an operator, in order to avoid crash loops that go unnoticed and mask the root cause of errors such as https://github.com/orange-cloudfoundry/paas-templates/issues/2398, I need k3s-wrapper-boshrelease to back off when k3s-server enters a crash loop.

Observed behavior

tail -f -n 200 /var/vcap/monit/monit.log

#> [UTC Aug  1 10:41:31] info     : 'k3s-server' start: /var/vcap/jobs/k3s-server/bin/ctl
#> [UTC Aug  1 10:41:41] info     : 'k3s-server' process is running with pid 366216
#> [UTC Aug  1 10:42:41] error    : 'k3s-server' process is not running
#> [UTC Aug  1 10:42:41] info     : 'k3s-server' trying to restart
#> [UTC Aug  1 10:42:41] info     : 'k3s-server' start: /var/vcap/jobs/k3s-server/bin/ctl
#> [UTC Aug  1 10:42:52] info     : 'k3s-server' process is running with pid 366278
#> [UTC Aug  1 10:43:42] error    : 'k3s-server' process is not running
#> [UTC Aug  1 10:43:42] info     : 'k3s-server' trying to restart
#> [UTC Aug  1 10:43:42] info     : 'k3s-server' start: /var/vcap/jobs/k3s-server/bin/ctl
#> [UTC Aug  1 10:43:52] info     : 'k3s-server' process is running with pid 366344
#> [UTC Aug  1 10:44:12] error    : 'k3s-server' process is not running
#> [UTC Aug  1 10:44:12] info     : 'k3s-server' trying to restart

Possible fix

Use monit support for slow process start

https://web.archive.org/web/20110816041503/https://mmonit.com/monit/documentation/monit.html

if 2 restarts within 3 cycles then timeout

SERVICE TIMEOUT

monit provides a service timeout mechanism for situations where a service simply refuses to start or respond over a longer period.

The timeout mechanism is based on the number of service restarts and the number of poll-cycles. For example, if a service had x restarts within y poll-cycles (where x <= y) then Monit will perform an action (for example, unmonitor the service). If a timeout occurs, Monit will send an alert message if you have registered interest for this event.

The syntax for the timeout statement is as follows (keywords are in capital):

IF <number> RESTART <number> CYCLE(S) THEN <action>

Here is an example where Monit will unmonitor the service if it was restarted 2 times within 3 cycles:

if 2 restarts within 3 cycles then unmonitor

To have Monit check the service again after monitoring was disabled, run 'monit monitor <servicename>' from the command line.

Example for setting custom exec on timeout:

if 5 restarts within 5 cycles then exec "/foo/bar"

Example for stopping the service:

if 7 restarts within 10 cycles then stop
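
Applied to this release, the timeout rule could be added to the k3s-server job's monit file. The fragment below is only a sketch: the pidfile path and the `start`/`stop` arguments to `ctl` are assumptions (only the `/var/vcap/jobs/k3s-server/bin/ctl` path appears in the log above), and the 5/5 thresholds are illustrative and would need tuning against the monit poll interval.

```
check process k3s-server
  # Assumed pidfile location; adjust to what the ctl script actually writes.
  with pidfile /var/vcap/sys/run/k3s-server/k3s-server.pid
  # ctl path taken from monit.log; the start/stop arguments are assumptions.
  start program "/var/vcap/jobs/k3s-server/bin/ctl start"
  stop program "/var/vcap/jobs/k3s-server/bin/ctl stop"
  group vcap
  # Stop restarting once the process has been restarted 5 times within
  # 5 poll cycles, so the crash loop surfaces instead of continuing silently.
  if 5 restarts within 5 cycles then unmonitor
```

Once the root cause is fixed, monitoring would be resumed with 'monit monitor k3s-server'.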