Some specific service handling questions

hongkongkiwi commented 3 years ago

I have a couple of questions regarding services that I couldn't figure out.

Is there a way to make s6-rc wait a specific time before restarting a service? I'm using an embedded processor, and right now if a service fails it tries to restart immediately. I would like to set a delay appropriate to the service before it retries. This way, if lots of services fail at once for some reason, they don't hammer the system. For example, I have an internet connection monitoring daemon; if it fails, I'm OK with waiting 5s between retries.
Is there a way to differentiate between an unrecoverable error and a recoverable one? As in the example above, I have an internet connection daemon which might fail because a config file does not exist. I'd prefer to have it restart only if the error code is below or above a certain number, e.g. if the error code is < 100 then keep supervising and restart; if it's above 100 then don't supervise.
Is there some kind of failure count mechanism? There are some services I would like to retry 10 times and then give up on. Once the service is running, I would like to supervise it. So this is a kind of semi-supervision state.

I'm sure there are some clever ways to address my issues above, any ideas on how I could handle these cases?

skarnet commented 3 years ago

Those are really s6 questions, since they aren't about service dependencies and ordering, and only involve longruns. They are all addressed at the s6 level - but if you need to add a file to an s6 service directory, the same file will work in an s6-rc source definition directory for the longrun, so the answers work for s6-rc as well.

s6 never busyloops services, there is always at least one second between retries.
Several services failing at once and being restarted at the same time is never a problem in practice. But if you have to, you could add a finish script that sleeps for 5 seconds. If you need more than 5 seconds, you will also need to extend the authorized running time for finish, via a timeout-finish file. The finish script and timeout-finish file, as well as other configuration knobs that you may want to use, are documented on the s6 service directory documentation page.
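A minimal sketch of the delay approach, assuming a hypothetical service directory called netmon (the name, the SVCDIR variable and the 10000 value are illustrative, not from this thread). It writes a finish script that sleeps 5 seconds, plus a timeout-finish file, in milliseconds, with enough margin that finish is not killed before the sleep ends:

```sh
#!/bin/sh
# Hypothetical location: point SVCDIR at your own s6 service directory
# or s6-rc source definition directory for the longrun.
svcdir="${SVCDIR:-./netmon}"
mkdir -p "$svcdir"

# finish runs after the daemon dies; the service is restarted once it exits.
cat > "$svcdir/finish" <<'EOF'
#!/bin/sh
sleep 5
EOF
chmod +x "$svcdir/finish"

# Allow finish up to 10 seconds. The value is in milliseconds and the
# margin chosen here is arbitrary.
echo 10000 > "$svcdir/timeout-finish"
```

The sleep length and the timeout should be tuned together: the timeout has to stay comfortably above the sleep.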
To change behaviours depending on the exit code of a daemon, you need a finish script. finish runs with two arguments, the first of which is the exit code of the program. You can script the behaviour you want there. If you want s6 to fail the service permanently and stop supervising it, simply have your finish script exit 125.
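As a rough sketch of the exit-code idea, the script below generates a finish script that gives up permanently when the daemon exits with a code above 100 and lets s6 restart it otherwise. The netmon directory, the SVCDIR variable and the threshold are assumptions taken from the question, not anything s6 prescribes. It also assumes that the first argument is 256 when the daemon was killed by a signal, in which case the service is simply restarted:

```sh
#!/bin/sh
# Hypothetical service directory; adjust to your setup.
svcdir="${SVCDIR:-./netmon}"
mkdir -p "$svcdir"

cat > "$svcdir/finish" <<'EOF'
#!/bin/sh
# $1: exit code of the daemon (assumed to be 256 if killed by a signal)
# $2: signal number, only meaningful when $1 is 256
code="$1"
if [ "$code" -ne 256 ] && [ "$code" -gt 100 ]; then
  # Illustrative policy: codes above 100 are treated as unrecoverable.
  exit 125   # tells s6 to fail the service permanently
fi
exit 0       # any other outcome: let s6 restart the service
EOF
chmod +x "$svcdir/finish"
```

The daemon itself has to cooperate by reserving a range of exit codes for unrecoverable errors such as a missing config file.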
s6's death tally system allows you to tailor behaviour according to a failure count. You can use the s6-permafailon program to leverage it - in the finish script again.
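A hedged sketch of how s6-permafailon might be wired into finish for the "retry 10 times, then give up" case. The argument order used here (seconds, death count, events, then a program to run) and the 0-255 exit-code range are written from memory of the s6-permafailon documentation and should be checked against it; the netmon directory, the SVCDIR variable, the 600-second window and the count of 10 are illustrative assumptions:

```sh
#!/bin/sh
# Hypothetical service directory; adjust to your setup.
svcdir="${SVCDIR:-./netmon}"
mkdir -p "$svcdir"

cat > "$svcdir/finish" <<'EOF'
#!/bin/sh
# Assumed usage: s6-permafailon secs deathcount events prog...
# Intent: if the service died at least 10 times in the last 600 seconds
# with any exit code in 0-255, fail permanently; otherwise run "true"
# so that finish exits 0 and the service is restarted.
exec s6-permafailon 600 10 0-255 true
EOF
chmod +x "$svcdir/finish"
```

Because the count is taken over a time window, a service that has been running well for longer than the window starts again with a clean slate, which is close to the semi-supervision behaviour asked about.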

hongkongkiwi commented 3 years ago

Thanks for the suggestions! I will explore these. That helps a lot.

I didn't realise that finish could be used in this way, and hadn't seen s6-permafailon before.

skarnet / s6-rc #6