hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/
Other
14.76k stars 1.94k forks source link

Improve "grece" period for "check". #23661

Open EugenKon opened 1 month ago

EugenKon commented 1 month ago

Proposal

grace 30s VS 5m (in case of database upgrade).

My interpretation of documentation is that grace instructs Nomad to not check the health of a service for this period at all. The logic can be summarized as follows:

Start service
Wait "grace" period
Start monitoring health check

I expected it is implemented a different "grace" strategy: discard failed health check while in grace period. The logic can be summarized as follows:

Start service
While in "grace" period:
    Run health check logic
    If OK, break
Start monitoring health check

The difference is that in the first case Nomad always waits for 5 minutes, even if the service is ready and healthy 30 seconds later. Whereas in the second case, Nomad will mark the service as healthy as soon as it passes the health criteria -- 30 seconds after its start.

jrasell commented 1 month ago

Hi @EugenKon, could you provide the links to the documentation you are referring to please, so I can ensure I am correctly comparing it with what you have detailed? Thanks.

EugenKon commented 1 month ago

@jrasell This one:

https://developer.hashicorp.com/nomad/docs/job-specification/check_restart#example-behavior

When the server task first starts and is registered in Consul, its health will not be checked for 90 seconds. This gives the server time to startup.