nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0
16.02k stars 1.41k forks source link

return success on graceful shutdown #6083

Open scottlamb opened 2 weeks ago

scottlamb commented 2 weeks ago

Proposed change

I'd like the NATS server to return exit code 0 (success) after graceful shutdown, instead of exit code 1 (failure).

Use case

It should be possible to monitor the NATS server for failures, without false positives on every shutdown. Unfortunately a real failure and a successful graceful shutdown both indicate the same status to the OS.

In #3514, @kozlovic suggested a workaround of sending a SIGINT instead of SIGTERM to start graceful shutdown; on SIGINT nats will correctly return 0. Unfortunately this looks painful to achieve, at least on AWS Elastic Container Service:

So I'd like to re-examine the rationale for returning exit code 1 on SIGTERM. In #2103, @wallyqs wrote:

Before 2.2 series, the TERM signal used to not be handled by the server, so it would not have been a clean exit. In 2.2, it was changed to process TERM signal as a clean exit but this affects the behavior of some tools that were expecting TERM to be exit 1.

and in #3514, @kozlovic wrote:

Although I agree that SIGTERM is supposed to be a normal exit, there is history in why we need to return status 1, and it has to do with some tooling or k8s if I recall correctly. In some situations, if we were to exit(0), the process would not be restarted by manage services.

I suspect the issue is that some folks have misconfigured Kubernetes. It see it has a restartPolicy configuration attribute; I would recommend setting it to Always for nats. Then nats can return status 0 (allowing correct monitoring) and will still be restarted correctly.

As for how to proceed, I see a couple viable approaches:

  1. My preference: rip off the bandaid, make the change to correctly return 0. Call it out in the release notes, with a note to check that on Kubernetes restartPolicy is set to Always. Or, if nats maintainers deem that too risky (e.g. if you think folks will miss this note, or you fear there are other orchestrators affected that folks will not know how to configure correctly)...
  2. Alternatively: conservatively introduce a commandline flag to switch the behavior: e.g. --shutdown_exit_code with options legacy (current behavior: 1 on SIGTERM, 0 on SIGINT), 1, or 0, defaulting to legacy. Ability to control commandline flags is universal; ability to control the signal used for graceful termination is not.

Contribution

Sure!