I'd like the NATS server to return exit code 0 (success) after graceful shutdown, instead of exit code 1 (failure).
Use case
It should be possible to monitor the NATS server for failures, without false positives on every shutdown. Unfortunately a real failure and a successful graceful shutdown both indicate the same status to the OS.
In #3514, @kozlovic suggested a workaround of sending a SIGINT instead of SIGTERM to start graceful shutdown; on SIGINT nats will correctly return 0. Unfortunately this looks painful to achieve, at least on AWS Elastic Container Service:
A 2021 AWS blog post says "Today, ECS always sends a SIGTERM, but in the future you will be able to override this by adding the STOPSIGNAL directive in your Dockerfile and/or task definition." AFAICT, that future has not happened yet.
I could build my own Docker image on top of the official one which adds a shell wrapper that traps SIGTERM and sends SIGINT to the nats process, but this is another thing to maintain as new nats images come out.
So I'd like to re-examine the rationale for returning exit code 1 on SIGTERM. In #2103, @wallyqs wrote:
Before 2.2 series, the TERM signal used to not be handled by the server, so it would not have been a clean exit. In 2.2, it was changed to process TERM signal as a clean exit but this affects the behavior of some tools that were expecting TERM to be exit 1.
and in #3514, @kozlovic wrote:
Although I agree that SIGTERM is supposed to be a normal exit, there is history in why we need to return status 1, and it has to do with some tooling or k8s if I recall correctly. In some situations, if we were to exit(0), the process would not be restarted by manage services.
I suspect the issue is that some folks have misconfigured Kubernetes. It see it has a restartPolicy configuration attribute; I would recommend setting it to Always for nats. Then nats can return status 0 (allowing correct monitoring) and will still be restarted correctly.
As for how to proceed, I see a couple viable approaches:
My preference: rip off the bandaid, make the change to correctly return 0. Call it out in the release notes, with a note to check that on Kubernetes restartPolicy is set to Always. Or, if nats maintainers deem that too risky (e.g. if you think folks will miss this note, or you fear there are other orchestrators affected that folks will not know how to configure correctly)...
Alternatively: conservatively introduce a commandline flag to switch the behavior: e.g. --shutdown_exit_code with options legacy (current behavior: 1 on SIGTERM, 0 on SIGINT), 1, or 0, defaulting to legacy. Ability to control commandline flags is universal; ability to control the signal used for graceful termination is not.
Proposed change
I'd like the NATS server to return exit code 0 (success) after graceful shutdown, instead of exit code 1 (failure).
Use case
It should be possible to monitor the NATS server for failures, without false positives on every shutdown. Unfortunately a real failure and a successful graceful shutdown both indicate the same status to the OS.
In #3514, @kozlovic suggested a workaround of sending a
SIGINT
instead ofSIGTERM
to start graceful shutdown; onSIGINT
nats will correctly return 0. Unfortunately this looks painful to achieve, at least on AWS Elastic Container Service:SIGTERM
, but in the future you will be able to override this by adding theSTOPSIGNAL
directive in your Dockerfile and/or task definition." AFAICT, that future has not happened yet.SIGTERM
and sendsSIGINT
to the nats process, but this is another thing to maintain as new nats images come out.So I'd like to re-examine the rationale for returning exit code 1 on
SIGTERM
. In #2103, @wallyqs wrote:and in #3514, @kozlovic wrote:
I suspect the issue is that some folks have misconfigured Kubernetes. It see it has a restartPolicy configuration attribute; I would recommend setting it to
Always
for nats. Then nats can return status 0 (allowing correct monitoring) and will still be restarted correctly.As for how to proceed, I see a couple viable approaches:
restartPolicy
is set toAlways
. Or, if nats maintainers deem that too risky (e.g. if you think folks will miss this note, or you fear there are other orchestrators affected that folks will not know how to configure correctly)...--shutdown_exit_code
with optionslegacy
(current behavior: 1 onSIGTERM
, 0 onSIGINT
),1
, or0
, defaulting tolegacy
. Ability to control commandline flags is universal; ability to control the signal used for graceful termination is not.Contribution
Sure!