nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

Healthcheck fails when JetStream account is removed from configuration #5459

Open pcsegal opened 2 months ago

pcsegal commented 2 months ago

Observed behavior

When I do the following:

[970687] 2024/05/20 15:38:10.744645 [WRN] Healthcheck failed: "JetStream can not lookup account \"acc1\": account missing"

Expected behavior

NATS should still load if an account is removed from the configuration.

Server and client version

Server: 2.10.14 Client: 0.1.1

Host environment

Ubuntu 20.04, amd64.

Steps to reproduce

Here is a gist with an example reproducing the issue:

https://gist.github.com/pcsegal/532d15b827d9b13f8a1456e95f1ebc52

The script test-cluster.sh and the accompanying files should all be in the same directory.

The script runs through the described scenario.

In the end, the node in which the KV bucket was placed should be unable to load. It should show the following warning in the logs:

[970687] 2024/05/20 15:38:10.744645 [WRN] Healthcheck failed: "JetStream can not lookup account \"acc1\": account missing"

In turn, the other nodes will show the following warning:

Update Stream Account acc1, error on lookup: account missing

derekcollison commented 2 months ago

As the system user, run the following:

nats server account purge acc1

pcsegal commented 2 months ago

Thank you.

If I understand correctly, this needs to be run before I remove the account from the configuration, right?

derekcollison commented 2 months ago

It can be run at any time. If you run it now, it will instruct the system to remove any JetStream artifacts from that account that are still on the system.

pcsegal commented 1 month ago

Thank you.

So, in a situation where accounts represent tenancies that can be decommissioned, forgetting to purge the account first could lead to downtime if some stream with only 1 replica happens to live on the node that failed the healthcheck.

If I want to automate account purging, can something like the NACK operator help here? I see that NACK allows managing accounts via CRDs.

Jarema commented 1 month ago

NACK does not allow purging accounts.

However, you can achieve that programmatically by sending a request to $JS.API.ACCOUNT.PURGE.{ACC_NAME} from a client connected as a System account user. That achieves the same result as the CLI call.
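
A minimal sketch of that request in Python, assuming the third-party nats-py client. The helper names (purge_subject, purge_account), the server URL, and the credentials are hypothetical and only for illustration; the subject is the one given above. The reply is returned as raw bytes because its exact format is not described in this thread.

```python
PURGE_PREFIX = "$JS.API.ACCOUNT.PURGE"


def purge_subject(account: str) -> str:
    """Build the purge subject for one account name (hypothetical helper)."""
    # A NATS subject token cannot be empty or contain whitespace, '.', '*' or '>'.
    if not account or any(c.isspace() or c in ".*>" for c in account):
        raise ValueError(f"invalid account name: {account!r}")
    return f"{PURGE_PREFIX}.{account}"


async def purge_account(account: str, url: str, user: str, password: str) -> bytes:
    """Send the purge request as a system-account user and return the raw reply."""
    import nats  # third-party client, assumed installed: pip install nats-py

    # Assumption: user/password belong to a user in the system account.
    nc = await nats.connect(url, user=user, password=password)
    try:
        msg = await nc.request(purge_subject(account), b"", timeout=5)
        return msg.data
    finally:
        await nc.close()


# Example call (hypothetical URL and credentials):
#   import asyncio
#   reply = asyncio.run(
#       purge_account("acc1", "nats://localhost:4222", "sys", "sys-password"))
#   print(reply.decode())
```

An automation that decommissions tenants could call purge_account for each removed account. As noted earlier in the thread, the purge can be run at any time, including after the account has already been removed from the configuration.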

pcsegal commented 1 month ago

Thank you; how about purging streams? Would NACK help with purging individual streams (rather than the entire account) when a stream CRD is deleted?