nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

RAFT leadership transfers and health check failures #6079

Open slice-arpitkhatri opened 2 weeks ago

slice-arpitkhatri commented 2 weeks ago

Observed behavior

We've observed frequent RAFT leadership transfers of the $MQTT_PUBREL consumers and health check failures, even in a steady state. Occasionally, these issues escalate, causing sharp spikes in leadership transfers and health check failures, which lead to cluster downtime.

During these intense spikes, metrics from NATS Surveyor show an enormous surge in system messages, with counts reaching billions of messages per minute (metric name: nats_core_account_msgs_recv).

System details

  1. Peak load of 5k MQTT clients, each with 2 QoS 2 subscriptions, totaling 10k subscriptions across 10k MQTT topics.
  2. Messages produced at ~10 RPS
  3. A single NATS queue group subscription is used to consume MQTT-published messages on one topic.

Additional details

  1. Cluster of 3 nodes
  2. max_outstanding_catchup: 128MB (rough config sketch below)
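
Roughly, the relevant parts of our server config look like this; a trimmed sketch only, with placeholder hostnames, ports and paths:

```
cluster {
  name: "nats"
  listen: "0.0.0.0:6222"
  routes: [
    "nats://nats-0.nats.svc:6222"   # placeholder pod hostnames
    "nats://nats-1.nats.svc:6222"
    "nats://nats-2.nats.svc:6222"
  ]
}

jetstream {
  store_dir: "/data/jetstream"
  max_outstanding_catchup: 128MB
}

mqtt {
  port: 1883
}
```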

Associated logs:

[Screenshot 2024-11-05 22:29:15: leadership transfer]

nats traffic in steady state (taken minutes after starting the pods):

[Screenshot 2024-11-05 22:21:06]

Attachment: nats-traffic-of-sys-account.txt

Expected behavior

No leadership transfers of consumers & no health check failures in steady state.

Server and client version

NATS Server version 2.10.22

Host environment

Kubernetes v1.25

Steps to reproduce

Set up a 3-node NATS cluster, start 5k MQTT connections with 10k QoS 2 subscriptions (2 per client), and publish QoS 2 messages at ~10 RPS.
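
A rough sketch of one of the clients, using the Eclipse Paho Go client; the broker address, client ID and topic names below are placeholders:

```go
package main

import (
	"fmt"

	mqtt "github.com/eclipse/paho.mqtt.golang"
)

func main() {
	opts := mqtt.NewClientOptions().
		AddBroker("tcp://nats.example.internal:1883"). // placeholder broker address
		SetClientID("load-client-0001").               // unique per client
		SetCleanSession(false)                         // QoS 2 session state must persist

	client := mqtt.NewClient(opts)
	if token := client.Connect(); token.Wait() && token.Error() != nil {
		panic(token.Error())
	}

	// Two QoS 2 subscriptions per client, each on its own topic.
	for _, topic := range []string{"devices/0001/cmd", "devices/0001/cfg"} {
		if token := client.Subscribe(topic, 2, func(_ mqtt.Client, m mqtt.Message) {
			fmt.Printf("got %d bytes on %s\n", len(m.Payload()), m.Topic())
		}); token.Wait() && token.Error() != nil {
			panic(token.Error())
		}
	}

	// Publishing at QoS 2; across all clients the aggregate rate is ~10 msg/s.
	token := client.Publish("devices/0001/cmd", 2, false, "ping")
	token.Wait() // completes only after the full PUBREC/PUBREL/PUBCOMP exchange

	select {} // keep the connection and subscriptions alive
}
```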

neilalexander commented 2 weeks ago

Can you please provide more complete logs from around the times of the problem, as well as server configs?

Do you have account limits and/or max_file/max_mem set?

Normally the only things that should cause leader transfers on streams in normal operation are a) being asked to via a step-down, or b) hitting the configured JetStream system limits.
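
For reference, these are the kinds of settings I mean; a sketch only, with a placeholder account name and example values:

```
# Server-wide JetStream limits
jetstream {
  max_memory_store: 8GB
  max_file_store: 100GB
}

# Per-account JetStream limits, if any are set
accounts {
  APP {
    jetstream {
      max_mem: 1GB
      max_file: 10GB
    }
  }
}
```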

slice-arpitkhatri commented 2 weeks ago

@neilalexander We do not have any account-level limits. max_file_store is 50GB and max_memory_store is 10GB.

I've shared the config file and complete logs over email. Let me know if you want any additional details.

neilalexander commented 2 weeks ago

I've taken a look at the logs you sent through, but it appears the system was already unstable by the start of the logs. Was there a network-level event leading up to this, or any nodes that restarted unexpectedly?

slice-arpitkhatri commented 2 weeks ago

@neilalexander We didn't observe any network-level events. The nodes did restart due to health check failures. I've sent you another email containing additional logs from an hour before the instability occurred. Let me know if that helps or if you have any additional queries.

levb commented 2 weeks ago

I am going to try reproducing this from the MQTT side. The QoS 2-on-JetStream implementation is quite resource intensive (per subscription and per message); this kind of volume might have introduced failures, ultimately blocking the IO (readloop) while waiting for JS responses before acknowledging back to the MQTT clients, as required by the protocol.

slice-arpitkhatri commented 2 weeks ago

@levb I've shared the config file with Neil. Let me know if you need any additional inputs for reproducing this. I can jump on a call as well if required.

slice-arpitkhatri commented 1 week ago

@levb @neilalexander My hunch is that the large amount of RAFT sync required for R3 consumers might be causing the instability in the system. Even in a steady-state scenario we see ~2 million system messages per minute. Let me know your thoughts on this.

@derekcollison Do we have any plans to support R3 file streams with R1 memory consumers?

derekcollison commented 1 week ago

That is supported today. Under the mqtt config section you have the following options to control consumers.

[screenshot of the MQTT config options that control consumer creation]

Config block options just convert to snake_case, e.g. consumer_replicas = 1.
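
In config form that looks roughly like this; example values, and exact option availability may depend on the server version:

```
mqtt {
  port: 1883

  stream_replicas: 3
  consumer_replicas: 1
  consumer_memory_storage: true
  consumer_inactive_threshold: "5m"
}
```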

slice-arpitkhatri commented 1 week ago

@derekcollison I believe the consumer_replicas setting under the MQTT config is currently not in use (the server ignores this config, see this), and that consumer replicas are instead aligned with the parent stream's replica count for interest or workqueue streams (source).

Additionally, we have already set consumer_replicas to 1 in our production cluster, and I can see that the consumers still have a RAFT leader, which wouldn't be the case if this consumer replica override config were functional.
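
Here is roughly how I'm checking that; a sketch using the nats.go JetStream API, with placeholder URL, stream and consumer names:

```go
package main

import (
	"fmt"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://127.0.0.1:4222") // placeholder URL
	if err != nil {
		panic(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		panic(err)
	}

	// Placeholders: substitute the MQTT stream and the PUBREL consumer
	// names as listed by `nats consumer ls`.
	ci, err := js.ConsumerInfo("STREAM_NAME", "CONSUMER_NAME")
	if err != nil {
		panic(err)
	}

	fmt.Println("configured replicas:", ci.Config.Replicas)
	if ci.Cluster != nil {
		fmt.Println("raft leader:", ci.Cluster.Leader)
		for _, p := range ci.Cluster.Replicas {
			fmt.Println("peer:", p.Name, "current:", p.Current, "lag:", p.Lag)
		}
	}
}
```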

Do we have plans to re-introduce this consumer replica override capability?

derekcollison commented 1 week ago

It will work, but yes, if there are retention-based streams backing the MQTT stuff, the system will override it and force the peer sets to be the same.

Is this QoS 2?

levb commented 1 week ago

@derekcollison this ticket is, but @slice-arpitkhatri said they got into this state with QoS 1 as well:

[screenshot of an earlier message reporting the same behaviour with QoS 1]

slice-arpitkhatri commented 1 week ago

Yes, I've faced the issue with both QoS 1 and QoS 2.