nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io

NATS server panics when creating many consumers #4831

Closed by aimichelle 8 months ago

aimichelle commented 11 months ago

Observed behavior

We have an application that uses the nats.go/jetstream client to create 4 streams. We process these streams with 4096 partitions, so we created 4096 consumers for each stream, one per partition. The consumers are also created via the client.
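
For context, the creation path looks roughly like the sketch below (stream names, subjects, and the partition-to-subject mapping here are placeholders, not our exact code):

    package main

    import (
        "context"
        "fmt"
        "time"

        "github.com/nats-io/nats.go"
        "github.com/nats-io/nats.go/jetstream"
    )

    func main() {
        nc, err := nats.Connect(nats.DefaultURL)
        if err != nil {
            panic(err)
        }
        defer nc.Drain()

        js, err := jetstream.New(nc)
        if err != nil {
            panic(err)
        }

        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
        defer cancel()

        const numStreams = 4
        const numPartitions = 4096

        for s := 0; s < numStreams; s++ {
            // Placeholder stream name; each stream captures all of its partition subjects.
            streamName := fmt.Sprintf("events_%d", s)
            stream, err := js.CreateStream(ctx, jetstream.StreamConfig{
                Name:     streamName,
                Subjects: []string{streamName + ".>"},
                Replicas: 3,
                Storage:  jetstream.FileStorage,
            })
            if err != nil {
                panic(err)
            }

            // One durable consumer per partition, filtered to that partition's subject.
            for p := 0; p < numPartitions; p++ {
                _, err := stream.CreateOrUpdateConsumer(ctx, jetstream.ConsumerConfig{
                    Durable:       fmt.Sprintf("part_%d", p),
                    FilterSubject: fmt.Sprintf("%s.%d", streamName, p),
                    AckPolicy:     jetstream.AckExplicitPolicy,
                })
                if err != nil {
                    panic(err)
                }
            }
        }
    }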

While the consumers were being created, we got a panic on the nats-server. The panic log is quite long, so the log we got from Kubernetes is truncated at the beginning. The log is attached here: https://drive.google.com/file/d/1kfrZwoe9HR2P1SJ-DumdgoyYmA9QTJy4/view?usp=sharing (it was too long for GitHub).

Please also let us know if this is a proper usage of jetstream.

Expected behavior

No panic, and all streams and consumers are created and functioning.

Server and client version

server: 2.10.5, nats.go: 1.31.0

Host environment

This was running in a GKE cluster: 1.28.3-gke.1118000 with n2d-standard-8 nodes running Ubuntu with containerd.

Steps to reproduce

  1. Deploy NATS/Jetstream to Kubernetes.
  2. Use the nats.go/jetstream client to create 4 streams and 4096 consumers for each stream.

derekcollison commented 11 months ago

Could it have run out of memory?

vihangm commented 11 months ago

> Could it have run out of memory?

Hmm, that's certainly a possibility. The memory pressure on these nodes is somewhat high. Let me try again after adding a node or two to the cluster.

vihangm commented 11 months ago

Increased the resource limits (note that this sizing is for a dev cluster with low traffic; prod will be sized up significantly more):

  container:
    env:
      GOMEMLIMIT: 6750MiB
    merge:
      resources:
        requests:
          cpu: 1
          memory: 5Gi
        limits:
          cpu: 1.5
          memory: 7.5Gi

Still seeing plenty of issues even though there are no messages in any of the streams. No more crashes, but I can't get stream reports:

~ # nats stream report
Obtaining Stream stats

nats: error: context deadline exceeded

The streams are all R3 and file-based; the cluster has 3 nodes. Any suggestions on sizing, and/or is this number of consumers simply not recommended?

derekcollison commented 11 months ago

How many consumers are you trying to create?

Are they all inheriting R3 from the stream? Meaning they are HA assets?

vihangm commented 11 months ago

> How many consumers are you trying to create?
>
> Are they all inheriting R3 from the stream? Meaning they are HA assets?

It's currently 4096 consumers per stream with 4 streams. So about 16k consumers. I believe all of them do inherit R3.

derekcollison commented 11 months ago

That is a lot of HA assets for the system. Each one is a complete NRG (RAFT) group underneath. Heartbeats alone would be ~16k msgs/sec, not to mention the memory footprint.
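
If not every consumer needs to be HA, the replica count inherited from the stream can be overridden per consumer. A minimal sketch using the same nats.go/jetstream API as above (names and subjects are placeholders):

    package example

    import (
        "context"
        "fmt"

        "github.com/nats-io/nats.go/jetstream"
    )

    // createR1Consumer overrides the replica count inherited from an R3 stream,
    // so the consumer is a single-replica (R1) asset rather than an HA one.
    func createR1Consumer(ctx context.Context, stream jetstream.Stream, part int) (jetstream.Consumer, error) {
        return stream.CreateOrUpdateConsumer(ctx, jetstream.ConsumerConfig{
            Durable:       fmt.Sprintf("part_%d", part),
            FilterSubject: fmt.Sprintf("events_0.%d", part),
            AckPolicy:     jetstream.AckExplicitPolicy,
            Replicas:      1, // single replica: not an HA asset
        })
    }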

We consult with our customers on how best to use the system to achieve their goals; that might be something to consider.