nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

JetStream `Consumer assignment not cleaned up, retrying`, degraded cluster performance #2742

Closed: rh2048 closed this issue 2 years ago

rh2048 commented 2 years ago

Defect

When using a lot of ephemeral consumers, a cluster under significant load eventually starts spamming warnings like the sample below:

[1] 2021/12/13 02:44:15.208575 [WRN] Consumer assignment not cleaned up, retrying
[1] 2021/12/13 02:44:15.230045 [WRN] Consumer assignment not cleaned up, retrying
[1] 2021/12/13 02:44:15.241153 [INF] JetStream cluster new consumer leader for 'XXXX > YYYY > ZZZZ'
[1] 2021/12/13 02:44:15.289728 [WRN] Consumer assignment not cleaned up, retrying
[1] 2021/12/13 02:44:15.289972 [WRN] Consumer assignment not cleaned up, retrying
[1] 2021/12/13 02:44:15.300970 [WRN] Consumer assignment not cleaned up, retrying
[1] 2021/12/13 02:44:15.308939 [INF] JetStream cluster new consumer leader for 'XXXX > YYYY > ZZZZ'
[1] 2021/12/13 02:44:15.321413 [WRN] Consumer assignment not cleaned up, retrying
[1] 2021/12/13 02:44:15.359668 [WRN] Consumer assignment not cleaned up, retrying

I'm not sure whether leader migration for ephemeral consumers is normal even when none of the servers have rebooted.

Output of nats str info for an affected stream:

Information for Stream XXXX created 2021-12-10T15:32:39Z

Configuration:

             Subjects: XXXX.>
     Acknowledgements: true
            Retention: File - Limits
             Replicas: 3
       Discard Policy: Old
     Duplicate Window: 2m0s
     Maximum Messages: unlimited
        Maximum Bytes: unlimited
          Maximum Age: 0.00s
 Maximum Message Size: unlimited
    Maximum Consumers: unlimited

Cluster Information:

                 Name: XXXX
               Leader: prod-nats-js-2
              Replica: prod-nats-js-0, current, seen 0.36s ago
              Replica: prod-nats-js-1, current, seen 0.36s ago

State:

             Messages: 308,916
                Bytes: 63 MiB
             FirstSeq: 37,439 @ 2021-12-11T03:57:36 UTC
              LastSeq: 346,354 @ 2021-12-13T03:23:29 UTC
     Active Consumers: 57

Output of nats con info for an affected ephemeral consumer:

Information for Consumer XXXX > 07ix4IRz created 2021-12-13T03:24:25Z

Configuration:

    Delivery Subject: XXXX.deliver.ephemeral.XXXX-34af714a-aa41-4434-8189-a70c80d85af4.01605c32-a7d7-452d-9287-db2b3aa85924
      Filter Subject: XXXX.34af714a-aa41-4434-8189-a70c80d85af4
      Deliver Policy: All
          Ack Policy: None
       Replay Policy: Instant
        Flow Control: false

Cluster Information:

                Name: XXXX
              Leader: XXXX

State:

   Last Delivered Message: Consumer sequence: 0 Stream sequence: 347268
     Unprocessed Messages: 0
          Active Interest: Active


Versions of nats-server and affected client libraries used:

Server: Docker nats:2.6.6

OS/Container environment:

docker-compose Ubuntu 20.04

Steps or code to reproduce the issue:

  1. Start a 3-node JetStream-enabled cluster
  2. Create a stream with replication set to 3
  3. Create a large number of ephemeral consumers, many of which never have any messages delivered (a rough reproduction sketch follows below)
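
A minimal reproduction sketch along these lines using the nats.go client; the server URL, stream name, subjects, and consumer count are placeholders/assumptions, not taken from the report:

package main

import (
    "fmt"
    "time"

    "github.com/nats-io/nats.go"
)

func main() {
    // Connect to any node of the 3-node cluster (URL is a placeholder).
    nc, err := nats.Connect("nats://localhost:4222")
    if err != nil {
        panic(err)
    }
    defer nc.Close()

    js, err := nc.JetStream()
    if err != nil {
        panic(err)
    }

    // Step 2: stream with replication factor 3.
    if _, err := js.AddStream(&nats.StreamConfig{
        Name:     "XXXX",
        Subjects: []string{"XXXX.>"},
        Replicas: 3,
    }); err != nil {
        panic(err)
    }

    // Step 3: many ephemeral push consumers (no durable name), most of which
    // never receive a message because nothing publishes on their filter subject.
    for i := 0; i < 2000; i++ {
        subj := fmt.Sprintf("XXXX.%d", i)
        if _, err := js.Subscribe(subj, func(m *nats.Msg) {},
            nats.DeliverAll(), nats.AckNone()); err != nil {
            panic(err)
        }
    }

    time.Sleep(time.Hour) // keep client-side interest alive while observing the cluster
}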

Expected result:

Server CPU should remain low, no warnings should appear in the console, and ephemeral consumers should be cleaned up rather than migrated.

Actual result:

CPU load steadily increases. CPU sits at effectively 0% until a steady stream of messages (~70 msg/s) is sent through the cluster. The two screenshots below cover the same 6-hour window.

(two screenshots of cluster metrics over the 6-hour window omitted)
ripienaar commented 2 years ago

How many ephemeral consumers did you make?

Ephemeral consumers are always R1, so there shouldn't be any leadership migrations etc.
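
For anyone who wants to verify the replica count, a hedged nats.go sketch that reads a consumer's cluster info; the URL plus the stream and consumer names (taken from the output above) are placeholders:

package main

import (
    "fmt"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect("nats://localhost:4222") // placeholder URL
    if err != nil {
        panic(err)
    }
    defer nc.Close()
    js, _ := nc.JetStream()

    // Stream and consumer names are placeholders from the `nats con info` output above.
    ci, err := js.ConsumerInfo("XXXX", "07ix4IRz")
    if err != nil {
        panic(err)
    }
    // An ephemeral consumer should be R1: a leader and no additional replicas.
    fmt.Println("leader:", ci.Cluster.Leader, "extra replicas:", len(ci.Cluster.Replicas))
}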

rh2048 commented 2 years ago

jetstream_server_total_consumers and nats con ls seem to report completely different numbers (see below). We have 15 durable consumers, but the rest are ephemeral.

❯ nats con ls MM_LOBBY | wc -l
2577
(screenshot omitted: consumer count graph)

Here's the relevant code for creating the consumer config using nats.rs:

conn.create_consumer(
    stream,
    nats::jetstream::ConsumerConfig {
        deliver_subject: Some(deliver_subject),
        durable_name: None,
        deliver_policy: nats::jetstream::DeliverPolicy::All,
        ack_policy: nats::jetstream::AckPolicy::None,
        filter_subject,
        replay_policy: nats::jetstream::ReplayPolicy::Instant,
        ..Default::default()
    },
);

These numbers are pretty rough; I can get back to you with more specifics if that would be of interest.

Edit: Note that the consumer count going back down in the graph is from us disabling all traffic to JetStream. The nats con ls command was executed during the peak time.
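
As a cross-check on the metric discrepancy above, a small hedged sketch using nats.go that counts the consumers the JetStream API itself reports for the stream (stream name from the command above; the URL is a placeholder):

package main

import (
    "fmt"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect("nats://localhost:4222") // placeholder URL
    if err != nil {
        panic(err)
    }
    defer nc.Close()
    js, _ := nc.JetStream()

    // Count consumers reported by the JetStream API for the stream, to compare
    // with jetstream_server_total_consumers and `nats con ls | wc -l`.
    count := 0
    for range js.ConsumerNames("MM_LOBBY") {
        count++
    }
    fmt.Println("consumers reported by the API:", count)
}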

OpenGuidou commented 2 years ago

Hello, same issue observed on my side. Here is sample (Java) code for the ephemeral consumers:

final String consumerSubject = "XXXX.>";
final String specificSubject = "XXXX.YYYY"; // this part is variable for each consumer
final CompletableFuture<Void> future = new CompletableFuture<>();
final MessageHandler msgHandler = msg -> {
    log.info("Received message");
    future.complete(null);
};
final Dispatcher dispatcher = nc.createDispatcher();
final JetStream js = nc.jetStream();
final PushSubscribeOptions options = PushSubscribeOptions.builder()
          .stream(config.getStreamName())
          .configuration(ConsumerConfiguration.builder()
              .filterSubject(specificSubject)
              .build())
          .build();
js.subscribe(consumerSubject, this.config.getQueue(), dispatcher, msgHandler,
    true, options);
future.completeOnTimeout(null, 2, TimeUnit.SECONDS);
future.thenApply(message -> {
  this.nc.closeDispatcher(dispatcher);
  // rest of the process
  return null;
})
.exceptionally(exception -> {
  this.nc.closeDispatcher(dispatcher);
  log.error("Exception occurred", exception);
  return null;
});

Stream config:

Subjects:  "XXXX.>"
Replicas: 3
Max age: 60min
Retention policy: limits
Storage type: file

Each consumer process takes a few milliseconds to execute. After a few minutes of subscriptions being created and closed at a rate of around 100 per second, the issue starts appearing.
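
To make the trigger concrete, a rough Go (nats.go) sketch of the same churn pattern: create a filtered ephemeral push consumer, wait up to two seconds for a message, then tear it down, at roughly 100 new consumers per second. The URL, subjects, and exact rate are assumptions based on the description above:

package main

import (
    "fmt"
    "time"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect("nats://localhost:4222") // placeholder URL
    if err != nil {
        panic(err)
    }
    defer nc.Close()
    js, _ := nc.JetStream()

    i := 0
    for range time.Tick(10 * time.Millisecond) { // ~100 new consumers per second
        i++
        subj := fmt.Sprintf("XXXX.%d", i%1000) // variable filter subject, like XXXX.YYYY above
        go func(subject string) {
            sub, err := js.SubscribeSync(subject, nats.DeliverAll(), nats.AckNone())
            if err != nil {
                return
            }
            // Wait up to 2 seconds for a message, mirroring the Java timeout above.
            if msg, err := sub.NextMsg(2 * time.Second); err == nil {
                _ = msg // process the message here
            }
            // Tear the ephemeral consumer down again by dropping interest.
            _ = sub.Unsubscribe()
        }(subj)
    }
}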

derekcollison commented 2 years ago

For the folks reporting these, were the ephemerals unsubscribed? What code was executed when the ephemeral consumers were done?

OpenGuidou commented 2 years ago

In my case it's the

this.nc.closeDispatcher(dispatcher)

that does the job (in both regular and exceptional outcomes), by unsubscribing all subscriptions associated with the dispatcher and freeing the thread. I tried with an explicit unsubscribe before, with the same result.
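
To double-check the same cleanup path from Go, here is a hedged nats.go sketch (placeholder subject and URL) that records the server-assigned consumer name, unsubscribes, and then verifies that the ephemeral consumer actually disappears:

package main

import (
    "fmt"
    "time"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect("nats://localhost:4222") // placeholder URL
    if err != nil {
        panic(err)
    }
    defer nc.Close()
    js, _ := nc.JetStream()

    // Create a filtered ephemeral push consumer (subject is a placeholder).
    sub, err := js.SubscribeSync("XXXX.YYYY", nats.DeliverAll(), nats.AckNone())
    if err != nil {
        panic(err)
    }

    // Remember the server-assigned consumer name before dropping interest.
    ci, err := sub.ConsumerInfo()
    if err != nil {
        panic(err)
    }
    name := ci.Name

    // Drop client-side interest; the ephemeral consumer should go away.
    _ = sub.Unsubscribe()

    // Poll for a few seconds; a lookup error ("consumer not found" or similar)
    // means the server cleaned it up as expected.
    for i := 0; i < 10; i++ {
        time.Sleep(500 * time.Millisecond)
        if _, err := js.ConsumerInfo(ci.Stream, name); err != nil {
            fmt.Println("consumer removed:", err)
            return
        }
    }
    fmt.Println("consumer", name, "still present after unsubscribe")
}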

derekcollison commented 2 years ago

OK thanks, digging into this one today. Thanks for the patience.

derekcollison commented 2 years ago

I now have an idea of what is going on. We have a bit of a perfect storm: many goroutines become blocked on disk IO and get removed from the runnable pool. We then also have a run on the bank, with many goroutines becoming runnable at once, which causes the Go runtime to create new threads until we hit the default 10k limit.

I have some ideas that I will test out after a bit more thinking on this, but this will happen today. Will keep this thread updated.
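
For readers unfamiliar with the mechanism being described, a small standalone Go sketch (illustrative only, not nats-server code): goroutines blocked in disk-IO syscalls each pin an OS thread, while the runtime keeps creating new threads for runnable goroutines, up to the default limit of 10,000 that debug.SetMaxThreads controls:

package main

import (
    "fmt"
    "os"
    "runtime/debug"
    "runtime/pprof"
    "sync"
    "time"
)

func main() {
    // The Go runtime refuses to create more than 10,000 OS threads by default;
    // SetMaxThreads returns the previous limit.
    fmt.Println("previous max-threads limit:", debug.SetMaxThreads(10000))

    // Simulate many goroutines blocked on disk IO: each blocking syscall
    // (write + fsync here) occupies an OS thread for its duration.
    var wg sync.WaitGroup
    for i := 0; i < 500; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            f, err := os.CreateTemp("", "blocked-io-")
            if err != nil {
                return
            }
            defer os.Remove(f.Name())
            defer f.Close()
            buf := make([]byte, 4096)
            for j := 0; j < 200; j++ {
                f.Write(buf)
                f.Sync() // blocking syscall: this thread cannot run other goroutines
            }
        }()
    }

    // Meanwhile the runtime spins up additional threads so that runnable
    // goroutines (like this one) can still make progress.
    done := make(chan struct{})
    go func() { wg.Wait(); close(done) }()
    for {
        select {
        case <-done:
            fmt.Println("total OS threads created:", pprof.Lookup("threadcreate").Count())
            return
        case <-time.After(time.Second):
            fmt.Println("OS threads created so far:", pprof.Lookup("threadcreate").Count())
        }
    }
}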

derekcollison commented 2 years ago

Posted a PR with a fix, hopefully. I could recreate the bad behavior artificially well enough that I think this will solve it. Once it lands in main, hopefully tomorrow morning PT, we will cut a new nightly image for folks to try out.

derekcollison commented 2 years ago

This has landed and I manually kicked our nightly build process. Please test and report any issues. And thanks again for the patience.

OpenGuidou commented 2 years ago

OK, thanks. Is there an image version we can test it with?

derekcollison commented 2 years ago

https://hub.docker.com/r/synadia/nats-server always has a nightly build.

OpenGuidou commented 2 years ago

I was able to test with the latest version. At 10 sub/unsub per second it's OK. At 100 per second, the issue still occurs after about one minute. Let me know if I can provide more information about anything specific.

derekcollison commented 2 years ago

What occurs specifically? The log statements about consumers not being cleaned up, or does the system become unresponsive?

What are the memory and CPU usage at this time? Is it balanced or is one server spiking?