nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

JetStream `Consumer assignment not cleaned up, retrying`, degraded cluster performance #2742

Closed: rh2048 closed this issue 2 years ago

rh2048 commented 2 years ago

Defect

When using a lot of ephemeral consumers, a cluster under significant load eventually starts spamming warnings like the sample below:

[1] 2021/12/13 02:44:15.208575 [WRN] Consumer assignment not cleaned up, retrying
[1] 2021/12/13 02:44:15.230045 [WRN] Consumer assignment not cleaned up, retrying
[1] 2021/12/13 02:44:15.241153 [INF] JetStream cluster new consumer leader for 'XXXX > YYYY > ZZZZ'
[1] 2021/12/13 02:44:15.289728 [WRN] Consumer assignment not cleaned up, retrying
[1] 2021/12/13 02:44:15.289972 [WRN] Consumer assignment not cleaned up, retrying
[1] 2021/12/13 02:44:15.300970 [WRN] Consumer assignment not cleaned up, retrying
[1] 2021/12/13 02:44:15.308939 [INF] JetStream cluster new consumer leader for 'XXXX > YYYY > ZZZZ'
[1] 2021/12/13 02:44:15.321413 [WRN] Consumer assignment not cleaned up, retrying
[1] 2021/12/13 02:44:15.359668 [WRN] Consumer assignment not cleaned up, retrying

I'm not sure whether leader migration for ephemeral consumers is normal even when none of the servers have rebooted.

Output of nats str info for an affected stream:

Information for Stream XXXX created 2021-12-10T15:32:39Z

Configuration:

             Subjects: XXXX.>
     Acknowledgements: true
            Retention: File - Limits
             Replicas: 3
       Discard Policy: Old
     Duplicate Window: 2m0s
     Maximum Messages: unlimited
        Maximum Bytes: unlimited
          Maximum Age: 0.00s
 Maximum Message Size: unlimited
    Maximum Consumers: unlimited

Cluster Information:

                 Name: XXXX
               Leader: prod-nats-js-2
              Replica: prod-nats-js-0, current, seen 0.36s ago
              Replica: prod-nats-js-1, current, seen 0.36s ago

State:

             Messages: 308,916
                Bytes: 63 MiB
             FirstSeq: 37,439 @ 2021-12-11T03:57:36 UTC
              LastSeq: 346,354 @ 2021-12-13T03:23:29 UTC
     Active Consumers: 57

Output of nats con info for an affected ephemeral consumer:

Information for Consumer XXXX > 07ix4IRz created 2021-12-13T03:24:25Z

Configuration:

    Delivery Subject: XXXX.deliver.ephemeral.XXXX-34af714a-aa41-4434-8189-a70c80d85af4.01605c32-a7d7-452d-9287-db2b3aa85924
      Filter Subject: XXXX.34af714a-aa41-4434-8189-a70c80d85af4
      Deliver Policy: All
          Ack Policy: None
       Replay Policy: Instant
        Flow Control: false

Cluster Information:

                Name: XXXX
              Leader: XXXX

State:

   Last Delivered Message: Consumer sequence: 0 Stream sequence: 347268
     Unprocessed Messages: 0
          Active Interest: Active


Versions of nats-server and affected client libraries used:

Server: Docker nats:2.6.6

OS/Container environment:

docker-compose Ubuntu 20.04

Steps or code to reproduce the issue:

  1. Start a 3-node JetStream-enabled cluster
  2. Create a stream with replication set to 3
  3. Create a large number of ephemeral consumers, many of which never have any messages delivered (a rough reproduction sketch follows below)
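
A minimal reproduction sketch along these lines using the nats.go client; the server URL, stream name, subjects, and consumer count are placeholders/assumptions, not taken from the report:

package main

import (
    "fmt"
    "time"

    "github.com/nats-io/nats.go"
)

func main() {
    // Connect to any node of the 3-node cluster (URL is a placeholder).
    nc, err := nats.Connect("nats://localhost:4222")
    if err != nil {
        panic(err)
    }
    defer nc.Close()

    js, err := nc.JetStream()
    if err != nil {
        panic(err)
    }

    // Step 2: stream with replication factor 3.
    if _, err := js.AddStream(&nats.StreamConfig{
        Name:     "XXXX",
        Subjects: []string{"XXXX.>"},
        Replicas: 3,
    }); err != nil {
        panic(err)
    }

    // Step 3: many ephemeral push consumers (no durable name), most of which
    // never receive a message because nothing publishes on their filter subject.
    for i := 0; i < 2000; i++ {
        subj := fmt.Sprintf("XXXX.%d", i)
        if _, err := js.Subscribe(subj, func(m *nats.Msg) {},
            nats.DeliverAll(), nats.AckNone()); err != nil {
            panic(err)
        }
    }

    time.Sleep(time.Hour) // keep client-side interest alive while observing the cluster
}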

Expected result:

Server CPU should remain low, no warnings should appear in the console, and ephemeral consumers should be cleaned up rather than migrated.

Actual result:

CPU load steadily increases. CPU sits at effectively 0% until a steady stream of messages (~70 msg/s) is sent through the cluster. The two screenshots below cover the same 6-hour window.

(two screenshots of cluster metrics over the 6-hour window omitted)
ripienaar commented 2 years ago

How many ephemeral consumers did you make?

Ephemeral consumers are always R1, so there shouldn't be any leadership migrations etc.
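
For anyone who wants to verify the replica count, a hedged nats.go sketch that reads a consumer's cluster info; the URL plus the stream and consumer names (taken from the output above) are placeholders:

package main

import (
    "fmt"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect("nats://localhost:4222") // placeholder URL
    if err != nil {
        panic(err)
    }
    defer nc.Close()
    js, _ := nc.JetStream()

    // Stream and consumer names are placeholders from the `nats con info` output above.
    ci, err := js.ConsumerInfo("XXXX", "07ix4IRz")
    if err != nil {
        panic(err)
    }
    // An ephemeral consumer should be R1: a leader and no additional replicas.
    fmt.Println("leader:", ci.Cluster.Leader, "extra replicas:", len(ci.Cluster.Replicas))
}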

rh2048 commented 2 years ago

jetstream_server_total_consumers and nats con ls seem to report completely different numbers (see below). We have 15 durable consumers, but the rest are ephemeral.

❯ nats con ls MM_LOBBY | wc -l
2577
(screenshot omitted: consumer count graph)

Here's the relevant code for creating the consumer config using nats.rs:

conn.create_consumer(
    stream,
    nats::jetstream::ConsumerConfig {
        deliver_subject: Some(deliver_subject),
        durable_name: None,
        deliver_policy: nats::jetstream::DeliverPolicy::All,
        ack_policy: nats::jetstream::AckPolicy::None,
        filter_subject,
        replay_policy: nats::jetstream::ReplayPolicy::Instant,
        ..Default::default()
    },
);

These numbers are pretty rough; I can get back to you with more specifics if that would be of interest.

Edit: Note that the consumer count going back down in the graph is from us disabling all traffic to JetStream. The nats con ls command was executed during the peak time.
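
As a cross-check on the metric discrepancy above, a small hedged sketch using nats.go that counts the consumers the JetStream API itself reports for the stream (stream name from the command above; the URL is a placeholder):

package main

import (
    "fmt"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect("nats://localhost:4222") // placeholder URL
    if err != nil {
        panic(err)
    }
    defer nc.Close()
    js, _ := nc.JetStream()

    // Count consumers reported by the JetStream API for the stream, to compare
    // with jetstream_server_total_consumers and `nats con ls | wc -l`.
    count := 0
    for range js.ConsumerNames("MM_LOBBY") {
        count++
    }
    fmt.Println("consumers reported by the API:", count)
}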

OpenGuidou commented 2 years ago

Hello, same issue observed on my side. Here is sample (Java) code for the ephemeral consumers:

final String consumerSubject = "XXXX.>";
final String specificSubject = "XXXX.YYYY"; // this part is variable for each consumer
final CompletableFuture<Void> future = new CompletableFuture<>();
final MessageHandler msgHandler = msg -> {
    log.info("Received message");
    future.complete(null);
};
final Dispatcher dispatcher = nc.createDispatcher();
final JetStream js = nc.jetStream();
final PushSubscribeOptions options = PushSubscribeOptions.builder()
          .stream(config.getStreamName())
          .configuration(ConsumerConfiguration.builder()
              .filterSubject(specificSubject)
              .build())
          .build();
js.subscribe(consumerSubject, this.config.getQueue(), dispatcher, msgHandler,
    true, options);
future.completeOnTimeout(null, 2, TimeUnit.SECONDS);
future.thenApply(message -> {
  this.nc.closeDispatcher(dispatcher);
  // rest of the process
  return null;
})
.exceptionally(exception -> {
  this.nc.closeDispatcher(dispatcher);
  log.error("Exception occurred", exception);
  return null;
});

Stream config:

Subjects:  "XXXX.>"
Replicas: 3
Max age: 60min
Retention policy: limits
Storage type: file

Each consumer process takes a few milliseconds to execute. After a few minutes of subscriptions being created and closed at a rate of around 100 per second, the issue starts appearing.
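
To make the trigger concrete, a rough Go (nats.go) sketch of the same churn pattern: create a filtered ephemeral push consumer, wait up to two seconds for a message, then tear it down, at roughly 100 new consumers per second. The URL, subjects, and exact rate are assumptions based on the description above:

package main

import (
    "fmt"
    "time"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect("nats://localhost:4222") // placeholder URL
    if err != nil {
        panic(err)
    }
    defer nc.Close()
    js, _ := nc.JetStream()

    i := 0
    for range time.Tick(10 * time.Millisecond) { // ~100 new consumers per second
        i++
        subj := fmt.Sprintf("XXXX.%d", i%1000) // variable filter subject, like XXXX.YYYY above
        go func(subject string) {
            sub, err := js.SubscribeSync(subject, nats.DeliverAll(), nats.AckNone())
            if err != nil {
                return
            }
            // Wait up to 2 seconds for a message, mirroring the Java timeout above.
            if msg, err := sub.NextMsg(2 * time.Second); err == nil {
                _ = msg // process the message here
            }
            // Tear the ephemeral consumer down again by dropping interest.
            _ = sub.Unsubscribe()
        }(subj)
    }
}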

derekcollison commented 2 years ago

For the folks reporting these, were the ephemerals unsubscribed? What code was executed when the ephemeral consumers were done?

OpenGuidou commented 2 years ago

In my case it's the

this.nc.closeDispatcher(dispatcher)

that does the job (in both regular and exceptional outcomes), by unsubscribing all subscriptions associated with the dispatcher and freeing the thread. I tried with an explicit unsubscribe before, with the same result.
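
To double-check the same cleanup path from Go, here is a hedged nats.go sketch (placeholder subject and URL) that records the server-assigned consumer name, unsubscribes, and then verifies that the ephemeral consumer actually disappears:

package main

import (
    "fmt"
    "time"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect("nats://localhost:4222") // placeholder URL
    if err != nil {
        panic(err)
    }
    defer nc.Close()
    js, _ := nc.JetStream()

    // Create a filtered ephemeral push consumer (subject is a placeholder).
    sub, err := js.SubscribeSync("XXXX.YYYY", nats.DeliverAll(), nats.AckNone())
    if err != nil {
        panic(err)
    }

    // Remember the server-assigned consumer name before dropping interest.
    ci, err := sub.ConsumerInfo()
    if err != nil {
        panic(err)
    }
    name := ci.Name

    // Drop client-side interest; the ephemeral consumer should go away.
    _ = sub.Unsubscribe()

    // Poll for a few seconds; a lookup error ("consumer not found" or similar)
    // means the server cleaned it up as expected.
    for i := 0; i < 10; i++ {
        time.Sleep(500 * time.Millisecond)
        if _, err := js.ConsumerInfo(ci.Stream, name); err != nil {
            fmt.Println("consumer removed:", err)
            return
        }
    }
    fmt.Println("consumer", name, "still present after unsubscribe")
}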

derekcollison commented 2 years ago

OK thanks, digging into this one today. Thanks for the patience.

derekcollison commented 2 years ago

I now have an idea of what is going on. We have a bit of a perfect storm: many goroutines become blocked on disk IO and get removed from the runnable pool. We then also have a run on the bank, with many goroutines becoming runnable at once, which causes the Go runtime to create new threads until we hit the default 10k limit.

I have some ideas that I will test out after a bit more thinking on this, but this will happen today. Will keep this thread updated.
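
For readers unfamiliar with the mechanism being described, a small standalone Go sketch (illustrative only, not nats-server code): goroutines blocked in disk-IO syscalls each pin an OS thread, while the runtime keeps creating new threads for runnable goroutines, up to the default limit of 10,000 that debug.SetMaxThreads controls:

package main

import (
    "fmt"
    "os"
    "runtime/debug"
    "runtime/pprof"
    "sync"
    "time"
)

func main() {
    // The Go runtime refuses to create more than 10,000 OS threads by default;
    // SetMaxThreads returns the previous limit.
    fmt.Println("previous max-threads limit:", debug.SetMaxThreads(10000))

    // Simulate many goroutines blocked on disk IO: each blocking syscall
    // (write + fsync here) occupies an OS thread for its duration.
    var wg sync.WaitGroup
    for i := 0; i < 500; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            f, err := os.CreateTemp("", "blocked-io-")
            if err != nil {
                return
            }
            defer os.Remove(f.Name())
            defer f.Close()
            buf := make([]byte, 4096)
            for j := 0; j < 200; j++ {
                f.Write(buf)
                f.Sync() // blocking syscall: this thread cannot run other goroutines
            }
        }()
    }

    // Meanwhile the runtime spins up additional threads so that runnable
    // goroutines (like this one) can still make progress.
    done := make(chan struct{})
    go func() { wg.Wait(); close(done) }()
    for {
        select {
        case <-done:
            fmt.Println("total OS threads created:", pprof.Lookup("threadcreate").Count())
            return
        case <-time.After(time.Second):
            fmt.Println("OS threads created so far:", pprof.Lookup("threadcreate").Count())
        }
    }
}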

derekcollison commented 2 years ago

Posted a PR with a fix, hopefully. I could recreate the bad behavior artificially well enough that I think this will solve it. Once it lands in main, hopefully tomorrow morning PT, we will cut a new nightly image for folks to try out.

derekcollison commented 2 years ago

This has landed and I manually kicked our nightly build process. Please test and report any issues. And thanks again for the patience.

OpenGuidou commented 2 years ago

OK, thanks. Is there an image version we can test it with?

derekcollison commented 2 years ago

https://hub.docker.com/r/synadia/nats-server always has a nightly build.

OpenGuidou commented 2 years ago

I was able to test with the latest version. At 10 sub/unsub per second it's OK. At 100 per second, the issue still occurs after about one minute. Let me know if I can provide more information about anything specific.

derekcollison commented 2 years ago

What occurs specifically? The log statements about consumers not being cleaned up, or does the system become unresponsive?

What are the memory and CPU usage at this time? Is it balanced or is one server spiking?