How many ephemeral consumers did you make?
Ephemeral consumers are always R1, so there shouldn't be any leadership migrations, etc.
jetstream_server_total_consumers and nats con ls seem to report completely different numbers (see below). We have 15 durable consumers; the rest are ephemeral.
❯ nats con ls MM_LOBBY | wc -l
2577
Here's the relevant code for creating the consumer config using nats.rs:
conn.create_consumer(
    stream,
    nats::jetstream::ConsumerConfig {
        // no durable_name, so this is an ephemeral push consumer
        deliver_subject: Some(deliver_subject),
        durable_name: None,
        deliver_policy: nats::jetstream::DeliverPolicy::All,
        ack_policy: nats::jetstream::AckPolicy::None,
        filter_subject,
        replay_policy: nats::jetstream::ReplayPolicy::Instant,
        ..Default::default()
    },
)?;
These numbers are pretty rough; I can get back to you with more specifics if there's interest.
Edit: Note that the consumer count going back down in the graph is from us disabling all traffic to JetStream. The nats con ls command was executed during the peak time.
Hello, same issue observed on my side. Here is sample (Java) code for the ephemeral consumers:
final String consumerSubject = "XXXX.>";
final String specificSubject = "XXXX.YYYY"; // this part is variable for each consumer

final CompletableFuture<Void> future = new CompletableFuture<>();
final MessageHandler msgHandler = msg -> {
    log.info("Received message");
    future.complete(null);
};

final Dispatcher dispatcher = nc.createDispatcher();
final JetStream js = nc.jetStream();
final PushSubscribeOptions options = PushSubscribeOptions.builder()
        .stream(config.getStreamName())
        .configuration(ConsumerConfiguration.builder()
                .filterSubject(specificSubject)
                .build())
        .build();

// no durable name, so this creates an ephemeral push consumer
js.subscribe(consumerSubject, this.config.getQueue(), dispatcher, msgHandler,
        true, options);

future.completeOnTimeout(null, 2, TimeUnit.SECONDS);
future.thenApply(message -> {
            this.nc.closeDispatcher(dispatcher);
            // rest of the process
            return null;
        })
        .exceptionally(exception -> {
            this.nc.closeDispatcher(dispatcher);
            log.error("Exception occurred", exception);
            return null;
        });
Stream config:
Subjects: "XXXX.>"
Replicas: 3
Max age: 60min
Retention policy: limits
Storage type: file
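For completeness, the same stream can be declared programmatically; this is a minimal sketch using the Java client's JetStreamManagement API (the stream name XXXX_STREAM is a placeholder, and an existing Connection is assumed):

import io.nats.client.Connection;
import io.nats.client.JetStreamApiException;
import io.nats.client.JetStreamManagement;
import io.nats.client.api.RetentionPolicy;
import io.nats.client.api.StorageType;
import io.nats.client.api.StreamConfiguration;
import java.io.IOException;
import java.time.Duration;

// Declare a stream matching the config above: "XXXX.>", R3, 60 min max age,
// limits retention, file storage. "XXXX_STREAM" is a placeholder name.
static void createStream(Connection nc) throws IOException, JetStreamApiException {
    JetStreamManagement jsm = nc.jetStreamManagement();
    StreamConfiguration streamConfig = StreamConfiguration.builder()
            .name("XXXX_STREAM")
            .subjects("XXXX.>")
            .replicas(3)
            .maxAge(Duration.ofMinutes(60))
            .retentionPolicy(RetentionPolicy.Limits)
            .storageType(StorageType.File)
            .build();
    jsm.addStream(streamConfig);
}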
Each consumer process takes a few milliseconds to execute. After a few minutes of subscriptions being created and closed at a rate of around 100 per second, the issue starts appearing.
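To make that churn concrete, here is a rough sketch of the create/close cycle (streamName and queueName are placeholders; the 10 ms sleep approximates the few milliseconds of work per consumer and the ~100 subscriptions per second mentioned above):

import io.nats.client.Connection;
import io.nats.client.Dispatcher;
import io.nats.client.JetStream;
import io.nats.client.PushSubscribeOptions;
import io.nats.client.api.ConsumerConfiguration;

// Repeatedly create and tear down ephemeral push consumers, roughly 100 per second.
static void churnEphemerals(Connection nc, String streamName, String queueName) throws Exception {
    JetStream js = nc.jetStream();
    for (int i = 0; i < 60_000; i++) {
        Dispatcher dispatcher = nc.createDispatcher();
        PushSubscribeOptions options = PushSubscribeOptions.builder()
                .stream(streamName)
                .configuration(ConsumerConfiguration.builder()
                        .filterSubject("XXXX.YYYY")   // variable per consumer in the real code
                        .build())
                .build();
        js.subscribe("XXXX.>", queueName, dispatcher, msg -> { }, true, options);
        Thread.sleep(10);                 // consumer work takes a few milliseconds
        nc.closeDispatcher(dispatcher);   // unsubscribes everything bound to this dispatcher
    }
}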
For the folks reporting these, were the ephemerals unsubscribed? What code was executed when the ephemeral consumers were done?
In my case it's the this.nc.closeDispatcher(dispatcher) call that does the job (in both regular and exceptional outcomes), by unsubscribing all subscriptions associated with the dispatcher and freeing the thread. I tried with an explicit unsubscribe before, with the same result.
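For reference, an explicit-unsubscribe variant would look roughly like this (a sketch reusing the names from the snippet above; the subscription handle returned by js.subscribe is unsubscribed from its dispatcher before the dispatcher is closed):

// Sketch: keep the JetStreamSubscription handle and unsubscribe it explicitly
// before closing the dispatcher (names reused from the snippet above).
final JetStreamSubscription sub = js.subscribe(consumerSubject, this.config.getQueue(),
        dispatcher, msgHandler, true, options);

future.thenApply(message -> {
            dispatcher.unsubscribe(sub);         // drop the ephemeral subscription explicitly
            this.nc.closeDispatcher(dispatcher); // then release the dispatcher thread
            return null;
        });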
OK thanks, digging into this one today. Thanks for the patience.
I now have an idea of what is going on. We have a bit of a perfect storm: many goroutines become blocked on disk IO and get removed from the runnable pool, and then we also have a run on the bank with many goroutines becoming runnable, which causes the Go runtime to create new threads until we hit the default 10k limit.
I have some ideas that I will test out after a bit more thinking on this, but this will happen today. Will keep this thread updated.
Posted a PR with a fix, hopefully. I could recreate the bad behavior artificially well enough that I think this will solve it. Once it lands in main, hopefully tomorrow AM PT, we will cut a new nightly image for folks to try out.
This has landed and I manually kicked our nightly build process. Please test and report any issues. And thanks again for the patience.
OK, thanks. Is there an image version we can test it with?
https://hub.docker.com/r/synadia/nats-server always has a nightly build.
I was able to test with the latest version. At 10 sub/unsub per second it's OK. At 100 per second, the issue still occurs after one minute. Let me know if I can provide more information about anything specific.
What occurs specifically? The log statements about not being able to see the consumers clean up, or does the system become unresponsive?
What is the memory and CPU usage at this time? Is it balanced, or is one server spiking?
Defect
When using a lot of ephemeral consumers, a cluster with significant load eventually starts spamming log output like the sample below:
I'm unsure whether leader migration for ephemeral consumers is normal when none of the servers have rebooted.
nats str info of an affected stream:
nats con info of an affected ephemeral consumer:
nats-server -DV output: can't reproduce in staging
Versions of nats-server and affected client libraries used: Server: Docker nats:2.6.6
OS/Container environment:
docker-compose Ubuntu 20.04
Steps or code to reproduce the issue:
Expected result:
Server CPU should remain low. We should not see warnings in the console. Ephemeral consumers should be removed, not migrated.
Actual result:
CPU sits at effectively 0% until a steady stream of messages (~70 msgs/sec) is sent through the cluster, and then CPU load steadily increases. The two screenshots below are of the same 6-hour window.