Open mprimi opened 1 year ago
Each async subscription would reach the 64MB buffering limit before becoming a slow consumer, so it looks like it is running out of memory after about 1000 subscriptions have become slow consumers. https://github.com/nats-io/nats.go/blob/main/nats.go#L4650
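For context, the per-subscription buffering referred to above is governed by the subscription's pending limits. A minimal sketch that prints the defaults (the bytes value is the 64MB ceiling mentioned); the subject name is illustrative and a local server at the default URL is assumed:

```go
package main

import (
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// Async subscription: messages wait in its internal pending buffer until
	// the handler drains them.
	sub, err := nc.Subscribe("demo.subject", func(_ *nats.Msg) {})
	if err != nil {
		log.Fatal(err)
	}

	// Print the default per-subscription pending limits; the bytes value is
	// the 64MB ceiling referenced above. With N subscriptions, worst-case
	// buffering is roughly N times that.
	msgLimit, bytesLimit, err := sub.PendingLimits()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("default pending limits: %d msgs, %d bytes\n", msgLimit, bytesLimit)
}
```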
Would you have a minimal reproduction, just in case? @mprimi
Defect

Versions of `nats.go` and the `nats-server` if one was involved:
- `nats.go` @ 95a7e5090fda9532342375fe2954362905d43434 (today's `main` branch)
- `nats-server` @ 2.9.8 (probably not relevant)

OS/Container environment:
Not relevant
Steps or code to reproduce the issue:
This test (artificially) reproduces the situation in just a few seconds: https://github.com/mprimi/nats.go/commit/78cee182ea34ed864e2b67653a9b81fd2f1ecf1c
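Not the linked test itself, but a minimal sketch of the same scenario, assuming a local server at the default URL (subject name, counts, and payload size are illustrative):

```go
package main

import (
	"log"
	"runtime"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	const numSubs = 5000
	payload := make([]byte, 10*1024) // 10KiB per message

	// Many async subscriptions on one subject, none of which make progress,
	// so every delivered message sits in a per-subscription pending buffer.
	for i := 0; i < numSubs; i++ {
		_, err := nc.Subscribe("blowup.subject", func(_ *nats.Msg) {
			select {} // block forever: a consumer that never keeps up
		})
		if err != nil {
			log.Fatal(err)
		}
	}

	// Publish continuously and watch heap usage climb.
	for i := 0; ; i++ {
		if err := nc.Publish("blowup.subject", payload); err != nil {
			log.Fatal(err)
		}
		if i%1000 == 0 {
			var ms runtime.MemStats
			runtime.ReadMemStats(&ms)
			log.Printf("published %d msgs, heap in use: %d MiB", i, ms.HeapInuse/1024/1024)
			time.Sleep(10 * time.Millisecond)
		}
	}
}
```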
Expected result:
A client with slow consumers starts dropping messages. (This is NATS without JetStream, so message loss is acceptable.)
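For reference, per-subscription slow-consumer drops can already be observed via the connection's asynchronous error handler; a small sketch (names illustrative, subscription setup elided):

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// Register an asynchronous error handler so per-subscription drops are
	// visible instead of silent.
	nc, err := nats.Connect(nats.DefaultURL,
		nats.ErrorHandler(func(_ *nats.Conn, sub *nats.Subscription, err error) {
			if err == nats.ErrSlowConsumer && sub != nil {
				dropped, _ := sub.Dropped()
				log.Printf("slow consumer on %q: %d messages dropped so far",
					sub.Subject, dropped)
			}
		}))
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// ... subscriptions and publishing as in the repro sketch above ...
}
```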
Actual result:
Messages build up in the consuming client until the program starts thrashing or gets OOM-killed.
Comments
The attached repro shows how a client can blow up in just a few seconds.
The test is artificial; probably no real-world application behaves this way (e.g., thousands of subscriptions to the same subject, none of them being consumed).
However, this behavior generalizes, and the real takeaway is this:
If an application is consuming messages at a rate slower than they are being published, then client memory usage keeps growing without bound. Eventually the application will run out of memory and crash (it may take hours/days/weeks instead of seconds).
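To make the hours/days/weeks point concrete, a back-of-the-envelope sketch; all rates, sizes, and the memory budget are made up:

```go
package main

import "fmt"

func main() {
	const (
		publishRate = 1000.0  // messages/second arriving
		consumeRate = 990.0   // messages/second actually processed
		msgSize     = 10240.0 // bytes per message (10KiB)
		memBudget   = 8 << 30 // 8 GiB available to the process
	)

	// The backlog grows at (publish rate - consume rate) * message size bytes per second.
	growthBytesPerSec := (publishRate - consumeRate) * msgSize
	hoursToExhaust := float64(memBudget) / growthBytesPerSec / 3600

	fmt.Printf("backlog grows at %.0f KiB/s; ~%.1f hours until 8 GiB is exhausted\n",
		growthBytesPerSec/1024, hoursToExhaust)
}
```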
Suggested change
The client should cap the amount of memory used to store unconsumed messages. Once the limit is reached, the client should start dropping messages.
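This is not the client-wide cap proposed here, but a per-subscription mitigation available today is to shrink the pending limits so that the aggregate worst case stays bounded; a sketch (numbers and subject are illustrative):

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	for i := 0; i < 5000; i++ {
		sub, err := nc.Subscribe("demo.subject", func(_ *nats.Msg) {
			// slow or stalled handler
		})
		if err != nil {
			log.Fatal(err)
		}
		// Cap this subscription at 1000 messages or 1 MiB, whichever is hit
		// first; excess messages are dropped (and reported as slow-consumer
		// errors) instead of buffered. Worst case across 5000 subscriptions
		// is then roughly 5 GiB rather than hundreds of GiB.
		if err := sub.SetPendingLimits(1000, 1024*1024); err != nil {
			log.Fatal(err)
		}
	}

	select {} // keep the process alive for the sake of the sketch
}
```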
Optimizations
This experiment also suggests a couple possible optimizations.
Take the following situation: 5000 subscriptions to the same subject, each with 1000 unconsumed messages of 10KiB pending.
As of now, the subscriber client will use an estimated ~50GiB (10KiB × 1000 messages × 5000 subscriptions).
1) If the message content was de-duped internally and shared across subscriptions, then the memory footprint could be just 10MiB (10KiB * 1000 messages)
2) If the message content was further de-duped because the client notices it's the same message being published over and over, then the footprint could be just 10KiB
(I'm not suggesting semantic de-duping -- just internal memory-storage de-duping, not visible outside the client)
These optimizations may not be very beneficial in 'real world' scenarios, since a client may not often be receiving tons of duplicate messages in the same subscription or across subscriptions.
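For illustration only (this is not how nats.go stores messages today), a toy sketch of the kind of storage-level de-duplication described in optimization 1: one copy per unique payload, shared across pending buffers via reference counting.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// payloadStore keeps a single copy of each unique payload, plus a count of
// how many pending messages currently reference it.
type payloadStore struct {
	data map[[32]byte][]byte
	refs map[[32]byte]int
}

func newPayloadStore() *payloadStore {
	return &payloadStore{data: map[[32]byte][]byte{}, refs: map[[32]byte]int{}}
}

// intern stores the payload if unseen and returns its content key.
func (s *payloadStore) intern(p []byte) [32]byte {
	k := sha256.Sum256(p)
	if _, ok := s.data[k]; !ok {
		s.data[k] = append([]byte(nil), p...)
	}
	s.refs[k]++
	return k
}

// release drops one reference and frees the payload once unused.
func (s *payloadStore) release(k [32]byte) {
	s.refs[k]--
	if s.refs[k] <= 0 {
		delete(s.refs, k)
		delete(s.data, k)
	}
}

func main() {
	s := newPayloadStore()
	msg := make([]byte, 10*1024) // the same 10KiB payload fanned out to 5000 subscriptions
	keys := make([][32]byte, 0, 5000)
	for i := 0; i < 5000; i++ {
		keys = append(keys, s.intern(msg))
	}
	fmt.Printf("%d pending references, %d stored copies\n", len(keys), len(s.data))
	for _, k := range keys {
		s.release(k)
	}
}
```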