nats-io / nats.go

Golang client for NATS, the cloud native messaging system.
https://nats.io

Unbound memory footprint growth with slow consumers #1163

Open mprimi opened 1 year ago

mprimi commented 1 year ago

Defect

Versions of nats.go and the nats-server if one was involved:

nats.go @ 95a7e5090fda9532342375fe2954362905d43434 (today's main branch)
nats-server @ 2.9.8 (probably not relevant)

OS/Container environment:

Not relevant

Steps or code to reproduce the issue:

This test (artificially) reproduces the situation in just a few seconds: https://github.com/mprimi/nats.go/commit/78cee182ea34ed864e2b67653a9b81fd2f1ecf1c
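
For reference, a minimal sketch of this kind of repro (not the linked commit; the subject name, subscription count, and payload size below are made up for illustration) could look like this:

    // Sketch only: many async subscriptions on one subject whose handlers never
    // return, while the same process keeps publishing 10 KiB payloads.
    package main

    import (
        "fmt"
        "runtime"

        "github.com/nats-io/nats.go"
    )

    func main() {
        nc, err := nats.Connect(nats.DefaultURL)
        if err != nil {
            panic(err)
        }
        defer nc.Close()

        block := make(chan struct{}) // never closed, so no handler ever completes

        // Thousands of async subscriptions on the same subject, none consuming.
        for i := 0; i < 5000; i++ {
            if _, err := nc.Subscribe("repro.subject", func(m *nats.Msg) {
                <-block // simulate a consumer that never keeps up
            }); err != nil {
                panic(err)
            }
        }

        payload := make([]byte, 10*1024) // 10 KiB per message
        var ms runtime.MemStats
        for i := 1; ; i++ {
            if err := nc.Publish("repro.subject", payload); err != nil {
                panic(err)
            }
            if i%100 == 0 {
                runtime.ReadMemStats(&ms)
                fmt.Printf("Published %d messages (%d MiB), runtime mem: %d MiB\n",
                    i, i*10/1024, ms.Alloc/1024/1024)
            }
        }
    }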

Expected result:

The client with slow consumers starts dropping messages. (This is core NATS without JetStream, so message loss is acceptable.)

Actual result:

Messages build up in the consuming client until the program starts thrashing or gets OOM-killed.

Comments

The attached repro shows how a client can blow up in just a few seconds:

    [...]
    Published 1319 messages (12 MiB), runtime mem: 59 GiB
    Published 1379 messages (13 MiB), runtime mem: 62 GiB
    Published 1435 messages (14 MiB), runtime mem: 65 GiB
    Published 1551 messages (15 MiB), runtime mem: 70 GiB
    Published 1607 messages (15 MiB), runtime mem: 73 GiB
    Published 1663 messages (16 MiB), runtime mem: 76 GiB

Process finished with the exit code 1 (OOM panic)

The test is artificial; probably no real-world application behaves this way (e.g. thousands of subscriptions to the same subject, none of them being consumed).

However, this behavior generalizes, and this is the real takeaway:

If an application consumes messages from its subscriptions more slowly than they are published, the client's memory usage keeps growing without bound. Eventually the application runs out of memory and crashes (it may take hours, days, or weeks instead of seconds).
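
As a rough illustration of the timescale (numbers made up): a client that falls behind by 1 MiB/s with 4 GiB of memory available is OOM-killed after about 4,096 seconds, a little over an hour; falling behind by only 1 KiB/s stretches that to roughly 48 days, but the outcome is the same.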

Suggested change

The client should cap the amount of memory used to store unconsumed messages. Once the limit is reached, the client should start dropping messages.
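
As a partial mitigation that exists today (a sketch using the current nats.go API; the subject name and limit values are illustrative): each subscription's pending limits can be tightened so a slow subscription starts dropping sooner, with the drops surfaced through the async error handler.

    package main

    import (
        "errors"
        "log"

        "github.com/nats-io/nats.go"
    )

    func main() {
        nc, err := nats.Connect(nats.DefaultURL,
            // Invoked asynchronously, including when a subscription exceeds its pending limits.
            nats.ErrorHandler(func(_ *nats.Conn, sub *nats.Subscription, err error) {
                if sub != nil && errors.Is(err, nats.ErrSlowConsumer) {
                    dropped, _ := sub.Dropped()
                    log.Printf("slow consumer on %q, %d messages dropped so far", sub.Subject, dropped)
                }
            }),
        )
        if err != nil {
            log.Fatal(err)
        }
        defer nc.Close()

        sub, err := nc.Subscribe("some.subject", func(m *nats.Msg) {
            // ... slow processing ...
        })
        if err != nil {
            log.Fatal(err)
        }

        // Cap this subscription's buffer at 1,000 messages or 8 MiB, whichever is
        // hit first (the defaults are much larger); beyond that, new messages for
        // this subscription are dropped and the error handler above fires.
        if err := sub.SetPendingLimits(1000, 8*1024*1024); err != nil {
            log.Fatal(err)
        }

        select {} // keep running
    }

Note that this caps each subscription individually; the suggestion above is about an overall cap, since the total buffered across many subscriptions can still exceed available memory.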

Optimizations

This experiment also suggests a couple of possible optimizations.

Take the following situation: 5,000 subscriptions on the same subject, with 1,000 messages of 10 KiB each published and none of them consumed.

As of now, the subscriber client will use an estimated ~50 GiB (10 KiB × 1,000 messages × 5,000 subscriptions).

1) If the message content was de-duped internally and shared across subscriptions, then the memory footprint could be just 10 MiB (10 KiB × 1,000 messages)

2) If the message content was further de-duped because the client notices it's the same message being published over and over, then the footprint could be just 10 KiB

(I'm not suggesting semantic de-duplication -- just de-duplication of the internal memory storage, not visible outside the client)
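
To illustrate option 1) only (this is not how nats.go is structured internally, just a sketch of the idea): the client's dispatch path could hand every matching subscription a pointer to the same payload buffer instead of a per-subscription copy, and drop when a subscription's queue is already full.

    // Illustrative fragment only: one shared payload fanned out to many
    // per-subscription queues, so the 10 KiB body exists once in memory.
    type inboundMsg struct {
        subject string
        data    []byte // single buffer referenced by every subscription's entry
    }

    // fanOut delivers one parsed message to every matching subscription queue
    // without copying the payload, and drops for queues that are already full.
    func fanOut(queues []chan *inboundMsg, subject string, data []byte) (dropped int) {
        m := &inboundMsg{subject: subject, data: data}
        for _, q := range queues {
            select {
            case q <- m: // each queue stores only a pointer to the shared message
            default:
                dropped++ // bounded queue: drop instead of buffering without limit
            }
        }
        return dropped
    }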

These optimizations may not be very beneficial in real-world scenarios, since a client may not often receive many duplicate messages within the same subscription or across subscriptions.

wallyqs commented 1 year ago

Each async subscription would reach the 64 MB buffering limit before becoming a slow consumer, so it looks like it is running out of memory after about 1,000 subscriptions have become slow consumers. https://github.com/nats-io/nats.go/blob/main/nats.go#L4650
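
That estimate amounts to roughly 1,000 subscriptions × 64 MB ≈ 64 GB, the same order of magnitude as the memory figures in the log above. For context, and assuming the defaults are unchanged from the revision referenced in the report, those per-subscription limits are exported constants in nats.go:

    // Default per-subscription pending limits in nats.go; each async subscription
    // may buffer up to this much before it is flagged a slow consumer and starts
    // dropping messages for that subscription.
    const (
        DefaultSubPendingMsgsLimit  = 512 * 1024       // messages
        DefaultSubPendingBytesLimit = 64 * 1024 * 1024 // bytes (the 64 MB above)
    )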

cesarvspr commented 1 year ago

Would you have a minimal reproduction just in case? @mprimi