nats-io / nats.go

Golang client for NATS, the cloud native messaging system.
https://nats.io

Unbound memory footprint growth with slow consumers #1163

Open mprimi opened 1 year ago

mprimi commented 1 year ago

Defect

Versions of nats.go and the nats-server if one was involved:

nats.go @ 95a7e5090fda9532342375fe2954362905d43434 (today's main branch)
nats-server @ 2.9.8 (probably not relevant)

OS/Container environment:

Not relevant

Steps or code to reproduce the issue:

This test (artificially) reproduces the situation in just a few seconds: https://github.com/mprimi/nats.go/commit/78cee182ea34ed864e2b67653a9b81fd2f1ecf1c
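
For reference, a minimal sketch of this kind of repro (not the linked commit; the subject name, subscription count, and payload size below are made up for illustration) could look like this:

    // Sketch only: many async subscriptions on one subject whose handlers never
    // return, while the same process keeps publishing 10 KiB payloads.
    package main

    import (
        "fmt"
        "runtime"

        "github.com/nats-io/nats.go"
    )

    func main() {
        nc, err := nats.Connect(nats.DefaultURL)
        if err != nil {
            panic(err)
        }
        defer nc.Close()

        block := make(chan struct{}) // never closed, so no handler ever completes

        // Thousands of async subscriptions on the same subject, none consuming.
        for i := 0; i < 5000; i++ {
            if _, err := nc.Subscribe("repro.subject", func(m *nats.Msg) {
                <-block // simulate a consumer that never keeps up
            }); err != nil {
                panic(err)
            }
        }

        payload := make([]byte, 10*1024) // 10 KiB per message
        var ms runtime.MemStats
        for i := 1; ; i++ {
            if err := nc.Publish("repro.subject", payload); err != nil {
                panic(err)
            }
            if i%100 == 0 {
                runtime.ReadMemStats(&ms)
                fmt.Printf("Published %d messages (%d MiB), runtime mem: %d MiB\n",
                    i, i*10/1024, ms.Alloc/1024/1024)
            }
        }
    }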

Expected result:

The client with slow consumers starts dropping messages. (This is core NATS without JetStream, so message loss is acceptable.)

Actual result:

Messages build up in the consuming client until the program starts thrashing or gets OOM-killed.

Comments

The attached repro shows how a client can blow up in just a few seconds:

    [...]
    Published 1319 messages (12 MiB), runtime mem: 59 GiB
    Published 1379 messages (13 MiB), runtime mem: 62 GiB
    Published 1435 messages (14 MiB), runtime mem: 65 GiB
    Published 1551 messages (15 MiB), runtime mem: 70 GiB
    Published 1607 messages (15 MiB), runtime mem: 73 GiB
    Published 1663 messages (16 MiB), runtime mem: 76 GiB

Process finished with the exit code 1 (OOM panic)

The test is artificial; probably no real-world application behaves this way (e.g. thousands of subscriptions to the same subject, none of them being consumed).

However, this behavior generalizes, and this is the real takeaway:

If an application consumes messages from its subscriptions more slowly than they are published, the client's memory usage keeps growing without bound. Eventually the application runs out of memory and crashes (it may take hours, days, or weeks instead of seconds).
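
As a rough illustration of the timescale (numbers made up): a client that falls behind by 1 MiB/s with 4 GiB of memory available is OOM-killed after about 4,096 seconds, a little over an hour; falling behind by only 1 KiB/s stretches that to roughly 48 days, but the outcome is the same.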

Suggested change

The client should cap the amount of memory used to store unconsumed messages. Once the limit is reached, the client should start dropping messages.
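
As a partial mitigation that exists today (a sketch using the current nats.go API; the subject name and limit values are illustrative): each subscription's pending limits can be tightened so a slow subscription starts dropping sooner, with the drops surfaced through the async error handler.

    package main

    import (
        "errors"
        "log"

        "github.com/nats-io/nats.go"
    )

    func main() {
        nc, err := nats.Connect(nats.DefaultURL,
            // Invoked asynchronously, including when a subscription exceeds its pending limits.
            nats.ErrorHandler(func(_ *nats.Conn, sub *nats.Subscription, err error) {
                if sub != nil && errors.Is(err, nats.ErrSlowConsumer) {
                    dropped, _ := sub.Dropped()
                    log.Printf("slow consumer on %q, %d messages dropped so far", sub.Subject, dropped)
                }
            }),
        )
        if err != nil {
            log.Fatal(err)
        }
        defer nc.Close()

        sub, err := nc.Subscribe("some.subject", func(m *nats.Msg) {
            // ... slow processing ...
        })
        if err != nil {
            log.Fatal(err)
        }

        // Cap this subscription's buffer at 1,000 messages or 8 MiB, whichever is
        // hit first (the defaults are much larger); beyond that, new messages for
        // this subscription are dropped and the error handler above fires.
        if err := sub.SetPendingLimits(1000, 8*1024*1024); err != nil {
            log.Fatal(err)
        }

        select {} // keep running
    }

Note that this caps each subscription individually; the suggestion above is about an overall cap, since the total buffered across many subscriptions can still exceed available memory.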

Optimizations

This experiment also suggests a couple of possible optimizations.

Take the following situation: 5,000 subscriptions on the same subject, with 1,000 messages of 10 KiB each published and none of them consumed.

As of now, the subscriber client will use an estimated ~50 GiB (10 KiB × 1,000 messages × 5,000 subscriptions).

1) If the message content was de-duped internally and shared across subscriptions, then the memory footprint could be just 10 MiB (10 KiB × 1,000 messages)

2) If the message content was further de-duped because the client notices it's the same message being published over and over, then the footprint could be just 10 KiB

(I'm not suggesting semantic de-duplication -- just de-duplication of the internal memory storage, not visible outside the client)
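
To illustrate option 1) only (this is not how nats.go is structured internally, just a sketch of the idea): the client's dispatch path could hand every matching subscription a pointer to the same payload buffer instead of a per-subscription copy, and drop when a subscription's queue is already full.

    // Illustrative fragment only: one shared payload fanned out to many
    // per-subscription queues, so the 10 KiB body exists once in memory.
    type inboundMsg struct {
        subject string
        data    []byte // single buffer referenced by every subscription's entry
    }

    // fanOut delivers one parsed message to every matching subscription queue
    // without copying the payload, and drops for queues that are already full.
    func fanOut(queues []chan *inboundMsg, subject string, data []byte) (dropped int) {
        m := &inboundMsg{subject: subject, data: data}
        for _, q := range queues {
            select {
            case q <- m: // each queue stores only a pointer to the shared message
            default:
                dropped++ // bounded queue: drop instead of buffering without limit
            }
        }
        return dropped
    }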

These optimizations may not be very beneficial in real-world scenarios, since a client may not often receive many duplicate messages within the same subscription or across subscriptions.

wallyqs commented 1 year ago

Each async subscription would reach the 64 MB buffering limit before becoming a slow consumer, so it looks like it is running out of memory after about 1,000 subscriptions have become slow consumers. https://github.com/nats-io/nats.go/blob/main/nats.go#L4650
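
That estimate amounts to roughly 1,000 subscriptions × 64 MB ≈ 64 GB, the same order of magnitude as the memory figures in the log above. For context, and assuming the defaults are unchanged from the revision referenced in the report, those per-subscription limits are exported constants in nats.go:

    // Default per-subscription pending limits in nats.go; each async subscription
    // may buffer up to this much before it is flagged a slow consumer and starts
    // dropping messages for that subscription.
    const (
        DefaultSubPendingMsgsLimit  = 512 * 1024       // messages
        DefaultSubPendingBytesLimit = 64 * 1024 * 1024 // bytes (the 64 MB above)
    )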

cesarvspr commented 1 year ago

Would you have a minimal reproduction just in case? @mprimi