nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

OOM kill on a small system with workqueue consumers at high speed resulting in fast memory growth - #5673

Closed roeschter closed 1 month ago

roeschter commented 1 month ago

Observed behavior

Consuming from a workqueue with multiple consumers (subject segmented) at high speed (150 MB/s) results in rapid memory growth and eventual exhaustion. The server dies with a range of symptoms: mostly silently OOM killed, sometimes throwing a readloop processing-time warning first.

Testing on a small system with 4GB and 2 CPUs - this is deliberate, to simulate an edge node.

This does not seem to be a memory leak per se. Repeated tests with a below-threshold message volume do not trigger the effect.

Expected behavior

The server should self-limit memory growth or throughput.

Server and client version

Client: jnats-2.19.1.jar on Java 17. Server: 2.10.17 and 2.11-preview-rc3 (self-compiled). The memory growth is not platform specific; it is seen on Windows as well, but crashes were observed only on smaller systems with <8GB.

Host environment

Linux - Old Kernel Linux lvps5-35-240-105.dedicated.hosteurope.de 3.10.0-1160.80.1.vz7.191.4 #1 SMP Thu Dec 15 20:31:06 MSK 2022 x86_64 x86_64 x86_64 GNU/Linux

Steps to reproduce

playground.jar.zip nats00_hosteurope.config.txt source.zip

Full eclipse project available in private repo: https://github.com/roeschter/natsPlayground

Run on a system with 4GB RAM (or less), CPU count >= 2, Java 17.

/var/lib/java17/bin/java -cp playground.jar:jnats-2.19.1.jar com.synadia.test.config03_performance.C03JetstreamWorQueueSlowDeletes --user user --password dja673b48fh3kqp9 --messages 200000 --messageSize 10000 --fetchsize 10 --subjects 10 --consumemode fetch

--messages 200000 should be sufficient on a 4GB system; increase to 1000000 if it does not crash. --messageSize 10000: increasing the message size may speed up the crash. The effect is not observable with small messages (throughput too low, I presume).

Memory should increase rapidly when the consumers start. Rate should be 10000 msg/s or more
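
For readers without the Java project, a rough Go equivalent of the consume pattern is sketched below (hypothetical, not the actual reproducer; stream name, subjects, credentials and consumer names are placeholders). It drains a pre-filled WorkQueue stream with ten subject-segmented pull consumers as fast as possible while you watch the server's resident memory:

```go
// Hypothetical Go sketch of the reproduction pattern, assuming a WorkQueue
// stream with subjects "test.*" already filled with large messages.
package main

import (
	"fmt"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://127.0.0.1:4222", nats.UserInfo("user", "password"))
	if err != nil {
		panic(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		panic(err)
	}

	// One pull consumer per subject segment, all fetching as fast as possible.
	for i := 0; i < 10; i++ {
		subject := fmt.Sprintf("test.%d", i)
		go func(subj string, id int) {
			sub, err := js.PullSubscribe(subj, fmt.Sprintf("worker-%d", id))
			if err != nil {
				panic(err)
			}
			for {
				msgs, err := sub.Fetch(10, nats.MaxWait(2*time.Second))
				if err != nil {
					continue // fetch timeout: keep polling
				}
				for _, m := range msgs {
					m.Ack()
				}
			}
		}(subject, i)
	}

	select {} // run until interrupted; observe the server's RSS meanwhile
}
```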

roeschter commented 1 month ago

Will experiment with export GOGC=nnn

Jarema commented 1 month ago

For controlling memory and preventing such OOMs under high load on insufficient resources, you should use GOMEMLIMIT. Keep in mind that if the limit is far below what the workload needs, it will throttle throughput considerably while putting more load on the CPU (as the GC will need to work much harder).
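
For reference, a minimal sketch of what these two knobs correspond to inside the Go runtime (with nats-server you would set the environment variables before starting the process rather than call these functions):

```go
// Illustration only: GOMEMLIMIT maps to debug.SetMemoryLimit, GOGC maps to
// debug.SetGCPercent. Neither is a hard cap on the process RSS.
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// Equivalent of GOMEMLIMIT=256MiB: a soft target for the Go heap. The GC
	// works harder as the heap approaches it, but memory held in pools, not
	// yet returned to the OS, or cached by the kernel is not bounded by it.
	prev := debug.SetMemoryLimit(256 << 20)

	// Equivalent of GOGC=20: trigger GC after ~20% heap growth instead of 100%.
	debug.SetGCPercent(20)

	fmt.Println("previous memory limit:", prev)
}
```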

roeschter commented 1 month ago

GOMEMLIMIT solves the immediate problem (I tried from 256MiB to 1024MiB), but the memory usage reported by the OS still vastly exceeds the limit I set.

Also: I would expect nats-server to self-limit to a reasonable memory footprint. "Reasonable" would be 50% of system memory when JetStream is enabled (because we need space for buffers and caches) and 70% for pure NATS Core.

My test system had 2.5GB of 4GB available for NATS when the test starts.

roeschter commented 1 month ago

[screenshot] E.g. in this case I set export GOMEMLIMIT=256MiB and export GOGC=20.

Resident memory still exceeds 2GB (consistent with the overall system memory report). Performance is not much impacted by the limits above.

Note that this only happens when reading from the stream; the footprint is small when writing.

roeschter commented 1 month ago

I have now created a state of the stream where nats-server exceeds all limits during startup while recovering the stream, and gets killed. export GOMEMLIMIT=256MiB, export GOGC=10. Stream size on disk: ~3.5GB. Resident memory grows to 2.4GB very fast, then the server is OOM killed.

neilalexander commented 1 month ago

Can you get a memory profile when the memory usage is inflated, either through nats server request profile allocs or https://docs.nats.io/running-a-nats-service/nats_admin/profiling?

neilalexander commented 1 month ago

Also: I would expect the nats-server to self-limit to a reasonable memory footprint. "Reasonable" would be 50% of system memory when JS is enabled (because we need space with buffers and caches) and 70% for pure Nats Core.

Short of using GOMEMLIMIT, which is actually a runtime hint to the GC and not a hard limit, this is simply not practical to do. The runtime stats can't tell us the actual RSS size, nor about buffers/caches from the kernel, and we also don't have custom allocators in Go to enforce such limits. Even if we did, failing to allocate structures would just leave the system in an undefined state, which is worse than crashing.
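
To illustrate the gap between what the runtime can see and what the OS reports, here is a small standalone sketch (Linux-only, not nats-server code) that prints the Go heap statistics next to the kernel's VmRSS figure:

```go
// Go's runtime statistics describe the Go heap, not the process RSS that the
// kernel reports; the two can diverge substantially.
package main

import (
	"bufio"
	"fmt"
	"os"
	"runtime"
	"strings"
)

func main() {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	fmt.Printf("HeapInuse: %d MiB, HeapSys: %d MiB, HeapReleased: %d MiB\n",
		ms.HeapInuse>>20, ms.HeapSys>>20, ms.HeapReleased>>20)

	// The kernel's view of the process is often much larger than HeapInuse.
	f, err := os.Open("/proc/self/status")
	if err != nil {
		return
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if strings.HasPrefix(sc.Text(), "VmRSS:") {
			fmt.Println(sc.Text())
		}
	}
}
```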

We would need to figure out from profiles what is using the memory and whether that can be optimised for this case.

roeschter commented 1 month ago

Here are the memory profiles: 03_allocs_after_stream_delete_from_client.zip

Observations:

  1. After consuming messages, the allocated memory is not freed even after the client shuts down AND the stream is deleted. (03_allocs_after)
  2. What eventually frees the memory is a new client connection (without stream deletion) and resuming activity.
  3. Then the whole game starts over.

neilalexander commented 1 month ago

Out of interest, how did you capture these profiles? They're in text rather than binary format, which means the pprof visualiser can't parse them. Please make sure to use --level=0 if using the CLI or ?debug=0 if using the pprof HTTP endpoint.
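
For reference, a small standalone helper along these lines can save a binary allocs profile from the pprof HTTP endpoint described in the profiling docs (this assumes a profiling port is enabled in the server config, e.g. prof_port; the port number here is a placeholder):

```go
// Fetch a binary allocs profile from the server's pprof HTTP endpoint and
// write it to a file that `go tool pprof` can read.
package main

import (
	"io"
	"net/http"
	"os"
)

func main() {
	// debug=0 requests the binary format rather than the text dump.
	resp, err := http.Get("http://127.0.0.1:65432/debug/pprof/allocs?debug=0")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("allocs.pprof")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		panic(err)
	}
	// Inspect with: go tool pprof allocs.pprof
}
```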

roeschter commented 1 month ago

Human readable output via:

roeschter commented 1 month ago

Testing under Windows now, as the memory growth happens here as well. (just no OOM kill)

Binary Allocs: allocs.zip

The more I look into the behavior, the less deterministic it is. Under Windows the limit is respected: the server shrinks back to the set size. BUT under Linux it didn't. In any case, memory temporarily grows to roughly the amount consumed by the consumers over 10 seconds.

It seems the buffer management relies on https://pkg.go.dev/sync#Pool, and this apparently only cleans up after multiple GC cycles. But also, why is the memory not effectively being re-used? Who is not returning it fast enough?

neilalexander commented 1 month ago

The sync.Pools are indeed only purged during a GC run; that is by design. Otherwise they largely rely on the number of gets roughly matching the number of puts in order to stay balanced in their resource consumption.

In this case, looking at the profile, it appears as though lots of blocks are being loaded in from the filestore and held in the block cache. They should be held there for 10 seconds from the last access, after which the buffer will be returned to the pool. That would roughly explain why you see the correlation between the data accessed and the memory growth in that window.

Perhaps exposing the cache expiry as a configurable, i.e. being able to set it to less than 10 seconds, would help reduce memory usage at the expense of potentially more I/O operations. As an experiment, you could potentially test with a custom build that changes defaultCacheBufferExpiration in filestore.go to a lower value like time.Second * 2.
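
For illustration, a hypothetical sketch of that general pattern follows (NOT the actual filestore implementation; names, sizes and structure are made up). Block buffers come from a sync.Pool, stay cached after use, and only go back to the pool once the expiration window has passed since the last access; until then, and until a GC run purges the pool, the memory stays resident:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const cacheBufferExpiration = 10 * time.Second // the knob discussed above

var bufPool = sync.Pool{
	New: func() any { return make([]byte, 0, 8<<20) }, // 8 MiB block buffers
}

type cachedBlock struct {
	buf        []byte
	lastAccess time.Time
}

type blockCache struct {
	mu     sync.Mutex
	blocks map[uint64]*cachedBlock
}

func newBlockCache() *blockCache {
	return &blockCache{blocks: make(map[uint64]*cachedBlock)}
}

// load returns the cached buffer for a block, loading it on a cache miss and
// refreshing its last-access time.
func (c *blockCache) load(id uint64, read func([]byte) []byte) []byte {
	c.mu.Lock()
	defer c.mu.Unlock()
	blk, ok := c.blocks[id]
	if !ok {
		blk = &cachedBlock{buf: read(bufPool.Get().([]byte))}
		c.blocks[id] = blk
	}
	blk.lastAccess = time.Now()
	return blk.buf
}

// expire returns idle buffers to the pool; meant to be run periodically.
func (c *blockCache) expire() {
	c.mu.Lock()
	defer c.mu.Unlock()
	for id, blk := range c.blocks {
		if time.Since(blk.lastAccess) > cacheBufferExpiration {
			bufPool.Put(blk.buf[:0])
			delete(c.blocks, id)
		}
	}
}

func main() {
	cache := newBlockCache()
	// Pretend to read a 1 MiB block from disk into a pooled buffer.
	buf := cache.load(1, func(b []byte) []byte { return append(b, make([]byte, 1<<20)...) })
	fmt.Println("cached block size:", len(buf))
	cache.expire() // nothing expires yet; buffers linger for the full window
}
```

Lowering the expiration constant shortens how long idle buffers stay cached, at the cost of re-reading blocks from disk more often.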

roeschter commented 1 month ago

Correct! If the behavior of the block cache were documented (I can guess it exists, but I incorrectly assumed it to be of fixed size per connection) and its size or timeout were configurable, I would not have been surprised.

Last post on observed behavior:

It does not happen:

In summary:

  1. The block cache growth happens only in certain circumstances (hard to guess from the outside).
  2. It does not happen in "trivial" tests (the good cases).
  3. The current unconfigurable timeout makes the memory growth uncontrollable. I can make GOMEMLIMIT small, forcing the server to slow down, but that's it.
  4. It happens during stream recovery as well, resulting in nats-server crashing during startup.

roeschter commented 1 month ago

I will test with defaultCacheBufferExpiration in filestore.go set to a lower value like time.Second * 2.

roeschter commented 1 month ago

Reducing the buffer timeout to 1s limits the memory consumption to around 600MB in my tests, at the cost of reduced throughput, which is fine (still about 50MB/s from a workqueue).

Tested with defaultCacheBufferExpiration = time.Second * 1 on Windows and Linux, based on 2.11 preview rc3, in combination with GOMEMLIMIT=512MiB.

roeschter commented 1 month ago

Closing the investigation, but I would like the cache buffer expiration to be configurable as a small enhancement request. Should I try to patch this myself and submit it as a pull request?

derekcollison commented 1 month ago

If you could file an issue that would be great.

kohlisid commented 1 month ago

What is the stream maxBytes that you were using? @roeschter

roeschter commented 1 month ago

I was not using maxBytes (unlimited). The effect is purely about the speed of consuming. As I was testing on a 4GB device with 2.5GB free, a 2GB effective stream size was enough to trigger the OOM.

roeschter commented 1 month ago

Closing this in favor of a new CR - https://github.com/nats-io/nats-server/issues/5740