Implement a strategy to handle OOM in direct memory

andsel commented 1 year ago

Release notes

Direct memory used by plugin weren't tracked and this could generate out-of-memory crashes when unattended spikes or traffic loads reached the plugin. The plugin is changed to monitor how much direct memory it's using and in case it's close to the limit dropping connection to free some space. To control spikes of memory usage, the incoming message reads switched from push mode to pull, so that the plugin has control on rate of ingestion and it's not the determined by the clients.

What does this PR do?

Changes the way the plugin pulls data and handle incoming connections to respect the max direct memory size and avoid out of memory errors.

switches to pull read instead of push read. Instead to respond to every buffer that a connection present, it's the plugin that actively grab the byte buffer to process.
reject new connections if the direct memory is almost exhausted.
if OOM on direct memory happens, close the connection that receive it, trying to lowering the pressure on the memory.

This work is based on the existing #410.

Why is it important/What is the impact to the user?

Provides a set of countermeasures to limit the probability of OutOfMemory errors when creating new buffers in direct memory. In addition to this, it introduces a minimal amount fo direct memory (256MB) that's required to start processing data, if not respected, the pipeline used fails to start.

Checklist

[x] My code follows the style guidelines of this project
[x] I have commented my code, particularly in hard-to-understand areas
[ ] I have made corresponding changes to the documentation
[ ] I have made corresponding change to the default configuration files (and/or docker env variables)
[x] I have added tests that prove my fix is effective or that my feature works

Author's Checklist

[x] run with some real beats instances

How to test this PR locally

The test plan has some steps:

generate TLS certificates
build the plugin and configure Logstash
set up memory size limits, configure pipeline
run Logstash and benchmark it

create TLS certificates

clone repository https://github.com/jsvd/certificates
run make

build the plugin and configure Logstash

clone this repository and checkout the PR's branch
build it ./gradlew clean vendor
checkout the branch with the beats benchmarking tool: https://github.com/elastic/logstash/pull/15151

configure the plugin to use, edit Gemfileadding:

gem "logstash-input-beats", :path => "/Users/andrea/workspace/logstash_plugins/logstash-input-beats"

install bin/logstash-plugin install --no-verify

set up memory size limits, configure pipeline

in Logstash's config/jvm.options add:

-XX:MaxDirectMemorySize=128m
-XX:-MaxFDLimit

edit the test pipeline as


input {
beats {
port => 3333
ssl_enabled => true 
ssl_key => "/<path_to>/certificates/server_from_root.key"
ssl_certificate => "/<path_to>/certificates/server_from_root.crt"
ssl_client_authentication => "none"
}
}
filter {
ruby {
init => "Thread.new { loop { puts 'Direct mem: ' + Java::io.netty.buffer.ByteBufAllocator::DEFAULT.metric.used_direct_memory.to_s;sleep 5 } }"
code => ""
}
}

output{ sink {} }

- [optional] set ulimit properly if expects lots of connections

ulimit -S -n 1048576


### run Logstash and benchmark it
From a terminal launch Logstash process `bin/logstash -f path_to_config` and the benchmark script from another:
```sh
ruby -J-Xmx16g -J-XX:-MaxFDLimit benchmark_client.rb --test=beats -a yes

Expected outcome

The expectation is that direct memory consumption never goes up to the limit and if the client doesn't consume the ACKs messages (-a no) is the client that goes in error and not the Logstash side.

Test with real Filebeat

To start a bunch of Filebeat clients sending data to Beats input, just use the script ingest_all_hands/filebeat/multiple_filebeats.rb present in PRhttps://github.com/elastic/logstash/pull/15151

ruby multiple_filebeats.rb

It download a filebeat distribution, unpacks, generate some input file, prepare the data and logs folder for each Filebeat instance and run the processes.

Related issues

Closes #410
Fixes #309

Use cases

Screenshots

Logs

jsvd commented 1 year ago

something I'm observing is that maybe linger is not being set in the client connections, as I continue to see the same connection ids reporting oom errors well after the script that does load generation is gone:

https://gist.github.com/jsvd/a047320ff68bbe064e93dec0d6a251f7

andsel commented 1 year ago

@jsvd I used the PR #477 to test in isolation the BeatsParser, what I've observed (with netstat -anv -p tcp | grep <pid> | grep CLOSE_WAIT ) is tha:

the client creates 1500 connections and terminates quite fast
once it terminates, the 1500 connections go in CLOSE_WAIT state
the Netty side start reporting OOM exceptions or connection resets, for all the 1500 channels
at a certain point it finish to report exceptions, but this phase takes ~5 minutes

I think that when we reach an OOM error and we have the client that terminates immediately, we expect that all the 1500 channels also terminates immediately on Netty side, but in reality it takes minutes and this gives the idea of a looping error on the connections. Why it takes 5 minutes to stop logging channel's exceptions is not clear to me, maybe it's due to memory shortage.

Do you think that on server side, on first OOM error notification, should we close immediately all the other connections? In such case I think that it's an asynch request to the event loops responsible for the other connections, and it could also take such time, because at the end is what Netty is already doing.

andsel commented 1 year ago

@jsvd disabling of autoread is present in PR #485. When we will merge that PR, I'll remove the corresponding code from this, and refocus this to be second PR, to tackle the OOM problem.

robost001 commented 10 months ago

I tested PR475, manually merged it to 6.8.0 and started with logstash-7.17.17, then when calling V2Batch.realease I got a reference counting issue.

[2024-02-06T17:23:15,862][INFO ][org.logstash.beats.OOMConnectionCloser][main][c84edf99f631ad281446a904b8d1587b2f1505e2a620655c3115851f704f29e5] Direct
memory status, used: 27829207040, pinned: 23017497600, ratio: 83
[2024-02-06T17:23:15,862][WARN ][io.netty.channel.DefaultChannelPipeline][main][c84edf99f631ad281446a904b8d1587b2f1505e2a620655c3115851f704f29e5] An exc
eptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the excepti
on.
io.netty.handler.codec.DecoderException: io.netty.util.IllegalReferenceCountException: refCnt: 0, decrement: 1
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:499) ~[netty-codec-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290) ~[netty-codec-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[netty-transport-4.1.100.Final.jar:
4.1.100.Final]
        at io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:61) ~[netty-transport-4.1.100.Final.jar:4.1.100.
Final]
        at io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:425) ~[netty-transport-4.1.100.Final.jar:4.1.100.Fina
l]
        at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173) ~[netty-common-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.util.concurrent.DefaultEventExecutor.run(DefaultEventExecutor.java:66) ~[netty-common-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) [netty-common-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [netty-common-4.1.100.Final.jar:4.1.100.Final]
        at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: io.netty.util.IllegalReferenceCountException: refCnt: 0, decrement: 1
        at io.netty.util.internal.ReferenceCountUpdater.toLiveRealRefCnt(ReferenceCountUpdater.java:83) ~[netty-common-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.util.internal.ReferenceCountUpdater.release(ReferenceCountUpdater.java:148) ~[netty-common-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.buffer.AbstractReferenceCountedByteBuf.release(AbstractReferenceCountedByteBuf.java:101) ~[netty-buffer-4.1.100.Final.jar:4.1.100.Final]
        at org.logstash.beats.V2Batch.release(V2Batch.java:105) ~[logstash-input-beats-6.8.0.jar:?]
        at org.logstash.beats.BeatsParser.decode(BeatsParser.java:211) ~[logstash-input-beats-6.8.0.jar:?]
        at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:529) ~[netty-codec-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:468) ~[netty-codec-4.1.100.Final.jar:4.1.100.Final]
        ... 10 more

logstash-plugins / logstash-input-beats