logstash-plugins / logstash-input-beats

Beat stops processing events after OOM but keeps running #309

Open wkruse opened 6 years ago

wkruse commented 6 years ago

I am running Logstash 6.2.0 in Docker with the following config:

input {
  beats {
    port => 5044
  }
}

filter {
}

output {
  redis {
    host => "redis"
    key => "logstash"
    data_type => "list"
    batch => true
    batch_events => 125
  }
}

and 6 GB heap size.

LS_JAVA_OPTS: -Xmx6g
PIPELINE_WORKERS: 4

When overloading Logstash with a lot of events, with logstash-input-beats (5.0.6) I see the same pattern as in elastic/logstash#9195 (the used heap hits the max heap and event processing stops). The Beats input stops processing incoming events, and Logstash crashes most of the time.

After finding #286, I updated to logstash-input-beats (5.0.10). Now the used heap stays low, but I get a lot of these:

[2018-03-01T16:28:02,678][WARN ][io.netty.channel.DefaultChannelPipeline] An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 16777216 byte(s) of direct memory (used: 6419419652, max: 6425018368)
   at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:640) ~[netty-all-4.1.18.Final.jar:4.1.18.Final]
   at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:594) ~[netty-all-4.1.18.Final.jar:4.1.18.Final]
   at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:764) ~[netty-all-4.1.18.Final.jar:4.1.18.Final]
   at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:740) ~[netty-all-4.1.18.Final.jar:4.1.18.Final]
   at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:244) ~[netty-all-4.1.18.Final.jar:4.1.18.Final]
   at io.netty.buffer.PoolArena.allocate(PoolArena.java:214) ~[netty-all-4.1.18.Final.jar:4.1.18.Final]
   at io.netty.buffer.PoolArena.allocate(PoolArena.java:146) ~[netty-all-4.1.18.Final.jar:4.1.18.Final]
   at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:324) ~[netty-all-4.1.18.Final.jar:4.1.18.Final]
   at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:185) ~[netty-all-4.1.18.Final.jar:4.1.18.Final]
   at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:176) ~[netty-all-4.1.18.Final.jar:4.1.18.Final]
   at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:137) ~[netty-all-4.1.18.Final.jar:4.1.18.Final]
   at io.netty.channel.DefaultMaxMessagesRecvByteBufAllocator$MaxMessageHandle.allocate(DefaultMaxMessagesRecvByteBufAllocator.java:114) ~[netty-all-4.1.18.Final.jar:4.1.18.Final]
   at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:125) ~[netty-all-4.1.18.Final.jar:4.1.18.Final]
   at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645) ~[netty-all-4.1.18.Final.jar:4.1.18.Final]
   at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580) ~[netty-all-4.1.18.Final.jar:4.1.18.Final]
   at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497) ~[netty-all-4.1.18.Final.jar:4.1.18.Final]
   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459) [netty-all-4.1.18.Final.jar:4.1.18.Final]
   at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) [netty-all-4.1.18.Final.jar:4.1.18.Final]
   at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [netty-all-4.1.18.Final.jar:4.1.18.Final]
   at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
[2018-03-01T16:28:02,678][INFO ][org.logstash.beats.BeatsHandler] [local: x.x.x.x:5044, remote: x.x.x.x:62324] Handling exception: failed to allocate 16777216 byte(s) of direct memory (used: 6419419652, max: 6425018368)

I would prefer it to reliably crash (so it can be restarted automatically) instead of hanging around and doing nothing.

praseodym commented 6 years ago

The exception handling in BeatsHandler seems to be designed to handle connection errors by closing the connection. As you noticed, it doesn’t handle ‘global’ exceptions like OOM exceptions correctly:

https://github.com/logstash-plugins/logstash-input-beats/blob/aaed27771dc81e220a8926bdcd017c504edb85b1/src/main/java/org/logstash/beats/BeatsHandler.java#L82-L98
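
To illustrate the pattern (a simplified sketch, not the plugin's actual source): a Netty handler that closes the channel on every exception recovers from per-connection errors such as a connection reset, but not from an allocator-wide OutOfDirectMemoryError.

import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

public class CloseOnExceptionHandler extends ChannelInboundHandlerAdapter {
    @Override
    public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) {
        // Closing the offending channel is the right response to a
        // per-connection I/O error, but an OutOfDirectMemoryError comes from
        // the shared direct-memory pool, so the server keeps accepting
        // connections it can no longer serve.
        ctx.close();
    }
}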

As a workaround, consider adding -XX:+ExitOnOutOfMemoryError to LS_JAVA_OPTS. I’m not sure if this Netty-specific OOM exception will trigger a JVM exit with that option, though.
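
In the Docker setup from the first comment, that would mean extending the environment along these lines (just a sketch; whether this flag catches Netty's OutOfDirectMemoryError is exactly the open question):

LS_JAVA_OPTS: -Xmx6g -XX:+ExitOnOutOfMemoryError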

wkruse commented 6 years ago

@praseodym Thanks for the hint. I tried it today, but it didn't trigger a JVM exit. I'll downgrade to logstash-input-beats (5.0.6), since that version at least crashes most of the time.

wkruse commented 6 years ago

The exceptions in logstash-input-beats (5.0.6), when it doesn't crash on OOM:

[2018-03-05T09:02:37,270][ERROR][io.netty.util.concurrent.DefaultPromise.rejectedExecution] Failed to submit a listener notification task. Event loop shut down?
java.util.concurrent.RejectedExecutionException: event executor terminated
    at io.netty.util.concurrent.SingleThreadEventExecutor.reject(SingleThreadEventExecutor.java:821) ~[netty-all-4.1.18.Final.jar:4.1.18.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor.offerTask(SingleThreadEventExecutor.java:327) ~[netty-all-4.1.18.Final.jar:4.1.18.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor.addTask(SingleThreadEventExecutor.java:320) ~[netty-all-4.1.18.Final.jar:4.1.18.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor.execute(SingleThreadEventExecutor.java:746) ~[netty-all-4.1.18.Final.jar:4.1.18.Final]
    at io.netty.util.concurrent.DefaultPromise.safeExecute(DefaultPromise.java:760) [netty-all-4.1.18.Final.jar:4.1.18.Final]
    at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:428) [netty-all-4.1.18.Final.jar:4.1.18.Final]
    at io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163) [netty-all-4.1.18.Final.jar:4.1.18.Final]
    at io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93) [netty-all-4.1.18.Final.jar:4.1.18.Final]
    at io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28) [netty-all-4.1.18.Final.jar:4.1.18.Final]
    at io.netty.bootstrap.ServerBootstrap$ServerBootstrapAcceptor.channelRead(ServerBootstrap.java:255) [netty-all-4.1.18.Final.jar:4.1.18.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-all-4.1.18.Final.jar:4.1.18.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-all-4.1.18.Final.jar:4.1.18.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-all-4.1.18.Final.jar:4.1.18.Final]
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359) [netty-all-4.1.18.Final.jar:4.1.18.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-all-4.1.18.Final.jar:4.1.18.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-all-4.1.18.Final.jar:4.1.18.Final]
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935) [netty-all-4.1.18.Final.jar:4.1.18.Final]
    at io.netty.channel.nio.AbstractNioMessageChannel$NioMessageUnsafe.read(AbstractNioMessageChannel.java:93) [netty-all-4.1.18.Final.jar:4.1.18.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645) [netty-all-4.1.18.Final.jar:4.1.18.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580) [netty-all-4.1.18.Final.jar:4.1.18.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497) [netty-all-4.1.18.Final.jar:4.1.18.Final]
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459) [netty-all-4.1.18.Final.jar:4.1.18.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) [netty-all-4.1.18.Final.jar:4.1.18.Final]
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [netty-all-4.1.18.Final.jar:4.1.18.Final]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
[2018-03-05T09:02:43,292][WARN ][io.netty.channel.AbstractChannel] Force-closing a channel whose registration task was not accepted by an event loop: [!!!io.netty.channel.socket.nio.NioSocketChannel@6c6705af=>java.lang.NoClassDefFoundError:Could not initialize class io.netty.buffer.ByteBufUtil$HexUtil!!!]
java.util.concurrent.RejectedExecutionException: event executor terminated
    at io.netty.util.concurrent.SingleThreadEventExecutor.reject(SingleThreadEventExecutor.java:821) ~[netty-all-4.1.18.Final.jar:4.1.18.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor.offerTask(SingleThreadEventExecutor.java:327) ~[netty-all-4.1.18.Final.jar:4.1.18.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor.addTask(SingleThreadEventExecutor.java:320) ~[netty-all-4.1.18.Final.jar:4.1.18.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor.execute(SingleThreadEventExecutor.java:746) ~[netty-all-4.1.18.Final.jar:4.1.18.Final]
    at io.netty.channel.AbstractChannel$AbstractUnsafe.register(AbstractChannel.java:479) [netty-all-4.1.18.Final.jar:4.1.18.Final]
    at io.netty.channel.SingleThreadEventLoop.register(SingleThreadEventLoop.java:80) [netty-all-4.1.18.Final.jar:4.1.18.Final]
    at io.netty.channel.SingleThreadEventLoop.register(SingleThreadEventLoop.java:74) [netty-all-4.1.18.Final.jar:4.1.18.Final]
    at io.netty.channel.MultithreadEventLoopGroup.register(MultithreadEventLoopGroup.java:86) [netty-all-4.1.18.Final.jar:4.1.18.Final]
    at io.netty.bootstrap.ServerBootstrap$ServerBootstrapAcceptor.channelRead(ServerBootstrap.java:255) [netty-all-4.1.18.Final.jar:4.1.18.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-all-4.1.18.Final.jar:4.1.18.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-all-4.1.18.Final.jar:4.1.18.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-all-4.1.18.Final.jar:4.1.18.Final]
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359) [netty-all-4.1.18.Final.jar:4.1.18.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-all-4.1.18.Final.jar:4.1.18.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-all-4.1.18.Final.jar:4.1.18.Final]
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935) [netty-all-4.1.18.Final.jar:4.1.18.Final]
    at io.netty.channel.nio.AbstractNioMessageChannel$NioMessageUnsafe.read(AbstractNioMessageChannel.java:93) [netty-all-4.1.18.Final.jar:4.1.18.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645) [netty-all-4.1.18.Final.jar:4.1.18.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580) [netty-all-4.1.18.Final.jar:4.1.18.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497) [netty-all-4.1.18.Final.jar:4.1.18.Final]
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459) [netty-all-4.1.18.Final.jar:4.1.18.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) [netty-all-4.1.18.Final.jar:4.1.18.Final]
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [netty-all-4.1.18.Final.jar:4.1.18.Final]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]

tomsommer commented 6 years ago

Try updating to 8.0.10

wkruse commented 6 years ago

@tomsommer I did try 5.0.10; it's in the very first comment. It doesn't crash at all, it just stops working after the OutOfDirectMemoryError in Netty.

wkruse commented 6 years ago

@praseodym I have been running logstash-input-beats (5.0.6) with -XX:+ExitOnOutOfMemoryError for two days now, constantly overloading it. It keeps crashing every 10 minutes or so; this combination seems to work for me.

sanky186 commented 6 years ago

I am getting the same behavior on version 6.x:

[2018-04-02T16:14:47,537][INFO ][org.logstash.beats.BeatsHandler] [local: 10.16.11.222:5044, remote: 10.16.11.67:42102] Handling exception: failed to allocate 83886080 byte(s) of direct memory (used: 4201761716, max: 4277534720)

paltryeffort commented 6 years ago

Hi, I'm running logstash 6.2.3 with logstash-input-beats (5.0.10) and I have the same issue.

[2018-04-09T07:21:08,066][WARN ][logstash.inputs.beats    ] Error when creating a connection {:exception=>"failed to allocate 16777216 byte(s) of direct memory (used: 2097152000, max: 2112618496)"}
[2018-04-09T07:21:08,066][WARN ][org.logstash.beats.Server] Exception caught in channel initializer io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 16777216 byte(s) of direct memory (used: 2097152000, max: 2112618496)

sanky186 commented 6 years ago

I got some useful advice in this thread: https://github.com/elastic/logstash/issues/4781#issuecomment-377987836

Also, please post an update on whether upgrading to version 8 works for you.

packetrevolt commented 6 years ago

I'm running Logstash 6.2.3 with logstash-input-beats 5.0.11 and facing the same issue.

I've enabled both -XX:+ExitOnOutOfMemoryError and -XX:MaxDirectMemorySize, and this helps (it doesn't crash every 30 minutes anymore), but it still eventually slows down to a crawl. Sometimes it takes a few hours, sometimes up to a few days. The only resolution is to restart Logstash.
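
For anyone trying the same thing, both flags can go into LS_JAVA_OPTS (or config/jvm.options); a sketch, where the 2g direct-memory cap is only an illustrative value and not a recommendation:

LS_JAVA_OPTS: -XX:+ExitOnOutOfMemoryError -XX:MaxDirectMemorySize=2g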

A few of the high-volume Filebeat agents have been switched to Redis because they kept crashing when sending to Logstash. Their Filebeat logs kept repeating "connection was reset" errors even though the rest of the Beats agents were still working fine. These high-volume Filebeat agents average around 5k events/s and 10 Mbps uncompressed each.

The problem as experienced: after a period of time, the event rate falls off to around 1% of the previous rate. If a particular Filebeat agent is affected, that one agent drops to 1% of its previous event rate while everything else keeps operating at full speed. If the logstash-input-beats plugin is affected, the entire Logstash process (including other inputs) slows down to 1% of the previous event rate.

All of this started with logstash-input-beats 5.0.6; previous versions worked perfectly.

packetrevolt commented 6 years ago

We have switched back to logstash-input-beats v5.0.6 on Logstash 6.2.3. None of the newer versions (5.0.8-5.0.14) worked; all of them would eventually OOM and slow down the whole Logstash process and all other input plugins until Logstash eventually crashed.

Increasing the JVM heap size caused Logstash to OOM and crash sooner; reducing it made the OOM and crash much more gradual. Eventually it would OOM and crash, which caused the various Beats to queue up; once Logstash was restarted they would start flushing their queues, which made Logstash OOM and crash sooner again.

Message load is about 10k events/s per Logstash Beats input, mostly Winlogbeat.

robbavey commented 6 years ago

@packetrevolt Sorry to hear that.

How many Beats nodes do you have, and what is the typical (uncompressed) message size? In the Beats configuration, what are the values for bulk_max_size and pipelining? Reducing those can help, particularly when large uncompressed messages are being sent.
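
For reference, those settings live under the Logstash output section of the Beats configuration; a sketch with placeholder values (the host and the numbers below are illustrative, not recommendations):

output.logstash:
  hosts: ["logstash:5044"]   # placeholder endpoint
  bulk_max_size: 1024        # smaller batches per request
  pipelining: 1              # fewer batches in flight per connection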

packetrevolt commented 6 years ago

@robbavey

~800 beats nodes

bulk_max_size and pipelining are at their defaults. I tried reducing them, but the higher-volume Beats then had trouble keeping up with the message rate and would fall behind and never catch back up. Some of the higher-volume Beats easily reach 1-2k events/s each.

The largest-volume Beats (bursts of 5k events/s and more, mostly Domain Controllers and IDS sensors) had to be converted to sending to Redis instead of Logstash, because they would continually fall behind and/or crash Logstash once logstash-input-beats was newer than 5.0.6. Upgrading logstash-input-beats was the only difference: same Beats version and same config on both the Logstash and Beats sides.

Compression is at the default, level 3 I think.

praseodym commented 6 years ago

Having to tune the Beats sending-side sounds like a terrible workaround considering that the protocol supports backpressure, so the server should be able to indicate when it is too busy to handle the workload. Also, in large production environments with hundreds of Beats nodes deployed it can be challenging to reconfigure them all.

packetrevolt commented 6 years ago

This is the pattern I see. This screenshot is from Logstash 6.2.3 and logstash-input-beats 5.0.14. After a service restart, the event rate stays high and the latency stays low for 4-6 hours, after which the event rate starts dropping and the latency starts increasing. The only way to recover is to restart Logstash.

This Logstash server has zero filters, it receives from beats/tcp/udp inputs and dumps into Redis (to be processed by a different Logstash server).

[screenshot omitted]

yongfa365 commented 5 years ago

Same problem: old-gen memory grows until it stops working, and _node/stats does not respond.

Not dead, but not running either.

rwaweber commented 5 years ago

Sorry for inadvertently +1'ing this, but is there any chance we might get a look at this sometime soon?

This problem still exists in the latest version, and the only real way to keep our pipelines stable is to do what @packetrevolt suggested and simply restart Logstash when memory usage gets high, which is also not an ideal workaround. More than happy to provide diagnostic information as needed.

nitharsandvag commented 4 years ago

Same problem on Logstash 7.4.0 and logstash-input-beats-6.0.1

gwsales commented 4 years ago

Having the same problem here.

nitharsandvag commented 4 years ago

We got the issue fixed! It was a problem with VMware: the allocated RAM was not actually available on the VMware host. We now allocate a fixed RAM size to the VM and it is working fine. Hope this helps others solve the issue.

luann-trend commented 3 years ago

We have experienced the same issue with Logstash 7.4.2 and Beats input plugin version 6.0.3. Logstash runs in Docker, with 4 GB of heap and 6 GB of memory allocated to the container pod.

KowMangler commented 1 year ago

Having the exact same experience with 8.6.1; we upgraded to 8.8.1 crossing our fingers that it was fixed. It is not. We hit this issue after about 20 hours of uptime.

[screenshot omitted]

[INFO ][org.logstash.beats.BeatsHandler][main][lumberjack-input] [local: 10.42.21.75:8001, remote: 10.42.3.0:50643] Handling exception: java.lang.OutOfMemoryError: Cannot reserve 16777216 bytes of direct buffer memory (allocated: 4279361544, limit: 4294967296) (caused by: java.lang.OutOfMemoryError: Cannot reserve 16777216 bytes of direct buffer memory (allocated: 4279361544, limit: 4294967296))

If the architecture matters: we are using Logstash (2000+ instances) -> LB -> Logstash (30 instances) -> Elasticsearch, where the receiving Logstash instances are behind a load balancer and all Logstash instances run in Kubernetes.

This was all working fine for about 3 days at this volume; now it suddenly can't go 24 hours without the problem arising. Prior to a roughly 3% increase in event volume (we picked up more infrastructure to scrape), we had about 30 days of the Logstash instances emitting 45k-50k events/s collectively. There has been no noticeable increase in events/s since then. Based on this thread, we may switch to the http input instead of beats.

For comparison, telemetry for a Logstash instance that's healthy: [screenshot omitted]

Prior to the bad behavior, we find the Logstash instances have 400-500 log lines like the following:

[INFO ][org.logstash.beats.BeatsHandler][main][lumberjack-input] [local: 10.42.21.75:8888, remote: 10.42.3.0:4322] Handling exception: java.net.SocketException: Connection reset (caused by: java.net.SocketException: Connection reset)

These appear before we start seeing the OOM errors.

andsel commented 1 year ago

To minimize the possibility of going OOM, there are some strategies that could be implemented in the Beats input to mitigate this problem.