spring-cloud / spring-cloud-gateway

An API Gateway built on Spring Framework and Spring Boot providing routing and more.
http://cloud.spring.io
Apache License 2.0
4.54k stars 3.33k forks source link

LocalResponseCache leaks memory #2879

Closed MatthiasZerau closed 1 year ago

MatthiasZerau commented 1 year ago

When running Spring cloud Gateway with the LocalResponseCache active out of memory errors occur after after some time the application is running. The OOM happen while Netty trying to allocate native memory, for example:

"2023-02-27 10:39:48,649 [reactor-http-epoll-4] ERROR r.c.p.Operators - Operator called default onErrorDropped"
"reactor.netty.ReactorNetty$InternalNettyException: java.lang.OutOfMemoryError: Cannot reserve 4194304 bytes of direct buffer memory (allocated: 533152340, limit: 536870912)"
"Caused by: java.lang.OutOfMemoryError: Cannot reserve 4194304 bytes of direct buffer memory (allocated: 533152340, limit: 536870912)"
    at java.base/java.nio.Bits.reserveMemory(Unknown Source)
    at java.base/java.nio.DirectByteBuffer.<init>(Unknown Source)
    at java.base/java.nio.ByteBuffer.allocateDirect(Unknown Source)
    at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:701)
    at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:676)
    at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:215)
    at io.netty.buffer.PoolArena.tcacheAllocateNormal(PoolArena.java:197)
    at io.netty.buffer.PoolArena.allocate(PoolArena.java:139)
    at io.netty.buffer.PoolArena.allocate(PoolArena.java:129)
    at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:396)
    at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:188)
    at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:179)
    at io.netty.channel.unix.PreferredDirectByteBufAllocator.ioBuffer(PreferredDirectByteBufAllocator.java:53)
    at io.netty.channel.DefaultMaxMessagesRecvByteBufAllocator$MaxMessageHandle.allocate(DefaultMaxMessagesRecvByteBufAllocator.java:120)
    at io.netty.channel.epoll.EpollRecvByteAllocatorHandle.allocate(EpollRecvByteAllocatorHandle.java:75)
    at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:785)
    at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe$1.run(AbstractEpollChannel.java:425)
    at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:403)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Unknown Source)

Tracking down the issue I came to the conclusion that the LocalresponseCache may be the root of this issue. Once I disabled this filter, the OOM where no longer observable.

Java 17 Spring Boot 3.0.3 Spring Cloud 2022.0.1

A reproducible example can be found here. Just run gradlew build from the command line. The unit test LocalResponseCacheEnabledTests is expected to fail. The log output of the test should contain errors similar to the following one:

2023-03-03T10:28:17.897+01:00  INFO 25868 --- [    Test worker] c.e.s.LocalResponseCacheEnabledTests     : Started LocalResponseCacheEnabledTests in 1.742 seconds (process running for 7.2)
2023-03-03T10:28:18.839+01:00 ERROR 25868 --- [ctor-http-nio-6] i.n.u.ResourceLeakDetector               : LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
Recent access records: 
#1:
    io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:300)
    io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:251)
    io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
    io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
    io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
    io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
    io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
    io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
    io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
    io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
    io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
    io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
    io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
    io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
    io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
    io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    java.base/java.lang.Thread.run(Thread.java:833)
#2:
    io.netty.buffer.AdvancedLeakAwareByteBuf.readRetainedSlice(AdvancedLeakAwareByteBuf.java:107)
    io.netty.handler.codec.http.HttpObjectDecoder.decode(HttpObjectDecoder.java:362)
    io.netty.handler.codec.http.HttpClientCodec$Decoder.decode(HttpClientCodec.java:239)
....
ilozano2 commented 1 year ago

The counter used by Netty has been shadowed by the wrap method returned by the response decorator

https://github.com/spring-cloud/spring-cloud-gateway/blob/main/spring-cloud-gateway-server/src/main/java/org/springframework/cloud/gateway/filter/factory/cache/ResponseCacheManager.java#L101

The leak test works when that line is changed to

return dataBuffer;

so Netty could decrease the counter of the databufffer (using the release() method) instead of using the wrapper, leaving the other one with a count > 1.

Anyway, I am checking the implementation to get some improvement apart from removing the issue. Maybe we can reduce the memory footprint.

ilozano2 commented 1 year ago

Current OSS version with the leak

image

Applying the fix

This chart looks better. You can see a sawtooth shape representing automatic GCs to release heap memory. I've left the app running sometime and the heap memory used decreased back to normal.

image

spencergibb commented 1 year ago

Closing in favor of #2886