open-telemetry / opentelemetry-java-instrumentation

OpenTelemetry auto-instrumentation and instrumentation libraries for Java
https://opentelemetry.io
Apache License 2.0

Memory leak in netty4.1 io.netty.channel.socket.nio.NioSocketChannel #11942

Closed: starsliao closed this issue 3 months ago

starsliao commented 3 months ago

I have also run into the memory leak with Netty. I am using version 2.5.0 of autoinstrumentation-java and see the same problem.

In long-running Java microservices (running for more than 20 days with a high volume of requests), Java heap memory becomes insufficient. Many microservices are experiencing this issue, and some of them are not even using Netty.

I previously had the same issue when using version 2.3.0 of autoinstrumentation-java.

This is the latest Java heap dump file.
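For reference, a dump like this can also be captured from inside the running JVM using the standard HotSpot diagnostic MXBean; a minimal sketch, with an example output path:

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDumpExample {
  public static void main(String[] args) throws Exception {
    // Obtain the HotSpot diagnostic MXBean from the platform MBean server.
    HotSpotDiagnosticMXBean diagnostic =
        ManagementFactory.newPlatformMXBeanProxy(
            ManagementFactory.getPlatformMBeanServer(),
            "com.sun.management:type=HotSpotDiagnostic",
            HotSpotDiagnosticMXBean.class);

    // live = true dumps only objects that are still reachable; the path is an example.
    diagnostic.dumpHeap("/tmp/service-heap.hprof", true);
  }
}
```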

I am an operations engineer, and this is the phenomenon I observed. Below is the screenshot information provided by my development colleagues.

[WeCom screenshots from the development team: heap dump analysis]

Originally posted by @starsliao in https://github.com/open-telemetry/opentelemetry-java-instrumentation/issues/11399#issuecomment-2267608218

starsliao commented 3 months ago

[Screenshots]

laurit commented 3 months ago

Is this a custom HTTP server implemented on top of Netty, or are you using some framework? As far as I can tell there are a couple of long-running connections that have processed a lot of requests. Connections that don't serve many requests shouldn't cause this issue, as the stale data would get cleaned up when the connection is closed. The issue is probably in https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/ade7c8072031a5a7fb284695b07059db6949ac1a/instrumentation/netty/netty-4.1/library/src/main/java/io/opentelemetry/instrumentation/netty/v4_1/internal/server/HttpServerResponseTracingHandler.java#L53, where server contexts are removed and spans are ended only for certain inputs to the write method. It would help to know what the server code is sending to the write method.
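For illustration, a minimal sketch of that pattern, with simplified class and helper names (this is not the actual instrumentation source): per-request server state is cleaned up only when the outbound message marks the end of an HTTP response, so a request that never gets a response leaves its state behind.

```java
import io.netty.channel.Channel;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelOutboundHandlerAdapter;
import io.netty.channel.ChannelPromise;
import io.netty.handler.codec.http.LastHttpContent;

// Simplified sketch, not the actual instrumentation source.
public class ResponseTracingHandlerSketch extends ChannelOutboundHandlerAdapter {

  @Override
  public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise promise) {
    // Cleanup only happens for outbound messages that end an HTTP response
    // (a FullHttpResponse also implements LastHttpContent). Other writes, or
    // no write at all as with a dropped heartbeat, never reach this path.
    if (msg instanceof LastHttpContent) {
      endServerSpanAndRemoveContext(ctx.channel());
    }
    ctx.write(msg, promise);
  }

  // Hypothetical helper standing in for the real bookkeeping that ends the
  // server span and removes the stored context for this channel.
  private void endServerSpanAndRemoveContext(Channel channel) {}
}
```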

starsliao commented 3 months ago

@laurit Thank you for your answer. After checking with our development colleagues, it was confirmed that this microservice uses Spring Boot with Tomcat as the web container and does not use Netty.

However, most of our microservices communicate with xxl-job over a long-lived connection that uses heartbeat detection to keep the connection alive. xxl-job uses Netty, so I suspect this scenario is preventing the microservice's memory from being released.

Could opentelemetry-java-instrumentation be optimized for such a long connection scenario? Or are there any other ways to avoid this problem?

laurit commented 3 months ago

I think it is actually xxl-remoting, not xxl-job, that triggers the issue. What version of xxl-remoting are you using?

> Could opentelemetry-java-instrumentation be optimized for such a long connection scenario? Or are there any other ways to avoid this problem?

Sure, we gladly accept pull requests that fix issues.

laurit commented 3 months ago

It is actually called xxl-rpc, not xxl-remoting.

laurit commented 3 months ago

I think this happens because of https://github.com/xuxueli/xxl-rpc/blob/eeaa1bd7fc8f2249de13f971dda4f6689d66f318/xxl-rpc-core/src/main/java/com/xxl/rpc/core/remoting/net/impl/netty_http/server/NettyHttpServerHandler.java#L85-L88: there is no response for heartbeat requests. Our assumption is that every request has a matching response; when there is a request without a response, we miss the cleanup.
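For illustration, a hedged sketch of the handler behaviour being described, with simplified, hypothetical names (not the actual xxl-rpc source): heartbeat requests are consumed without ever writing a response, so on a long-lived connection each one leaves uncleaned per-request state behind in the instrumentation.

```java
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.SimpleChannelInboundHandler;
import io.netty.handler.codec.http.FullHttpRequest;

// Simplified sketch, not the actual xxl-rpc code.
public class HeartbeatDroppingHandlerSketch
    extends SimpleChannelInboundHandler<FullHttpRequest> {

  // Hypothetical marker used to recognise heartbeat ("beat") requests.
  private static final String BEAT_MARKER = "BEAT_PING_PONG";

  @Override
  protected void channelRead0(ChannelHandlerContext ctx, FullHttpRequest request) {
    if (isHeartbeat(request)) {
      // The heartbeat is swallowed here: no ctx.writeAndFlush(...) ever runs,
      // so no HTTP response goes back on this connection for this request.
      return;
    }
    // ... normal path: decode the RPC call, invoke it, write a FullHttpResponse ...
  }

  // Hypothetical check; the real code inspects the decoded RPC request instead.
  private boolean isHeartbeat(FullHttpRequest request) {
    return request.uri().contains(BEAT_MARKER);
  }
}
```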

starsliao commented 3 months ago

> I think this happens because of https://github.com/xuxueli/xxl-rpc/blob/eeaa1bd7fc8f2249de13f971dda4f6689d66f318/xxl-rpc-core/src/main/java/com/xxl/rpc/core/remoting/net/impl/netty_http/server/NettyHttpServerHandler.java#L85-L88: there is no response for heartbeat requests. Our assumption is that every request has a matching response; when there is a request without a response, we miss the cleanup.

Thank you for your analysis. I will relay your description to our development team shortly.

We tried restarting the XXL-Job service. After doing so, the heap memory of the microservices that were experiencing leaks was released.

Microservices with memory leaks, weekly memory usage trend chart: [screenshot]

Before: [screenshot]

After: [screenshot]