Closed: starsliao closed this issue 3 months ago.
Is this a custom HTTP server implemented on top of Netty, or are you using some framework? As far as I can tell there are a couple of long-running connections that have processed a lot of requests. Connections that don't serve too many requests shouldn't cause this issue, as the stale data would get cleaned up when the connection is closed. Probably the issue is in https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/ade7c8072031a5a7fb284695b07059db6949ac1a/instrumentation/netty/netty-4.1/library/src/main/java/io/opentelemetry/instrumentation/netty/v4_1/internal/server/HttpServerResponseTracingHandler.java#L53 where server contexts are removed and spans are ended only for certain inputs to the write method. It would help to know what the server code is sending to the write method.
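To make that concrete, here is a simplified sketch of the kind of per-channel bookkeeping involved; it is not the actual instrumentation code, and the class, attribute, and queue names are made up for illustration:

```java
import io.netty.channel.ChannelDuplexHandler;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelPromise;
import io.netty.handler.codec.http.HttpRequest;
import io.netty.handler.codec.http.LastHttpContent;
import io.netty.util.AttributeKey;

import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Illustrative sketch only (not the actual OpenTelemetry handler): request state is
 * queued per channel on read and only dequeued when a LastHttpContent is written back.
 * If the application never writes a response for a request, its entry stays in the
 * queue for the lifetime of the (long-lived) connection.
 */
public class TracingStateSketch extends ChannelDuplexHandler {

  // Hypothetical per-channel attribute holding the pending request state.
  private static final AttributeKey<Deque<Object>> PENDING =
      AttributeKey.valueOf("pending-request-contexts");

  @Override
  public void channelRead(ChannelHandlerContext ctx, Object msg) throws Exception {
    if (msg instanceof HttpRequest) {
      Deque<Object> pending = ctx.channel().attr(PENDING).get();
      if (pending == null) {
        pending = new ArrayDeque<>();
        ctx.channel().attr(PENDING).set(pending);
      }
      pending.add(msg); // stand-in for "start a span and remember its context"
    }
    super.channelRead(ctx, msg);
  }

  @Override
  public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise promise)
      throws Exception {
    // Cleanup only happens for writes that end the response (LastHttpContent).
    // A request that never gets a response never reaches this branch, so its
    // entry accumulates on the channel until the connection closes.
    if (msg instanceof LastHttpContent) {
      Deque<Object> pending = ctx.channel().attr(PENDING).get();
      if (pending != null) {
        pending.poll(); // stand-in for "end the span and drop the context"
      }
    }
    super.write(ctx, msg, promise);
  }
}
```

In a pattern like this, every inbound request adds an entry but only an outbound LastHttpContent removes one, so a request that is never answered leaves its entry behind for as long as the connection stays open.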
@laurit Thank you for your answer. After checking with our development colleagues, we confirmed that this microservice uses Spring Boot with Tomcat as the web container and does not use Netty.
However, most of our microservices communicate with xxl-job over long-lived connections that are kept alive by heartbeat messages, and xxl-job uses Netty. So I suspect this scenario may be preventing the microservices' memory from being released.
Could opentelemetry-java-instrumentation be optimized for such a long-lived connection scenario? Or is there any other way to avoid this problem?
I think it is actually xxl-remoting, not xxl-job, that triggers the issue. What version of xxl-remoting are you using?
> Could opentelemetry-java-instrumentation be optimized for such a long-lived connection scenario? Or is there any other way to avoid this problem?
Sure, we gladly accept pull requests that fix issues.
I think this happens because of https://github.com/xuxueli/xxl-rpc/blob/eeaa1bd7fc8f2249de13f971dda4f6689d66f318/xxl-rpc-core/src/main/java/com/xxl/rpc/core/remoting/net/impl/netty_http/server/NettyHttpServerHandler.java#L85-L88 where no response is sent for heartbeat requests. Our assumption is that every request has a matching response; when there is a request without a response, we miss cleaning up.
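Schematically, the problematic server-side pattern looks something like the sketch below; the constant and class names are illustrative and not copied from xxl-rpc:

```java
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.SimpleChannelInboundHandler;
import io.netty.handler.codec.http.FullHttpRequest;
import io.netty.util.CharsetUtil;

/**
 * Hedged sketch of a handler that breaks the "one response per request" assumption:
 * heartbeat requests are handled by simply returning, so nothing is ever written back
 * on the channel for them.
 */
public class HeartbeatSkippingHandler extends SimpleChannelInboundHandler<FullHttpRequest> {

  // Illustrative marker; the real server uses its own beat constant.
  private static final String BEAT_PING = "BEAT_PING";

  @Override
  protected void channelRead0(ChannelHandlerContext ctx, FullHttpRequest request) {
    String body = request.content().toString(CharsetUtil.UTF_8);

    if (BEAT_PING.equals(body)) {
      // Heartbeat: keep the connection alive but send no HTTP response.
      // The instrumentation started tracking this request on the inbound side,
      // and with no outbound write that state is never released.
      return;
    }

    // Normal requests would be dispatched and answered via ctx.writeAndFlush(...)
    // (omitted here), which is where the instrumentation gets a chance to clean up.
  }
}
```

Because nothing is written back for the heartbeat, the outbound tracing handler never sees a response for that request, so the state recorded on the inbound side is only released when the connection closes.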
Thank you for your analysis. I will relay your description to our development team shortly.
We restarted the XXL-Job service. After doing so, the memory of the microservices that were experiencing heap memory leaks was released.
Weekly memory usage trend chart for the microservices with memory leaks (screenshots showing memory usage before and after the restart).
I encountered the memory leak issue with Netty. I am using version 2.5.0 of autoinstrumentation-java and have experienced the same problem.
In long-running Java microservices (running for more than 20 days with a high volume of requests), the Java heap memory becomes insufficient. Many microservices are experiencing this issue, and some of them are not even using Netty.
I previously had the same issue when using version 2.3.0 of autoinstrumentation-java.
This is the latest Java dump file.
I am an operations engineer, and this is the phenomenon I observed. Below are the screenshots provided by my development colleagues.
Originally posted by @starsliao in https://github.com/open-telemetry/opentelemetry-java-instrumentation/issues/11399#issuecomment-2267608218