nilavalagansugumaran closed this issue 4 years ago.
Can you try with boot 2.1.8 and Greenwich.SR3?
Thanks a lot for getting back to me @spencergibb . I will try the versions suggested and get back with the observations
@spencergibb - We upgraded the gateway to the suggested versions (Boot 2.1.8 and Greenwich.SR3). We started seeing stall issues after 20 to 25 minutes of running it in production. The stack dump is as follows and now comes with an additional FastThreadLocalRunnable entry -
Thread Name:reactor-http-epoll-5 ID:4971 Time:Wed Sep 18 09:30:52 BST 2019 State:RUNNABLE Priority:5
io.netty.channel.epoll.Native.epollWait0(Native Method)
io.netty.channel.epoll.Native.epollWait(Native.java:96)
io.netty.channel.epoll.EpollEventLoop.epollWait(EpollEventLoop.java:276)
io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:305)
io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918)
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
java.lang.Thread.run(Thread.java:748)
We also still see the LEAK: ByteBuf.release() errors, but no out-of-memory errors, no open file descriptor issues and no direct memory issues.
Please let us know if you need any additional information.
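For what it is worth, one way to get more detail on those LEAK: ByteBuf.release() reports is to raise Netty's leak-detection level. A minimal sketch, assuming a Gradle build with the Spring Boot plugin (the system property itself is standard Netty and can equally be passed on any JVM command line):
bootRun {
    // 'paranoid' tracks every buffer and prints full access records with each LEAK report,
    // at a noticeable performance cost, so it is only suitable for a test environment
    jvmArgs = ['-Dio.netty.leakDetection.level=paranoid']
}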
Stall connections -
@spencergibb - Is this issue due to the Transfer-Encoding header in the response being set to chunked? There is a lot of discussion around Netty and thread stalls in https://github.com/tomakehurst/wiremock/issues/914 . Please can you advise? We can see the response has Transfer-Encoding set to chunked when returned from the gateway-server. Request headers don't have Transfer-Encoding.
Hi @spencergibb / @violetagg / @smaldini - please can you help us with this? We waited for stable versions as our attempts last year failed. We are not able to release the gateway this time around either. Most of the problems that we saw last year got resolved, but the new one with stalled connections/threads is stopping us from going live. Any help is much appreciated.
Hi @spencergibb / @violetagg / @smaldini - We also tried deploying again by overriding the Netty versions as below, with Spring Boot 2.1.9.RELEASE and Spring Cloud dependencies Greenwich.SR3, but we still have stalled threads -
compile "io.netty:netty-buffer:4.1.42.Final"
compile "io.netty:netty-codec:4.1.42.Final"
compile "io.netty:netty-codec-http2:4.1.42.Final"
compile "io.netty:netty-codec-http:4.1.42.Final"
compile "io.netty:netty-codec-socks:4.1.42.Final"
compile "io.netty:netty-common:4.1.42.Final"
compile "io.netty:netty-handler:4.1.42.Final"
compile "io.netty:netty-handler-proxy:4.1.42.Final"
compile "io.netty:netty-resolver:4.1.42.Final"
compile "io.netty:netty-transport:4.1.42.Final"
compile "io.netty:netty-transport-native-epoll:4.1.42.Final"
compile "io.netty:netty-transport-native-unix-common:4.1.42.Final"
compile "io.projectreactor.addons:reactor-extra:3.3.0.RELEASE"
compile "io.projectreactor:reactor-core:3.3.0.RELEASE"
Stack Dump -
Thread Name:reactor-http-epoll-7 ID:3772 Time:Tue Oct 08 15:21:52 BST 2019 State:RUNNABLE Priority:5
io.netty.channel.epoll.Native.epollWait0(Native Method)
io.netty.channel.epoll.Native.epollWait(Native.java:101)
io.netty.channel.epoll.EpollEventLoop.epollWait(EpollEventLoop.java:304)
io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:355)
io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044)
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
java.lang.Thread.run(Thread.java:748)
Hi @spencergibb / @violetagg / @smaldini - Please can you point us in the right direction on this issue? We also see this issue in an EKS cluster, which eliminates the OS as a possible cause, as someone indicated in the thread - https://developer.jboss.org/thread/274758?_sscc=t
For the last two months we have been trying a canary deployment approach to take spring-gateway to production, as we tried last year, but we are failing due to stalled threads.
Running 12 zuul instances
Running 1 spring-gateway instance
We start seeing stalled threads between 30 and 60 minutes in production. Over time these stalled threads accumulate and could stop the application/server. Stalls occur randomly and across all routes. The stack dump is the same every time and on every thread.
As mentioned in my previous comment, upgrading the Netty or Spring Boot versions didn't help.
Any help is much appreciated.
@nilavalagansugumaran Can you try the following:
Reactor Netty 0.8.13.BUILD-SNAPSHOT
Netty 4.1.43.Final
Spring Boot 2.1.9.RELEASE
Spring Cloud Gateway 2.1.4.BUILD-SNAPSHOT
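As a side note, BUILD-SNAPSHOT artifacts such as these are not published to Maven Central, so the build normally needs the Spring snapshot (and milestone) repositories as well; a minimal Gradle sketch:
repositories {
    mavenCentral()
    // required for *.BUILD-SNAPSHOT artifacts
    maven { url 'https://repo.spring.io/snapshot' }
    // required for milestone / release-candidate artifacts
    maven { url 'https://repo.spring.io/milestone' }
}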
Thanks a lot @violetagg . I will try these builds and get back to you as soon as possible.
Hi @violetagg - We still have the stall threads issue with similar stack dumps.
Stack Dump
Thread Name:reactor-http-epoll-1 ID:34 Time:Thu Oct 31 14:18:27 GMT 2019 State:RUNNABLE Priority:5
io.netty.channel.epoll.Native.epollWait(Native Method)
io.netty.channel.epoll.Native.epollWait(Native.java:126)
io.netty.channel.epoll.Native.epollWait(Native.java:119)
io.netty.channel.epoll.EpollEventLoop.epollWaitNoTimerChange(EpollEventLoop.java:317)
io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:375)
io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1050)
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
java.lang.Thread.run(Thread.java:748)
Stack Dump
Thread Name:reactor-http-epoll-2 ID:35 Time:Thu Oct 31 14:18:27 GMT 2019 State:RUNNABLE Priority:5
io.netty.channel.epoll.Native.epollWait(Native Method)
io.netty.channel.epoll.Native.epollWait(Native.java:126)
io.netty.channel.epoll.Native.epollWait(Native.java:119)
io.netty.channel.epoll.EpollEventLoop.epollWaitNoTimerChange(EpollEventLoop.java:317)
io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:375)
io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1050)
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
java.lang.Thread.run(Thread.java:748)
@nilavalagansugumaran Can you try the following:
Spring Boot 2.2.6.RELEASE
Spring Cloud Gateway 2.2.2.RELEASE (Hoxton.SR3)
This will include the latest Reactor Netty and Netty.
@violetagg anything else he should try?
@violetagg anything else he should try?
Let's see whether the issue is there when using the latest releases.
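For reference, moving to that combination usually comes down to bumping the Boot plugin and importing the matching Spring Cloud release-train BOM rather than pinning individual artifacts; a sketch, assuming a Gradle build with the io.spring.dependency-management plugin:
// Spring Boot Gradle plugin at 2.2.6.RELEASE, plus the Hoxton.SR3 release train BOM
dependencyManagement {
    imports {
        mavenBom 'org.springframework.cloud:spring-cloud-dependencies:Hoxton.SR3'
    }
}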
If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 7 days this issue will be closed.
Closing due to lack of requested feedback. If you would like us to look at this issue, please provide the requested information and we will re-open the issue.
We were not able to get this resolved or addressed; sadly we had to revert back to Zuul.
We are seeing the exact same issue under heavy load, and we are also using AppDynamics (maybe the instrumentation is the issue).
Oracle Java 11.0.9
[INFO] | +- org.springframework.boot:spring-boot-starter-reactor-netty:jar:2.3.3.RELEASE:compile
[INFO] | | - io.projectreactor.netty:reactor-netty:jar:0.9.11.RELEASE:compile
[INFO] | | +- io.netty:netty-codec-http:jar:4.1.51.Final:compile
[INFO] | | +- io.netty:netty-codec-http2:jar:4.1.51.Final:compile
[INFO] | | +- io.netty:netty-handler-proxy:jar:4.1.51.Final:compile
[INFO] | | | - io.netty:netty-codec-socks:jar:4.1.51.Final:compile
[INFO] | | - io.netty:netty-transport-native-epoll:jar:linux-x86_64:4.1.51.Final:compile
[INFO] | | - io.netty:netty-transport-native-unix-common:jar:4.1.51.Final:compile
We are unable to reproduce in test environments.
@amirkovic I don't know whether this is related but you should upgrade at least to Reactor Netty 0.9.14.RELEASE. There is a regression in 0.9.11.RELEASE. https://github.com/reactor/reactor-netty/releases/tag/v0.9.14.RELEASE
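Where the version is managed by the Spring Boot BOM, an explicitly declared version normally wins, so the upgrade can be as small as pinning the artifact directly; a sketch in the Gradle style used earlier in this thread (a Maven build would override the corresponding managed version instead):
// explicitly declared versions take precedence over the Boot-managed reactor-netty version
compile "io.projectreactor.netty:reactor-netty:0.9.14.RELEASE"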
Thank you, we will try that and report results.
@violetagg We did upgrade our reactor-netty version to 0.9.14.RELEASE, but we still notice stalled requests via AppD monitoring.
Below is the stack trace from the monitoring log.
[reactor-http-epoll-2] 15 Mar 2021 12:12:25,156 WARN AgentErrorProcessor - Agent error occurred, [name,transformId]=[com.singularity.tm.ReactorNettyAsyncEntryStartInterceptorV08x - java.lang.NullPointerException,148]
[reactor-http-epoll-2] 15 Mar 2021 12:12:25,156 ERROR ReactorNettyAsyncEntryStartInterceptorV08x - Error in TEP : onMethodBeginTracked for : MethodExecutionEnvironment{ invokedObject='{ Class='reactor.netty.http.server.HttpServerOperations', Hash code=1569420149 }', className='reactor.netty.http.server.HttpServerOperations', methodName='onInboundNext', paramValues=[ { Class='io.netty.channel.DefaultChannelHandlerContext', Hash code=1671796612 }, { Class='io.netty.handler.codec.http.DefaultHttpRequest', Hash code=1548465446 } ]}, transactionContext local copy: CurrentTransactionContext[businessTransactions=Business Transaction [/actuator/health[530152]] Entry Point Type [SERVLET] Component ID [47302], entryPointType=SERVLET, currentExitCall=[NULL], hashcode=1895227051], transactionContext fetched from btContext CurrentTransactionContext[businessTransactions=Business Transaction [/actuator/health[530152]] Entry Point Type [SERVLET] Component ID [47302], entryPointType=SERVLET, currentExitCall=[NULL], hashcode=1895227051]
java.lang.NullPointerException: null
at com.singularity.ee.agent.appagent.services.transactionmonitor.eum.EUMMetadataInjector.getUserAgentString(EUMMetadataInjector.java:126) ~[?:?]
at com.singularity.ee.agent.appagent.services.transactionmonitor.eum.EUMMetadataInjector.okayToWriteCookie(EUMMetadataInjector.java:89) ~[?:?]
at com.singularity.ee.agent.appagent.services.transactionmonitor.eum.EUMContext.injectMetadata(EUMContext.java:531) ~[?:?]
at com.singularity.ee.agent.appagent.services.transactionmonitor.eum.EUMContext.startEndUserRequest(EUMContext.java:150) ~[?:?]
at com.singularity.ee.agent.appagent.services.transactionmonitor.eum.EUMContext.notifyBTStart(EUMContext.java:264) ~[?:?]
at com.singularity.ee.agent.appagent.services.transactionmonitor.eum.EUMContext.notifyBTStart(EUMContext.java:242) ~[?:?]
at com.singularity.ee.agent.appagent.services.transactionmonitor.common.TransactionDataHandler.init(TransactionDataHandler.java:122) ~[?:?]
at com.singularity.ee.agent.appagent.services.transactionmonitor.common.TransactionDataHandler.initTransaction(TransactionDataHandler.java:131) ~[?:?]
at com.singularity.ee.agent.appagent.services.transactionmonitor.common.ATransactionEntryPointInterceptor._onMethodBeginTracked(ATransactionEntryPointInterceptor.java:292) ~[?:?]
at com.singularity.ee.agent.appagent.services.transactionmonitor.common.ATransactionEntryPointInterceptor.onMethodBeginTracked(ATransactionEntryPointInterceptor.java:166) ~[?:?]
at com.singularity.ee.agent.appagent.services.transactionmonitor.http.servlet.ServletInterceptor.onMethodBeginTracked(ServletInterceptor.java:129) ~[?:?]
at com.singularity.ee.agent.appagent.services.transactionmonitor.async2.AAsyncHttpEntryInterceptor.onMethodBeginTracked(AAsyncHttpEntryInterceptor.java:63) ~[?:?]
at com.singularity.ee.agent.appagent.services.transactionmonitor.http.correlation.webflux.reactor.netty.ReactorNettyAsyncEntryStartInterceptorV08x.onMethodBeginTracked(ReactorNettyAsyncEntryStartInterceptorV08x.java:104) ~[?:?]
at com.singularity.ee.agent.appagent.services.bciengine.AFastTrackedMethodInterceptor.onMethodBegin(AFastTrackedMethodInterceptor.java:52) ~[appagent-boot.jar:?]
at com.singularity.ee.agent.appagent.kernel.bootimpl.FastMethodInterceptorDelegatorImpl.safeOnMethodBeginNoReentrantCheck(FastMethodInterceptorDelegatorImpl.java:370) ~[?:?]
at com.singularity.ee.agent.appagent.kernel.bootimpl.FastMethodInterceptorDelegatorImpl.safeOnMethodBegin(FastMethodInterceptorDelegatorImpl.java:295) ~[?:?]
at com.singularity.ee.agent.appagent.entrypoint.bciengine.FastMethodInterceptorDelegatorBoot.safeOnMethodBegin(FastMethodInterceptorDelegatorBoot.java:52) ~[?:Server Agent #20.4.0.29862 v20.4.0 GA compatible with 4.4.1.0 r23226cf913828e244d2d32691aac97efccf39724 release/20.4.0]
at reactor.netty.http.server.HttpServerOperations.onInboundNext(HttpServerOperations.java) ~[?:?]
at reactor.netty.channel.ChannelOperationsHandler.channelRead(ChannelOperationsHandler.java:96) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
at reactor.netty.http.server.HttpTrafficHandler.channelRead(HttpTrafficHandler.java:172) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
at reactor.netty.http.server.AccessLogHandler.channelRead(AccessLogHandler.java:51) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:436) ~[?:?]
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:324) ~[?:?]
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:296) ~[?:?]
at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:251) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[?:?]
at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:792) ~[?:?]
at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:475) ~[?:?]
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378) ~[?:?]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[?:?]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[?:?]
at java.lang.Thread.run(Thread.java:834) [?:?]
@aravinthr7 are you using spring cloud gateway?
@spencergibb yes spring-cloud-gateway with netflix ribbon
spring-cloud-dependencies: Hoxton.SR10
spring-boot-dependencies: 2.3.3.RELEASE
reactor-netty: 0.9.14.RELEASE
@spencergibb @violetagg Any suggestion/update on the above issue reported by @aravinthr7 and @amirkovic?
@amyc997 I cannot comment on this NullPointerException
java.lang.NullPointerException: null
at com.singularity.ee.agent.appagent.services.transactionmonitor.eum.EUMMetadataInjector.getUserAgentString(EUMMetadataInjector.java:126) ~[?:?]
Also, in the stack trace I see ReactorNettyAsyncEntryStartInterceptorV08x (com.singularity.ee.agent.appagent.services.transactionmonitor.http.correlation.webflux.reactor.netty.ReactorNettyAsyncEntryStartInterceptorV08x), so I don't know what this interceptor is supposed to do, whether it is applicable only to Reactor Netty 0.8.x, and whether it can be used with Reactor Netty 0.9.x/1.0.x.
@spencergibb @violetagg I see the issue is closed. Any update or conclusion on the above issue reported by @nilavalagansugumaran?
We have a similar problem with spring cloud gateway. We use it in a k8s environment. Each time we scale the pod up to another instance, the new instance breaks immediately under the load, i.e. one service can handle ~700 requests/sec, so we defined the k8s HPA to scale the pod up at 650 req/sec.
* When we load test with ~600 req/sec one pod can handle it all, but the moment we increase the load to ~700 req/sec, k8s spawns another pod and this new pod breaks immediately under a load of ~350 req/sec.
* Numbers depend on the resources allocated to the pod, but the result is still the same.
We are attempting to release the gateway to replace ZUUL, with a few pre/post filters that validate request/response headers and process the request and response bodies. We failed last year with spring gateway 1.X versions due to Netty memory leak, open file descriptor and direct memory issues. Recently we upgraded the gateway to the latest versions as below,
We no longer see direct-memory issues and open descriptor issues. The Netty memory leak issue is still there, but we think this is related to https://github.com/spring-cloud/spring-cloud-gateway/issues/810, which looks manageable as the application is not failing with out-of-memory issues as before. However, we are unable to release the latest version to production due to stuck/stalled connections and requests. This issue is new and we had not seen it last year. Requests are stuck/stalled for a very long time with the below stack dump (captured via AppDynamics). Attached screenshots show some of the requests stalled for a long time. The issue is not specific to a single route or a single HTTP operation but occurs randomly every 10/20/30 minutes, and the stalled requests keep accumulating. We tried tuning by setting thread counts and response/connect timeouts, but none of it has helped so far.
Please can you advise if you have seen these stalls before and any remedies that would help us release spring-gateway to production on this attempt? Also, can you advise if we are missing or have wrongly configured any timeout parameters?
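For reference, depending on the Gateway version, the Reactor Netty HttpClient timeouts used by the gateway are exposed as configuration properties along these lines (a sketch only; exact property names and availability vary by release, so the reference documentation for the version in use should be checked):
# connect timeout in milliseconds, response timeout as a Duration
spring.cloud.gateway.httpclient.connect-timeout=5000
spring.cloud.gateway.httpclient.response-timeout=10s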
Our application handles high traffic (> 600 requests per minute on each of 10 ZUUL instances) and supports all key HTTP operations. We are unable to replicate the issue locally with a simple demo gateway application - we generated load via jmeter and artificially created timeouts and latency with the help of toxiproxy. Below is our spring-gateway configuration. Stall response times -