jchambers / pushy

A Java library for sending APNs (iOS/macOS/Safari) push notifications
https://pushy-apns.org/
MIT License
1.8k stars 450 forks source link

Possible bug in ApnsClientHandler::channelInactive #857

Closed janzar closed 3 years ago

janzar commented 3 years ago

Hello, this weekend seems that Apple had some problems with pushes. We have seen lot of reset stream, etc. errors, and finally get Full GC on our servers. Looks like ApnsClientHandler::unattachedResponsePromisesByStreamId had not cleaned. From logs had found following suspicious stack traces. Seems that if Http2ConnectionHandler::channelInactive fails with exception, then cleanup in ApnsClientHandler or TokenAuthenticationApnsClientHandler was not done :

Mar 12 23:23:10  [43] io.netty.handler.codec.http2.Http2Exception$StreamException: Stream closed before write could take place
Mar 12 23:23:10  [43]  at io.netty.handler.codec.http2.Http2Exception.streamError(Http2Exception.java:167)
Mar 12 23:23:10  [43]  at io.netty.handler.codec.http2.DefaultHttp2RemoteFlowController$FlowState.cancel(DefaultHttp2RemoteFlowController.java:481)
Mar 12 23:23:10  [43]  at io.netty.handler.codec.http2.DefaultHttp2RemoteFlowController$1.onStreamClosed(DefaultHttp2RemoteFlowController.java:105)
Mar 12 23:23:10  [43]  at io.netty.handler.codec.http2.DefaultHttp2Connection.notifyClosed(DefaultHttp2Connection.java:356)
Mar 12 23:23:10  [43]  at io.netty.handler.codec.http2.DefaultHttp2Connection$ActiveStreams.removeFromActiveStreams(DefaultHttp2Connection.java:1000)
Mar 12 23:23:10  [43]  at io.netty.handler.codec.http2.DefaultHttp2Connection$ActiveStreams.deactivate(DefaultHttp2Connection.java:956)
Mar 12 23:23:10  [43]  at io.netty.handler.codec.http2.DefaultHttp2Connection$DefaultStream.close(DefaultHttp2Connection.java:512)
Mar 12 23:23:10  [43]  at io.netty.handler.codec.http2.DefaultHttp2Connection.close(DefaultHttp2Connection.java:152)
Mar 12 23:23:10  [43]  at io.netty.handler.codec.http2.Http2ConnectionHandler$BaseDecoder.channelInactive(Http2ConnectionHandler.java:209)
Mar 12 23:23:10  [43]  at io.netty.handler.codec.http2.Http2ConnectionHandler.channelInactive(Http2ConnectionHandler.java:417)
Mar 12 23:23:10  [43]  at com.eatthepath.pushy.apns.ApnsClientHandler.channelInactive(ApnsClientHandler.java:499)
Mar 12 23:23:10  [43]  at com.eatthepath.pushy.apns.TokenAuthenticationApnsClientHandler.channelInactive(TokenAuthenticationApnsClientHandler.java:143)
Mar 12 23:23:10  [43]  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
Mar 12 23:23:10  [43]  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
Mar 12 23:23:10  [43]  at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
Mar 12 23:23:10  [43]  at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
Mar 12 23:23:10  [43]  at io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:277)
Mar 12 23:23:10  [43]  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
Mar 12 23:23:10  [43]  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
Mar 12 23:23:10  [43]  at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
Mar 12 23:23:10  [43]  at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:389)
Mar 12 23:23:10  [43]  at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:354)
Mar 12 23:23:10  [43]  at io.netty.handler.ssl.SslHandler.channelInactive(SslHandler.java:1121)
Mar 12 23:23:10  [43]  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
Mar 12 23:23:10  [43]  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
Mar 12 23:23:10  [43]  at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
Mar 12 23:23:10  [43]  at io.netty.handler.proxy.ProxyHandler.channelInactive(ProxyHandler.java:231)
Mar 12 23:23:10  [43]  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
Mar 12 23:23:10  [43]  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
Mar 12 23:23:10  [43]  at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
Mar 12 23:23:10  [43]  at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelInactive(CombinedChannelDuplexHandler.java:418)
Mar 12 23:23:10  [43]  at io.netty.channel.CombinedChannelDuplexHandler.channelInactive(CombinedChannelDuplexHandler.java:223)
Mar 12 23:23:10  [43]  at io.netty.handler.proxy.HttpProxyHandler$HttpClientCodecWrapper.channelInactive(HttpProxyHandler.java:267)
Mar 12 23:23:10  [43]  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
Mar 12 23:23:10  [43]  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
Mar 12 23:23:10  [43]  at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
Mar 12 23:23:10  [43]  at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
Mar 12 23:23:10  [43]  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
Mar 12 23:23:10  [43]  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
Mar 12 23:23:10  [43]  at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
Mar 12 23:23:10  [43]  at io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:819)
Mar 12 23:23:10  [43]  at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
Mar 12 23:23:10  [43]  at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
Mar 12 23:23:10  [43]  at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
Mar 12 23:23:10  [43]  at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
Mar 12 23:23:10  [43]  at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
Mar 12 23:23:10  [43]  at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
Mar 12 23:23:10  [43]  at java.lang.Thread.run(Thread.java:745)
janzar commented 3 years ago

Heh. Seems that it is just stack trace where error was created...

jchambers commented 3 years ago

It sounds like everything's getting garbage-collected eventually, though I do agree that there's room to improve. I have a proposed fix in #858.

janzar commented 3 years ago

Hm, I wondering about this stack trace. why there is no com.eatthepath.pushy.apns.ApnsClientHandler stacktrace:

Mar 12 23:28:47   [43] io.netty.handler.codec.http2.StreamBufferingEncoder$Http2ChannelClosedException: Connection closed
Mar 12 23:28:47   [43]  at io.netty.handler.codec.http2.StreamBufferingEncoder.close(StreamBufferingEncoder.java:228)
Mar 12 23:28:47   [43]  at io.netty.handler.codec.http2.Http2ConnectionHandler$BaseDecoder.channelInactive(Http2ConnectionHandler.java:204)
Mar 12 23:28:47   [43]  at io.netty.handler.codec.http2.Http2ConnectionHandler.channelInactive(Http2ConnectionHandler.java:417)
Mar 12 23:28:47   [43]  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
Mar 12 23:28:47   [43]  at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
Mar 12 23:28:47   [43]  at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelInactive(CombinedChannelDuplexHandler.java:418)
Mar 12 23:28:47   [43]  at io.netty.channel.CombinedChannelDuplexHandler.channelInactive(CombinedChannelDuplexHandler.java:223)
Mar 12 23:28:47   [43]  at io.netty.handler.proxy.HttpProxyHandler$HttpClientCodecWrapper.channelInactive(HttpProxyHandler.java:267)
Mar 12 23:28:47   [43]  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
Mar 12 23:28:47   [43]  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
Mar 12 23:28:47   [43]  at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
Mar 12 23:28:47   [43]  at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
Mar 12 23:28:47   [43]  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
Mar 12 23:28:47   [43]  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
Mar 12 23:28:47   [43]  at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
Mar 12 23:28:47   [43]  at io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:819)
Mar 12 23:28:47   [43]  at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
Mar 12 23:28:47   [43]  at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
Mar 12 23:28:47   [43]  at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:497)
Mar 12 23:28:47   [43]  at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
Mar 12 23:28:47   [43]  at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
Mar 12 23:28:47   [43]  at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
Mar 12 23:28:47   [43]  at java.lang.Thread.run(Thread.java:745)
jchambers commented 3 years ago

Can you please provide more context? Where did that stack trace come from? Why do you think it's related to this issue? I'm not saying it isn't related, but I think I'm missing a step.

janzar commented 3 years ago

It comes from:

future = ApnsClient::sendNotification(notification);
...
future.whenComplete((response, ex) -> {
  ..
  Throwable exception = (ex instanceof ExecutionException) ? ((ExecutionException)ex).getCause() : ex;
  ...
  log.error(message, exception);
});

But looks like I hurried to write about it there. At the moment I only known, that there are lot of notifications in memory. from histogram:

13: 826777 39685296 com.eatthepath.pushy.apns.util.SimpleApnsPushNotification 26: 826777 19842648 com.eatthepath.pushy.apns.util.concurrent.PushNotificationFuture

We were unable to take full heap dump. Last time when we had problems with memory, have took heap dump from normally working server. Then all counts of objects were expected. PushNotificationFuture are mostly referenced only by ApnsClientHandler::unattachedResponsePromisesByStreamId. Need somehow to reproduce this problem...:/

janzar commented 3 years ago

Hello, still no idea what did happened, but some strange numbers from histrogram. My be you will have an idea where to look.

There are hundreds of thousands of PendingStream: 16: 738525 35449200 io.netty.handler.codec.http2.DefaultHttp2Headers 19: 738525 29541000 io.netty.handler.codec.http2.StreamBufferingEncoder$HeadersFrame 25: 738525 23632800 io.netty.handler.codec.http2.StreamBufferingEncoder$DataFrame 32: 738525 17724600 io.netty.handler.codec.http2.StreamBufferingEncoder$PendingStream

1642: 8 960 com.eatthepath.pushy.apns.TokenAuthenticationApnsClientHandler only 8 handlers. on that server we have several ApnsClient instances with 1-2 connections, that maybe not opened. and one client with 100 maximum connections for application from store. When server starts, I see that this client opens 100 connections to Apple. From chart I see, that 2021-03-12 evening there are lot of connection removes and adds on server (465 removes and 375 creates). some connection creation failures (16). Server continued to work normally (count of delivered pushes as expected) for about 24 hours. Though that period there were about 15 connection removes and adds. And then 2021-03-13 evening it stopped to work because of Full GC. Also, what is interesting, that there are lot of lambdas in ApnsChannelPool: 105: 37619 902856 com.eatthepath.pushy.apns.ApnsChannelPool$$Lambda$1408/1715923549 118: 37619 601904 com.eatthepath.pushy.apns.ApnsChannelPool$$Lambda$1417/683471411 1130: 105 2520 com.eatthepath.pushy.apns.ApnsChannelPool$$Lambda$1505/2034895318 and 105 in ApnsChannelFactory: 1129: 105 2520 com.eatthepath.pushy.apns.ApnsChannelFactory$$Lambda$1454/1030412547 As understand 105 relates to count of pendingCreateChannelFutures in ApnsChannelPool. And 37619 to pendingAcquisitionPromises.

Thanks, Z.

janzar commented 3 years ago

Hmm, maybe problem is similar to that: #853 ? Locally I was able to reproduce. Talking about count of PendingStreams, not sure, if it is easily possible, but it would be great if pushy had some configurable upper limit of pushes in the flight and optional timeout for delivery. Looks like placing semaphore before ApnsClient.sendNotification and cancelling returned futures just postpone problems with memory when something bad happens at Apple side.

jchambers commented 3 years ago

@janzar I'm having a little trouble keeping track of the main issue here. I think the main problem is that you're seeing high memory usage under poor network conditions. Is that accurate? If so, I think #854 and #858 taken together should address at least most of the issue. I'm going to close this for now, but let's reopen the issue if problems persist after those fixes get shipped.

janzar commented 3 years ago

Hello, Jon. Yes, I think it is correct. Thank you very much for help.