Closed bazi closed 7 years ago
Well, an InternalServerError is exactly that: an error on Apple's side. If you can reliably cause these errors by leaving a connection idle for a long time, I'd recommend filing a bug report with Apple.
In which cases may APNs return such an error?
I think it's safe to describe these as "unexpected." When we get one of these errors, it means that Pushy succeeded in handing a message to the APNs server, but then the server replied and said "something has gone wrong on our side." We don't really know what went wrong, but it's generally safe to assume that it's a temporary thing that Apple will eventually try to fix.
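As a rule of thumb for callers, server-side reasons such as InternalServerError are transient and retryable, while other documented APNs rejection reasons call for different handling. Here is a hedged, self-contained sketch of that triage (this is illustrative code, not part of Pushy; the reason strings are the ones Apple documents for the HTTP/2 APNs API):

```java
// Decides how a caller might handle an APNs rejection reason string.
// Illustrative helper only, not part of Pushy; the reason strings are
// those documented by Apple for the HTTP/2 provider API.
final class RejectionPolicy {

    enum Action { RETRY_LATER, DROP_TOKEN, FIX_NOTIFICATION }

    static Action actionFor(final String reason) {
        switch (reason) {
            case "InternalServerError":
            case "ServiceUnavailable":
            case "TooManyRequests":
                // Temporary, server-side trouble: retry after a delay.
                return Action.RETRY_LATER;
            case "Unregistered":
            case "BadDeviceToken":
                // The device token is no longer (or never was) valid.
                return Action.DROP_TOKEN;
            default:
                // Malformed payload, bad topic, etc.: fix before resending.
                return Action.FIX_NOTIFICATION;
        }
    }
}
```

The point of the sketch is simply that InternalServerError belongs in the "retry later" bucket rather than the "something is wrong with my notification" bucket.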
Should pushy be closing and reconnecting in such cases?
I don't think so; APNs defines specific cases where the server will ask us to reconnect (by way of an HTTP/2 GOAWAY frame), and InternalServerError is not one of them. There's no guarantee that the problem will be resolved by reconnecting, or that the problem will remain for the lifetime of the connection.
In all, I think Pushy is holding up its end of the bargain here, and the problem is on Apple's side. I don't think there are any technical changes we should make here, and so I'll close this issue for now. @bazi if you think that's incorrect, please leave a comment here and we'll happily revisit the issue.
Cheers!
I encountered the exact same problem (against Apple's sandbox environment). I had a retry mechanism that kept retrying the push, and it continued to fail for 48 hours before I restarted the server software. After that, there were no problems anymore. Notably, the same internal error was encountered for both of the topics used (for two different apps). I'm also using Pushy 0.9, but with Oracle Java 1.8.0_60.
Huh… I still think this is an upstream problem, but @markusrydh, can you be more specific about what you did to make this happen? Can you reproduce it?
The (development) server was more or less idle (no pushes for a day); then the server sent a VoIP push to the client. It immediately got an InternalServerError, and all subsequent pushes got that error until the server was restarted.
I googled a bit, and other libraries using the HTTP/2 token-based approach are seeing the same problem. A connection close/reopen seems to do the trick. Although it might be a problem on Apple's side, it would be nice if there were a workaround. I am currently thinking about setting up a new connection if I get repeated InternalServerErrors, but of course it would be nice if the library did this behind the scenes. Or maybe a "force close" option, to make the connection close immediately and then re-open (as it would for a normally broken connection). It would be a more lightweight option than disconnect().
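The "new connection after repeated InternalServerErrors" idea above can be sketched as a small counter that resets on any success. This is purely illustrative code, independent of Pushy, and the threshold of 3 is an arbitrary assumption:

```java
// Tracks consecutive InternalServerError responses and signals when a
// forced close/reopen of the connection is probably warranted.
// Illustrative only; the threshold value is an arbitrary choice.
final class InternalErrorTracker {

    private final int threshold;
    private int consecutiveErrors = 0;

    InternalErrorTracker(final int threshold) {
        this.threshold = threshold;
    }

    /** Call after each push attempt; returns true when a reconnect is advised. */
    synchronized boolean recordResult(final boolean internalServerError) {
        if (internalServerError) {
            consecutiveErrors++;
        } else {
            consecutiveErrors = 0;
        }
        return consecutiveErrors >= threshold;
    }
}
```

A caller would feed each send result into `recordResult` and, when it returns true, close and re-establish the connection before continuing.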
Just a reference to some of the issues I found in node-apn that seem to be related: https://github.com/node-apn/node-apn/issues/463 https://github.com/node-apn/node-apn/issues/461 https://github.com/node-apn/node-apn/issues/449
And no, I have not been able to reproduce it yet.
Thanks for the links; I've shared some information there that I hope will be helpful to the node-apn community. For our part, I still think the best bet is to file a bug report with Apple. It sounds like the emerging consensus on reproduction steps is:
1. Open a connection to APNs.
2. Leave the connection idle for an extended period, without exchanging PING frames.
3. Send a notification and observe an InternalServerError.
Seem right? If so, I'll file a bug report with Apple with the information we have so far.
I've reported this to Apple as rdar://30168506.
@jchambers Point 2 is questionable. I suspected before that idle connections caused the issue, but we hit it in our production environment, where we usually do not have such long idle times. Anyway, your steps seem reasonable to me.
Acknowledged; will update the ticket upstream.
Another observation:
To get rid of the InternalServerError exceptions, we restart the app, and when stopping we see plenty of error logs with ClientNotConnectedException (stack trace below). I am pretty sure that those exceptions are for old push notifications, not for those that were queued while the JVM was terminating. According to the logs, the notifications were neither accepted nor rejected by APNs, so it seems the listener threads hung waiting for the future result.
I have suspicions that this may be related to the issue. Not sure though.
I have to mention that we use a custom event loop group with 4 threads, and all of the error logs were from the same thread (nioEventLoopGroup-2-2 in the stack trace below).
2017-01-30 08:31:24,300 ERROR [nioEventLoopGroup-2-2] o.g.g.a.ApnsResponseListener Failed to send notification id=67087_1985304421
java.util.concurrent.ExecutionException: com.relayrides.pushy.apns.ClientNotConnectedException: Client disconnected unexpectedly.
at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:54)
at org.grouvi.gpns.apns.ApnsResponseListener.operationComplete(ApnsResponseListener.java:59)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:514)
at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:507)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:486)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:427)
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:129)
at com.relayrides.pushy.apns.ApnsClient$2$2.run(ApnsClient.java:495)
at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
at io.netty.util.concurrent.PromiseTask.run(PromiseTask.java:73)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:418)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:454)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:873)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.relayrides.pushy.apns.ClientNotConnectedException: Client disconnected unexpectedly.
... 9 common frames omitted
We recently upgraded our servers from Pushy 0.7.2 to 0.9.2 (and API tokens - hurrah!) and we just ran into this situation on our development environment over the weekend. I noticed in the node thread that someone put in a hack to reset every hour. Do folks have a workaround for the issue while we wait for Apple to fix their end?
@bazi My apologies—I completely missed your comment from the end of January. Thanks for the additional data! I'm not sure what to make of it, but will look into it.
@ksquaredkey Not sure. I saw your post on the mailing list, and I think that's the right place to ask. I'll see what I can learn through other channels.
@bazi I think what you saw ("notifications were not accepted and neither rejected by APNS") may be related to or the same as #435; does that seem plausible to you?
Yes, I do agree that this issue and #435 are related.
But I have to note that I can call ApnsClient.disconnect() and reconnect again, as opposed to the issue mentioned, where the reporter says he can't disconnect and reconnect.
We have implemented a regular reconnection mechanism which is triggered every 30 minutes or on an InternalServerError, whichever comes first, with some thresholding. This resolves many of our problems.
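The "every 30 minutes or on an InternalServerError, whichever comes first" policy can be captured as a small time-based decision helper. This is a sketch with assumed constants, independent of Pushy; the caller supplies timestamps so the logic stays testable:

```java
// Decides whether it is time to cycle the APNs connection, either because
// the reconnect interval has elapsed or because an InternalServerError was
// just observed. Illustrative only; 30 minutes mirrors the interval above.
final class ReconnectPolicy {

    private final long intervalMillis;
    private long lastReconnectMillis;

    ReconnectPolicy(final long intervalMillis, final long nowMillis) {
        this.intervalMillis = intervalMillis;
        this.lastReconnectMillis = nowMillis;
    }

    /** True when the interval has elapsed or an internal error was seen. */
    boolean shouldReconnect(final long nowMillis, final boolean sawInternalServerError) {
        return sawInternalServerError
                || (nowMillis - lastReconnectMillis) >= intervalMillis;
    }

    /** Record that a reconnect just happened, resetting the interval. */
    void markReconnected(final long nowMillis) {
        this.lastReconnectMillis = nowMillis;
    }
}
```

In practice one would check `shouldReconnect` before each batch of sends (or on a timer) and call `markReconnected` after cycling the connection.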
I had concluded to myself that it is an upstream issue, which you had already stated in your replies. By the way, I really appreciate your efforts in solving such issues. Thanks for the great Pushy!
We've been seeing this a lot recently as well. Any chance the reconnecting mechanism could be included in Pushy itself?
Any chance the reconnecting mechanism could be included in Pushy itself?
As mentioned earlier, I'd really rather not if at all possible. I acknowledge that this is a real problem for real users, but I still think the right thing to do—for now at least—is to do exactly what @bazi is doing:
We have implemented a regular reconnection mechanism which is triggered every 30 minutes or on an InternalServerError, whichever comes first, with some thresholding.
I'm still trying as hard as I can to push for an upstream fix, which seems like The Right Thing To Do. If I get a clear sign that no fix is coming, we'll talk about adding a workaround directly to Pushy.
By "upstream fix" I imagine you mean Apple?
If I get a clear sign that no fix is coming, we'll talk about adding a workaround directly to Pushy.
Makes sense. I'm not holding my breath for Apple doing anything about it though :/
By "upstream fix" I imagine you mean Apple?
Correct. I do have reason to believe they're looking into the issue now (and maybe weren't before), so this isn't a "wait forever" situation.
That's great news :)
Funny, I'm back to this issue again myself. We just tried implementing the disconnect/connect approach, and it does not help: the ApnsClient continues to send the expired token on the new connection, and Apple continues to send us the InternalServerError response.
The workaround folks suggested to me on the mailing list (last month), destroying and creating new ApnsClient instances, won't work for us. Our server is Scala-based and all of our infrastructure is immutable; it would be a major rewrite to support nuking the ApnsClient instances themselves.
Digging through the source, I figured out the crux of the issue (Apple's bug notwithstanding): the invalidation of the authentication token in the AuthenticationTokenSupplier. The only way it gets invalidated (as of 0.9.3) is on receipt and processing of an ExpiredProviderToken response, so changing connections will not help. As near as I can tell, Jon's recommendation in the Radar of a GOAWAY will not help either. There is only one way to clear that token: the ExpiredProviderToken response (well, OK, two ways: nuke the object containing it).
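For context on why stale tokens matter here: Apple's token-based authentication rules refuse provider tokens older than an hour (normally answered with ExpiredProviderToken). A minimal, Pushy-independent staleness check might look like the following; the 50-minute refresh margin is an assumption chosen to leave headroom under the one-hour limit:

```java
// Checks whether a JWT provider token is old enough that it should be
// regenerated before the next send. APNs refuses tokens older than one
// hour; refreshing a bit early (here, at 50 minutes) leaves headroom.
// Illustrative helper, not part of Pushy.
final class TokenAge {

    private static final long REFRESH_AFTER_MILLIS = 50 * 60 * 1000L;

    static boolean isStale(final long issuedAtMillis, final long nowMillis) {
        return (nowMillis - issuedAtMillis) >= REFRESH_AFTER_MILLIS;
    }
}
```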
So it seems to me that we just need a mechanism to invalidate the token before attempting a send after more than an hour of idle time, and let a new token be created. I did not want to get into all sorts of timeout management, etc., inside Pushy, and others might not care for the added complexity, so I propose:
Adding a public method ApnsClient.invalidateAuthToken() that calls .invalidateToken() on the AuthenticationTokenSupplier instance.
Calling this will then cause a new auth token to be created on the next send(). This would be a minimal change to Pushy but would still allow my code to work around Apple's bug.
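A self-contained mock of the proposed pattern might look like this. The class and method names are modeled on the ones mentioned above, but this is not Pushy's actual implementation; a real supplier would sign a JWT, while here each "token" is stamped with a counter so the effect of invalidation is observable:

```java
// Mock of the proposed invalidateAuthToken() flow. The client exposes an
// invalidation hook, and the supplier lazily mints a fresh token on the
// next send. Not Pushy's actual code; for illustration only.
final class MockApnsClient {

    static final class AuthenticationTokenSupplier {
        private long generation = 0;
        private String token = null;

        synchronized String getToken() {
            if (token == null) {
                token = "token-" + (++generation);
            }
            return token;
        }

        synchronized void invalidateToken() {
            token = null; // next getToken() call mints a fresh token
        }
    }

    private final AuthenticationTokenSupplier supplier = new AuthenticationTokenSupplier();

    /** The proposed public hook: clear the cached token before a send. */
    void invalidateAuthToken() {
        supplier.invalidateToken();
    }

    /** Stand-in for a send; a real send would attach this token to the request. */
    String send() {
        return supplier.getToken();
    }
}
```

The key property is that the token is reused across sends until the caller explicitly invalidates it, after which the next send picks up a fresh one.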
I will probably do a PoC of this as a fork of Pushy, but I was hoping for some feedback on whether folks think this is A Really Bad Idea. I'm stuck between Apple's bug and a ship date; I cannot wait for Apple to fix their end.
Thoughts?
This is a thoughtful and helpful question, and I feel bad for the brevity of the reply, but here it goes: I think this is fixed in #451.
On reflection, that was way too brief. In #451, we moved token handling out of the ApnsClient and into the ApnsClientHandler. That means that reconnecting will invalidate and re-create the authentication token.
Thanks Jon! I was looking at the 0.9.3 branch yesterday. Looking at master, I see what you're talking about; looks good. Given the level of change in master, am I correct in assuming that code is still a while away from a new release? Just checking, as I think I'll still have to make my local modification for now and file a ticket on our board to remove it when the new version of Pushy comes out.
I was actually planning on doing a release tomorrow-ish, but it may take a few days to move through the Sonatype tubes to Maven Central, since we're moving to a new group ID (com.relayrides to com.turo).
oh very nice!!! You made my day (which happens to be my birthday - thanks for the present ;-)
Folks, with apologies for the long delay, I've just received a reply from Apple requesting more information. They write:
Are you still seeing the issue mentioned in the description? If you do, would you please provide some more details with a reproduction?
- Bundle ID.
- Time frame + time zone of the reproduction.
- The IP address of your service will be helpful.
- Detailed information about pushes (token, message ID, timestamp) around the reproduction time frame would also be helpful.
I'd happily accept an answer to the first question ("are you still seeing this issue?") here. If you're still experiencing the issue and you're comfortable sharing bundle IDs, timestamps, IPs, and push notification details, please email me directly at jon@turo.com.
Thanks!
Well, it has been half a year since I opened this issue. Since then we have applied our custom fixes (like periodic reconnects), and there have been a couple of Pushy releases as well. In short, we are no longer facing this issue. I want to note that, as I see it, there were other issues mentioned here. Please provide your current status regarding your issues... @markusrydh @ksquaredkey @NachoSoto @eager
I had this problem, and found this thread. @bazi could you please share any graceful way to do periodic reconnects from pushy client?
@StarWindMoonCloud This is actually very timely. Did you experience this problem recently? With Pushy? If so, could you please email me the information requested by Apple (please see https://github.com/relayrides/pushy/issues/408#issuecomment-318142876).
Thank you!
Sent @jchambers
Thanks, @StarWindMoonCloud. I've passed the information along to Apple. We'll see what happens next.
@jchambers Any update from Apple? It happened again on my side today. I'm worried Apple might be slow, since this thread started over half a year ago...
I'm worried Apple might be slow, since this thread started over half a year ago...
…and there's the understatement of the month ;)
No reply yet, but it's still the weekend. In my experience, they usually stay focused on a problem once they've acknowledged it. Let's give them a few days, but after that, it's probably reasonable to use one of the workarounds mentioned in this thread (like closing/reopening a connection when you get an InternalServerError).
To follow up, I have received no additional information from Apple about this issue since last month.
I had this problem, and found this thread. @bazi could you please share any graceful way to...
@StarWindMoonCloud sorry for the late response. Here are the steps to do a re-connection:
- ApnsClient.disconnect().await()
- new ApnsClientBuilder().setSigningKey( signingKey ).build()
Make sure that during the re-connection phase you pause all of your notification-sending operations. Release all of your references to the disconnected client(s) and use the newly created client(s).
Now that we've moved to a pooled connection model, callers don't have direct control over connections and we may need to take additional steps to address the issue within Pushy itself. Does anybody know if this is still A Thing in practice, though?
I do still get InternalServerErrors from Apple using an admittedly old version of Pushy (0.9.2); attaching a log. It is not an uncommon error, and I am manually disconnecting and reconnecting with a new connection to get things working again. I don't know if a newer version of Pushy would change this in any way? I will update to a newer version, but maybe I can't use the latest version if I cannot restart the connection? Pushy_internalservererror.log
I don't know if a newer version of Pushy would change this in any way?
I'd say it's very unlikely. The problem has been reported across a wide variety of APNs libraries on a variety of platforms (a popular node.js APNs library had the same problem, for example), and so it appears to be beyond our control.
I will update to a newer version, but maybe I can't use the latest version if I cannot restart the connection?
That's an oversight on my part, and we should have a new release in a couple days that works around the problem. I'd recommend holding off until then.
This should be fixed in #529, which will ship soon.
Fantastic, this will simplify our code quite a bit! Thanks Jon!
We are using Pushy in our production environment and it works great! We also have a staging environment where tests are performed prior to launching new versions. The staging environment is not always used, and most of the time it sits idle. Today, when we started to use staging, we got InternalServerError errors (stack trace given below). I guess it is related to Pushy's long idle state, though I'm not sure. We didn't have any choice but to restart the app. In which cases may APNs return such an error? Should Pushy be closing and reconnecting in such cases? Or is it our duty to reconnect upon such an error?
Using Pushy v0.9, Java 1.8.0_102 (Oracle)