Closed bazi closed 7 years ago
Well, an InternalServerError is exactly that: an error on Apple's side. If you can reliably cause these errors by leaving a connection idle for a long time, I'd recommend filing a bug report with Apple.
In which cases may APNs return such an error?
I think it's safe to describe these as "unexpected." When we get one of these errors, it means that Pushy succeeded in handing a message to the APNs server, but then the server replied and said "something has gone wrong on our side." We don't really know what went wrong, but it's generally safe to assume that it's a temporary thing that Apple will eventually try to fix.
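As a rule of thumb for callers, server-side reasons such as InternalServerError are transient and retryable, while other documented APNs rejection reasons call for different handling. Here is a hedged, self-contained sketch of that triage (this is illustrative code, not part of Pushy; the reason strings are the ones Apple documents for the HTTP/2 APNs API):

```java
// Decides how a caller might handle an APNs rejection reason string.
// Illustrative helper only, not part of Pushy; the reason strings are
// those documented by Apple for the HTTP/2 provider API.
final class RejectionPolicy {

    enum Action { RETRY_LATER, DROP_TOKEN, FIX_NOTIFICATION }

    static Action actionFor(final String reason) {
        switch (reason) {
            case "InternalServerError":
            case "ServiceUnavailable":
            case "TooManyRequests":
                // Temporary, server-side trouble: retry after a delay.
                return Action.RETRY_LATER;
            case "Unregistered":
            case "BadDeviceToken":
                // The device token is no longer (or never was) valid.
                return Action.DROP_TOKEN;
            default:
                // Malformed payload, bad topic, etc.: fix before resending.
                return Action.FIX_NOTIFICATION;
        }
    }
}
```

The point of the sketch is simply that InternalServerError belongs in the "retry later" bucket rather than the "something is wrong with my notification" bucket.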
Should pushy be closing and reconnecting in such cases?
I don't think so; APNs defines specific cases where the server will ask us to reconnect (by way of an HTTP/2 GOAWAY frame), and InternalServerError is not one of them. There's no guarantee that the problem will be resolved by reconnecting, or that the problem will remain for the lifetime of the connection.
In all, I think Pushy is holding up its end of the bargain here, and the problem is on Apple's side. I don't think there are any technical changes we should make here, and so I'll close this issue for now. @bazi if you think that's incorrect, please leave a comment here and we'll happily revisit the issue.
Cheers!
I encountered the exact same problem (against Apple's sandbox environment). I had a retry mechanism that kept retrying the push, and it continued to fail for 48 hours before I restarted the server software. After that, there were no problems anymore. Notably, the same internal error was encountered for both of the topics used (for two different apps). I'm also using Pushy 0.9, but with Oracle Java 1.8.0_60.
Huh… I still think this is an upstream problem, but @markusrydh, can you be more specific about what you did to make this happen? Can you reproduce it?
The (development) server was more or less idle (no pushes for a day); then the server sent a VoIP push to the client. It immediately got an InternalServerError, and all subsequent pushes got that error until the server was restarted.
I googled a bit, and other libraries using the HTTP/2 token-based approach are seeing the same problem. A connection close/reopen seems to do the trick. Although it might be a problem on Apple's side, it would be nice if there were a workaround. I am currently thinking about setting up a new connection if I get repeated InternalServerErrors, but of course it would be nice if the library did this behind the scenes. Or maybe a "force close" option, to make the connection close immediately and then re-open (as it would for a normally broken connection). It would be a more lightweight option than disconnect().
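The "new connection after repeated InternalServerErrors" idea above can be sketched as a small counter that resets on any success. This is purely illustrative code, independent of Pushy, and the threshold of 3 is an arbitrary assumption:

```java
// Tracks consecutive InternalServerError responses and signals when a
// forced close/reopen of the connection is probably warranted.
// Illustrative only; the threshold value is an arbitrary choice.
final class InternalErrorTracker {

    private final int threshold;
    private int consecutiveErrors = 0;

    InternalErrorTracker(final int threshold) {
        this.threshold = threshold;
    }

    /** Call after each push attempt; returns true when a reconnect is advised. */
    synchronized boolean recordResult(final boolean internalServerError) {
        if (internalServerError) {
            consecutiveErrors++;
        } else {
            consecutiveErrors = 0;
        }
        return consecutiveErrors >= threshold;
    }
}
```

A caller would feed each send result into `recordResult` and, when it returns true, close and re-establish the connection before continuing.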
Just a reference to some of the issues I found in node-apn that seem to be related: https://github.com/node-apn/node-apn/issues/463 https://github.com/node-apn/node-apn/issues/461 https://github.com/node-apn/node-apn/issues/449
And no, I have not been able to reproduce it yet.
Thanks for the links; I've shared some information there that I hope will be helpful to the node-apn community. For our part, I still think the best bet is to file a bug report with Apple. It sounds like the emerging consensus on reproduction steps is:
1. Open a connection to APNs.
2. Leave the connection idle for an extended period, without exchanging PING frames.
3. Send a notification and observe an InternalServerError.
Seem right? If so, I'll file a bug report with Apple with the information we have so far.
I've reported this to Apple as rdar://30168506.
@jchambers Point 2 is questionable. I suspected before that idle connections caused the issue, but we hit it in our production environment, where we usually do not have such long idle times. Anyway, your steps seem reasonable to me.
Acknowledged; will update the ticket upstream.
Another observation:
To get rid of the InternalServerError exceptions, we restart the app, and when stopping we see plenty of error logs with ClientNotConnectedException (stack trace below). I am pretty sure that those exceptions are for old push notifications, not for those that were queued while the JVM was terminating. According to the logs, the notifications were neither accepted nor rejected by APNs, so it seems the listener threads hung waiting for the future result.
I have suspicions that this may be related to the issue. Not sure though.
I have to mention that we use a custom event loop group with 4 threads, and all of the error logs were from the same thread (nioEventLoopGroup-2-2 in the stack trace below).
2017-01-30 08:31:24,300 ERROR [nioEventLoopGroup-2-2] o.g.g.a.ApnsResponseListener Failed to send notification id=67087_1985304421
java.util.concurrent.ExecutionException: com.relayrides.pushy.apns.ClientNotConnectedException: Client disconnected unexpectedly.
at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:54)
at org.grouvi.gpns.apns.ApnsResponseListener.operationComplete(ApnsResponseListener.java:59)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:514)
at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:507)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:486)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:427)
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:129)
at com.relayrides.pushy.apns.ApnsClient$2$2.run(ApnsClient.java:495)
at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
at io.netty.util.concurrent.PromiseTask.run(PromiseTask.java:73)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:418)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:454)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:873)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.relayrides.pushy.apns.ClientNotConnectedException: Client disconnected unexpectedly.
... 9 common frames omitted
We recently upgraded our servers from Pushy 0.7.2 to 0.9.2 (and API tokens - hurrah!) and we just ran into this situation on our development environment over the weekend. I noticed in the node thread that someone put in a hack to reset every hour. Do folks have a workaround for the issue while we wait for Apple to fix their end?
@bazi My apologies—I completely missed your comment from the end of January. Thanks for the additional data! I'm not sure what to make of it, but will look into it.
@ksquaredkey Not sure. I saw your post on the mailing list, and I think that's the right place to ask. I'll see what I can learn through other channels.
@bazi I think what you saw ("notifications were not accepted and neither rejected by APNS") may be related to or the same as #435; does that seem plausible to you?
Yes, I do agree that this issue and #435 are related.
But I have to note that I can call ApnsClient.disconnect() and reconnect again, as opposed to the issue mentioned, where the reporter says he can't disconnect and reconnect.
We have implemented a regular reconnection mechanism which is triggered every 30 minutes or on an InternalServerError, whichever comes first, with some thresholding. This resolves many of our problems.
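The "every 30 minutes or on an InternalServerError, whichever comes first" policy can be captured as a small time-based decision helper. This is a sketch with assumed constants, independent of Pushy; the caller supplies timestamps so the logic stays testable:

```java
// Decides whether it is time to cycle the APNs connection, either because
// the reconnect interval has elapsed or because an InternalServerError was
// just observed. Illustrative only; 30 minutes mirrors the interval above.
final class ReconnectPolicy {

    private final long intervalMillis;
    private long lastReconnectMillis;

    ReconnectPolicy(final long intervalMillis, final long nowMillis) {
        this.intervalMillis = intervalMillis;
        this.lastReconnectMillis = nowMillis;
    }

    /** True when the interval has elapsed or an internal error was seen. */
    boolean shouldReconnect(final long nowMillis, final boolean sawInternalServerError) {
        return sawInternalServerError
                || (nowMillis - lastReconnectMillis) >= intervalMillis;
    }

    /** Record that a reconnect just happened, resetting the interval. */
    void markReconnected(final long nowMillis) {
        this.lastReconnectMillis = nowMillis;
    }
}
```

In practice one would check `shouldReconnect` before each batch of sends (or on a timer) and call `markReconnected` after cycling the connection.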
I had concluded to myself that it is an upstream issue, which you had already stated in your replies. By the way, I really appreciate your efforts in solving such issues. Thanks for the great Pushy!
We've been seeing this a lot recently as well. Any chance the reconnecting mechanism could be included in Pushy itself?
Any chance the reconnecting mechanism could be included in Pushy itself?
As mentioned earlier, I'd really rather not if at all possible. I acknowledge that this is a real problem for real users, but I still think the right thing to do—for now at least—is to do exactly what @bazi is doing:
We have implemented a regular reconnection mechanism which is triggered every 30 minutes or on an InternalServerError, whichever comes first, with some thresholding.
I'm still trying as hard as I can to push for an upstream fix, which seems like The Right Thing To Do. If I get a clear sign that no fix is coming, we'll talk about adding a workaround directly to Pushy.
By "upstream fix" I imagine you mean Apple?
If I get a clear sign that no fix is coming, we'll talk about adding a workaround directly to Pushy.
Makes sense. I'm not holding my breath for Apple doing anything about it though :/
By "upstream fix" I imagine you mean Apple?
Correct. I do have reason to believe they're looking into the issue now (and maybe weren't before), so this isn't a "wait forever" situation.
That's great news :)
Funny, I'm back to this issue again myself. We just tried implementing the disconnect/connect approach, and it does not help: the ApnsClient continues to send the expired token on the new connection, and Apple continues to send us the InternalServerError response.
The workaround folks suggested to me on the mailing list (last month), destroying and creating new ApnsClient instances, won't work for us. Our server is Scala-based and all of our infrastructure is immutable; it would be a major rewrite to support nuking the ApnsClient instances themselves.
Digging through the source, I figured out the crux of the issue (Apple's bug notwithstanding): the invalidation of the authentication token in the AuthenticationTokenSupplier. The only way it gets invalidated (as of 0.9.3) is on receipt and processing of an ExpiredProviderToken response, so changing connections will not help. As near as I can tell, Jon's recommendation in the Radar of a GOAWAY will not help either. There is only one way to clear that token: the ExpiredProviderToken response (well, OK, two ways: nuke the object containing it).
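For context on why stale tokens matter here: Apple's token-based authentication rules refuse provider tokens older than an hour (normally answered with ExpiredProviderToken). A minimal, Pushy-independent staleness check might look like the following; the 50-minute refresh margin is an assumption chosen to leave headroom under the one-hour limit:

```java
// Checks whether a JWT provider token is old enough that it should be
// regenerated before the next send. APNs refuses tokens older than one
// hour; refreshing a bit early (here, at 50 minutes) leaves headroom.
// Illustrative helper, not part of Pushy.
final class TokenAge {

    private static final long REFRESH_AFTER_MILLIS = 50 * 60 * 1000L;

    static boolean isStale(final long issuedAtMillis, final long nowMillis) {
        return (nowMillis - issuedAtMillis) >= REFRESH_AFTER_MILLIS;
    }
}
```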
So it seems to me that we just need a mechanism to invalidate the token before attempting a send after more than an hour of idle time, and let a new token be created. I did not want to get into all sorts of timeout management, etc., inside Pushy, and others might not care for the added complexity, so I propose:
Adding a public method ApnsClient.invalidateAuthToken() that calls .invalidateToken() on the AuthenticationTokenSupplier instance.
Calling this will then cause a new auth token to be created on the next send(). This would be a minimal change to Pushy but would still allow my code to work around Apple's bug.
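A self-contained mock of the proposed pattern might look like this. The class and method names are modeled on the ones mentioned above, but this is not Pushy's actual implementation; a real supplier would sign a JWT, while here each "token" is stamped with a counter so the effect of invalidation is observable:

```java
// Mock of the proposed invalidateAuthToken() flow. The client exposes an
// invalidation hook, and the supplier lazily mints a fresh token on the
// next send. Not Pushy's actual code; for illustration only.
final class MockApnsClient {

    static final class AuthenticationTokenSupplier {
        private long generation = 0;
        private String token = null;

        synchronized String getToken() {
            if (token == null) {
                token = "token-" + (++generation);
            }
            return token;
        }

        synchronized void invalidateToken() {
            token = null; // next getToken() call mints a fresh token
        }
    }

    private final AuthenticationTokenSupplier supplier = new AuthenticationTokenSupplier();

    /** The proposed public hook: clear the cached token before a send. */
    void invalidateAuthToken() {
        supplier.invalidateToken();
    }

    /** Stand-in for a send; a real send would attach this token to the request. */
    String send() {
        return supplier.getToken();
    }
}
```

The key property is that the token is reused across sends until the caller explicitly invalidates it, after which the next send picks up a fresh one.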
I will probably do a PoC of this as a fork of Pushy, but I was hoping for some feedback on whether folks think this is A Really Bad Idea. I'm stuck between Apple's bug and a ship date; I cannot wait for Apple to fix their end.
Thoughts?
This is a thoughtful and helpful question, and I feel bad for the brevity of the reply, but here it goes: I think this is fixed in #451.
On reflection, that was way too brief. In #451, we moved token handling out of the ApnsClient and into the ApnsClientHandler. That means that reconnecting will invalidate and re-create the authentication token.
Thanks Jon! I was looking at the 0.9.3 branch yesterday. Looking at master, I see what you're talking about; looks good. Given the level of change in master, am I correct in assuming that code is still a while away from a new release? Just checking, as I think I'll still have to make my local modification for now and file a ticket on our board to remove it when the new version of Pushy comes out.
I was actually planning on doing a release tomorrow-ish, but it may take a few days to move through the Sonatype tubes to Maven Central, since we're moving to a new group ID (com.relayrides to com.turo).
oh very nice!!! You made my day (which happens to be my birthday - thanks for the present ;-)
Folks, with apologies for the long delay, I've just received a reply from Apple requesting more information. They write:
Are you still seeing the issue mentioned in the description? If you do, would you please provide some more details with a reproduction?
- Bundle ID.
- Time frame + time zone of the reproduction.
- The IP address of your service will be helpful.
- Detailed information about pushes (token, message ID, timestamp) around the reproduction time frame would also be helpful.
I'd happily accept an answer to the first question ("are you still seeing this issue?") here. If you're still experiencing the issue and you're comfortable sharing bundle IDs, timestamps, IPs, and push notification details, please email me directly at jon@turo.com.
Thanks!
Well, it has been half a year since I opened this issue. Since then we have applied our custom fixes (like periodic reconnects), and there have been a couple of Pushy releases as well. In short, we are no longer facing this issue. I want to note that, as I see it, there were other issues mentioned here. Please provide your current status regarding your issues... @markusrydh @ksquaredkey @NachoSoto @eager
I had this problem, and found this thread. @bazi could you please share any graceful way to do periodic reconnects from pushy client?
@StarWindMoonCloud This is actually very timely. Did you experience this problem recently? With Pushy? If so, could you please email me the information requested by Apple (please see https://github.com/relayrides/pushy/issues/408#issuecomment-318142876).
Thank you!
Sent @jchambers
Thanks, @StarWindMoonCloud. I've passed the information along to Apple. We'll see what happens next.
@jchambers Any update from Apple? It happened again on my side today. I'm worried Apple might be slow, since this thread started over half a year ago...
I'm worried Apple might be slow, since this thread started over half a year ago...
…and there's the understatement of the month ;)
No reply yet, but it's still the weekend. In my experience, they usually stay focused on a problem once they've acknowledged it. Let's give them a few days, but after that, it's probably reasonable to use one of the workarounds mentioned in this thread (like closing/reopening a connection when you get an InternalServerError).
To follow up, I have received no additional information from Apple about this issue since last month.
I had this problem, and found this thread. @bazi could you please share any graceful way to...
@StarWindMoonCloud sorry for the late response. Here are the steps to do a re-connection:
- ApnsClient.disconnect().await()
- new ApnsClientBuilder().setSigningKey( signingKey ).build()
Make sure that during the re-connection phase you pause all of your notification-sending operations. Release all of your references to the disconnected client(s) and use the newly created client(s).
Now that we've moved to a pooled connection model, callers don't have direct control over connections and we may need to take additional steps to address the issue within Pushy itself. Does anybody know if this is still A Thing in practice, though?
I do still get InternalServerErrors from Apple using an admittedly old version of Pushy (0.9.2); attaching a log. It is not an uncommon error, and I am manually disconnecting and reconnecting with a new connection to get things working again. I don't know if a newer version of Pushy would change this in any way? I will update to a newer version, but maybe I can't use the latest version if I cannot restart the connection? Pushy_internalservererror.log
I don't know if a newer version of Pushy would change this in any way?
I'd say it's very unlikely. The problem has been reported across a wide variety of APNs libraries on a variety of platforms (a popular node.js APNs library had the same problem, for example), and so it appears to be beyond our control.
I will update to a newer version, but maybe I can't use the latest version if I cannot restart the connection?
That's an oversight on my part, and we should have a new release in a couple days that works around the problem. I'd recommend holding off until then.
This should be fixed in #529, which will ship soon.
Fantastic, this will simplify our code quite a bit! Thanks Jon!
We are using Pushy in our production environment and it works great! We also have a staging environment where tests are performed prior to launching new versions. The staging environment is not always used, and most of the time it sits idle. Today, when we started to use staging, we got InternalServerError errors (stack trace given below). I guess it is related to Pushy's long idle state, though I'm not sure. We didn't have any choice but to restart the app. In which cases may APNs return such an error? Should Pushy be closing and reconnecting in such cases? Or is it our duty to reconnect upon such an error?
Using Pushy v0.9, Java 1.8.0_102 (Oracle)