timehop / apns

A Go package to interface with the Apple Push Notification Service
https://godoc.org/github.com/timehop/apns
MIT License
185 stars 47 forks source link

Handling timeouts/disconnects, dropping some pushes? #47

Open finkel opened 9 years ago

finkel commented 9 years ago

Since we switched to using this library in production, we've had many reports of dropped push notifications. I think it's because of the persistent connection, which may at times error and then disconnect. Sounds like they're trying to handle this in https://github.com/pranavraja/apns, though that looks like less mature.

Any thoughts on how to detect/avoid this, or any recommendations? Appreciate the library and the help.

Thanks, Ben

freeeve commented 9 years ago

I also recently switched, and I've noticed this as well. (at least, increased dropped notifications) I wasn't sure where to assign blame yet, as Apple doesn't guarantee delivery either.

nathany commented 9 years ago

Hi Ben,

The intent is when an error comes back from Apple, any notifications from that point on are resent. See: https://developer.apple.com/library/ios/technotes/tn2265/_index.html#//apple_ref/doc/uid/DTS40010376-CH1-TNTAG44

However, it only currently buffers 50 notifications, which may not be enough. @bdotdub is the original maintainer of the project, and I imagine he hasn't ran into this issue in production. I'm still trying to get my app ready for production, so you're further along than me. :-)

finkel commented 9 years ago

Hi Nathan,

Thanks for the response. That's true, but I'm pretty sure that there's also a scenario when Apple gets a certain kind of error that causes no response, just a silent disconnect (great API, I know), and the persistent connection used in this library doesn't handle that.

nathany commented 9 years ago

It does have a retry loop (in RunLoop). But right now it's not exposing the reason why there was an error and is retrying.

finkel commented 9 years ago

Hmm, I see that. I'm not seeing "EOF trying to write notification", or "err writing to apns" in my logs, so I wonder if there's a kind of error that doesn't break it out of the run loop, thus it gets hung in a state that drops pushes until restart.

nathany commented 9 years ago

Unfortunately all I can suggest atm is to add more logging in a fork of timehop/apns. :frowning:

If you do end up trying @pranavraja's library, it would be great to hear how it goes.

taylortrimble commented 9 years ago

I'll help look into this.

On Fri, May 15, 2015 at 12:48 PM, Nathan Youngman notifications@github.com wrote:

Unfortunately all I can suggest atm is to add more logging in a fork of timehop/apns. :frowning:

If you do end up trying @pranavraja's library, it would be great to hear how it goes.

Reply to this email directly or view it on GitHub: https://github.com/timehop/apns/issues/47#issuecomment-102507660

bdotdub commented 9 years ago

We've actually only just ran into this issue last week.

@finkel you're right re: error in the run loop. Doing some logging and debugging on our end, we've found that it gets stuck in some failure condition in these lines and can continually get an error, ie. never successfully reconnects: https://github.com/timehop/apns/blob/master/client.go#L132-L136

I can't tell if Apple is giving us a slap on the wrist for trying to connect to quickly.

We added logging in that if block in our local version and have seen a lot of EOF errors:

2015/05/04 21:37:22 apns: client connect error: EOF
2015/05/04 21:37:24 apns: client connect error: EOF
2015/05/04 21:37:26 apns: client connect error: EOF
2015/05/04 21:37:27 apns: client connect error: read tcp xx.xxxx.xxxx:2195: connection reset by peer

I'm currently at a bit of a loss as to why it's happening. In the short term, we could put a log statement in there or pass the connection error back on the channel. In the longer term, we've been talking about doing a refactor to remove the runLoop approach for exactly this scenario, where we sweep some class of errors under the rug.

@finkel @wfreeman how are you guys detecting dropped pushes?

nathany commented 9 years ago

Thanks Benny. I think it makes sense to add this logging into master in the interim.

Passing connection errors over FailedNotifs might be a bit odd right now though, since it's currently a chan of NotificationResults.

finkel commented 9 years ago

@bdotdub we've only been detecting them anecdotally (reports from users) which definitely started when we switched to this library, and of course they never repro on staging. They're very hard to log on the client. But recently I saw some very specific times when I could confirm they were being dropped, so now I'm investigated...

freeeve commented 9 years ago

For me it's just from user reports. I'm not seeing them in the FailedNotifs unless I'm doing something wrong (I increment an error count when I find those, and log the error count).

taylortrimble commented 9 years ago

I think it makes sense to add this logging into master in the interim.

I'd prefer a callback, so I could do the logging (and other custom app error recording) myself!

nathany commented 9 years ago

Reporting connection errors and read errors over a channel (FailedNotifs atm) seems like a reasonable solution as well.

I'm curious how many messages are lost?

The resend logic starts after a failed notification. What happens if the notification failed for reasons other than it was invalid? What happens if the error from Apple is Shutdown or (maybe) Processing error, or if the connection is just dropped? Does it (should it) resend the notification that "caused" the failure?

nathany commented 8 years ago

One cause of EOF looping I've seen is expired certificates.

The crypto/x509 package provides a Verify() method for certificates. If it returns an error of type x509.CertificateInvalidError, the Reason field will be set to x509.Expired if the certificate is expired. I'm not sure if it's a good idea to verify the certificates every time before connecting or just when first loaded.

Verify() will also return errors of type x509.UnknownAuthorityError though, as the Apple certificate isn't in the system roots (and I haven't figured out what needs to be added to VerifyOptions.Roots). For now I just ignore this one.