ostinelli / apnotic

A Ruby APNs HTTP/2 gem able to provide instant feedback.
MIT License
477 stars 95 forks source link

Receiving sporadic InvalidProviderToken errors from Apple #109

Open thisisjeffwong opened 2 years ago

thisisjeffwong commented 2 years ago

We have multiple APN servers sending APNs to Apple. We occasionally get a sporadic outbreak of InvalidProviderToken errors from Apple that lasts for less than an hour and recovers itself. We aren't doing anything at the application level in response to the 403 errors other than reporting the errors.

Has anyone else experienced this?

I thought that maybe one of the servers on Apple's APN cluster might have a clock skew but that would theoretically affect all of our servers equally.

benubois commented 2 years ago

Hi @thisisjeffwong,

This does sound like a clock issue. I've made some changes to insulate apnotic from system time. Would you be up for trying out the monotonic branch to see if that helps?

Out of curiosity, what do you see in production when running Time.now?

Thanks!

thisisjeffwong commented 2 years ago

I could try it on my dev APN server and see if it still works. Have you had a chance to test this code against any sends to APNS?

For testing in production, would you be looking for evidence that this fixes the problem, or evidence that this fix does no harm?

We only get this bug every few weeks but we've gone as long as 4 months without a problem. If it's a relatively safe fix and others approve, we could incorporate it and just reopen this issue if we see it again.

thisisjeffwong commented 2 years ago

@benubois For Time.now on Rails Console, I see 2021-11-03 10:55:28.748582635 -0700, which looks normal.

The time drift is a pretty good explanation for why this is only happening to one server. However, the failures seen are interspersed with successes. Could the time used for the token only be off with respect to certain servers on Apple's cluster? Or is apnotic regenerating the token once a failure is detected.

benubois commented 2 years ago

I've done some more testing and now I'm not so sure this is time related.

An expired token actually results in a 403 ExpiredProviderToken error.

However an invalid team_id, or key_id results in 403 InvalidProviderToken.

Try logging the token when you get a 403 error and make sure it includes all the required parts.

Have you had a chance to test this code against any sends to APNS?

Yes, it works. I'm running the branch in production.

For testing in production, would you be looking for evidence that this fixes the problem, or evidence that this fix does no harm?

Evidence that it fixes the problem.

thisisjeffwong commented 2 years ago

A teammate was surprised that monotonic time would fix an issue of time discrepancy between servers since servers don't share a monotonic clock.

I referred to this explanation: https://blog.dnsimple.com/2018/03/elapsed-time-with-ruby-the-right-way/

Is this close to the reasoning that motivated your change?

benubois commented 2 years ago

Yes, the monotonic change is just about preventing time related issues only on your server. I don't think this is about a time discrepancy between servers.

What's the exact length of time that the error persists?

thisisjeffwong commented 2 years ago

It lasted only 25 minutes before disappearing. Only a tiny fraction of that server's sends to APNS errored out during that time.