mozilla-services / autopush

Python Web Push Server used by Mozilla
https://autopush.readthedocs.io/
Mozilla Public License 2.0

bug: Add additional logging around APNS HTTP2 connectivity #1405

Closed: jrconlin closed this 4 years ago

jrconlin commented 4 years ago

Description

As Issue #1393 notes, sending push messages across APNS seems to work well right after a deploy, but then degrades after a week or so. #1394 is a possible work-around (double-pooling connections and using a dedicated connection terminator), but it's messy and hacky.

What's really needed is a bit more visibility into what may be happening, and that will involve logging all APNS communication exceptions reliably.
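As a rough illustration of what "logging all APNS communication exceptions reliably" could look like, here is a minimal sketch using only the standard library. It is not the code in this PR: `send_with_logging`, `APNSSendError`, and the `send` callable are hypothetical names standing in for whatever actually performs the HTTP/2 request to APNS.

```python
import logging

log = logging.getLogger("apns_sketch")


class APNSSendError(Exception):
    """Hypothetical wrapper raised after a failed send has been logged."""


def send_with_logging(send, payload, logger=log):
    """Call `send(payload)` and log any exception before re-raising.

    `send` is a stand-in for the callable that talks to APNS over
    HTTP/2; it is an assumption, not autopush's actual API.
    """
    try:
        return send(payload)
    except Exception as exc:
        # logger.exception logs at ERROR level and attaches the full
        # traceback, so connection-level failures are not swallowed.
        logger.exception(
            "APNS send failed: %s (%s)", type(exc).__name__, exc
        )
        # Chain the original exception so callers keep the root cause.
        raise APNSSendError(str(exc)) from exc
```

Catching the broad `Exception` here is deliberate for the sketch: the goal is visibility into every failure mode, including ones nobody anticipated, while re-raising keeps the existing error handling in charge of what happens next.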

Testing

No change in function, only additional logging added.

Issue(s)

Issue #1393

jdragojevic commented 4 years ago

@jrconlin when / where will we be able to see the result of these changes?

pjenvey commented 4 years ago

@jdragojevic We're aiming to deploy this next Tues. (with a small chance the following Tues depending on QA availability).

The release will add verbose APNS error logging to Sentry and fix up some of our related Grafana metrics.

This should give us insight into potential APNS failures that we don't currently have.

tublitzed commented 4 years ago

@jdragojevic - this went into the 1.56.0 release, which is now live in prod as of 6/30/20 at roughly 1pm EST.

@jrconlin can you add some more detailed information about where Janet and team should be looking for the changes you've made here?

jrconlin commented 4 years ago

These errors should now be reported up to Sentry when they start happening. Since this seems to be an error that occurs some time after deploy, I expect there to be no unusual errors until at least Jul 10. Sentry notifies us on new error types.

jrconlin commented 4 years ago

We expect that after 1.56.1 goes to prod, the metric tagging will be fixed for the autopush endpoint and our metric reporting will improve greatly. We expect 1.56.1 to go to production in the next day or two, depending on ops load.