mozilla-services / autopush

Python Web Push Server used by Mozilla
https://autopush.readthedocs.io/
Mozilla Public License 2.0

Confirm whether uaids are being dropped on 404/410 from bridged servers. #1444

Open rfk opened 4 years ago

rfk commented 4 years ago

Based on this comment, it's my understanding that when the FCM server responds with a 404 or 410 status code, the intended behavior of the autopush server is to drop the corresponding uaid record and all its subscriptions. The logic for doing so lives in _router_fail_err here:

https://github.com/mozilla-services/autopush/blob/a459c882ec63ba5368f9c3b0648c084177b3a2ac/autopush/web/base.py#L336-L346
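To make the intended behavior concrete, here is a minimal sketch of what I'd expect to happen, assuming a simplified in-memory store. This is not the actual autopush code: `RouterError`, `InMemoryStore`, `handle_router_error`, and the metric tuple are all hypothetical names used only for illustration.

```python
# Hypothetical sketch of the intended drop-on-404/410 behavior.
# None of these names come from the autopush codebase.

GONE_STATUSES = {404, 410}


class RouterError(Exception):
    """Stand-in for an error raised when a bridge (e.g. FCM) rejects a message."""

    def __init__(self, status_code, uaid):
        super().__init__(f"bridge returned {status_code}")
        self.status_code = status_code
        self.uaid = uaid


class InMemoryStore:
    """Stand-in for the user table: maps uaid -> set of channel ids."""

    def __init__(self):
        self.users = {}

    def drop_user(self, uaid):
        # Removes the uaid record together with all of its subscriptions.
        # Returns True if a record was actually removed.
        return self.users.pop(uaid, None) is not None


def handle_router_error(store, metrics, err):
    """Drop the uaid and record a metric when the bridge says the recipient is gone."""
    if err.status_code in GONE_STATUSES:
        dropped = store.drop_user(err.uaid)
        # Assumed metric shape; the real event name is discussed below.
        metrics.append(("notification.bridge.error", "unregistered"))
        return dropped
    # Any other status leaves the record untouched.
    return False
```

The point of the sketch is that the drop and the "unregistered" metric happen together, so the absence of one in production suggests the absence of the other.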

It's not clear whether this logic is triggering correctly.

Based on FxA server logs, we're definitely seeing 404 and/or 410 responses when trying to send push messages to mobile clients, since FxA logs a specific "subscription expired" event in this case.

I also took a look in grafana for events of type autopush.notification.bridge.error[reason:recipient_gone], which would correspond to the FCMNotFoundError error type:

https://github.com/mozilla-services/autopush/blob/2f08e883ec0b6bee3e485a2be6587fe55fc1e025/autopush/router/fcm_v1.py#L177-L183

I am able to see a small but steady rate of such errors. So I think it's clear that such errors are in fact happening.

However...

If I look in grafana for events of type autopush.notification.bridge.error[reason:unregistered] as would be emitted alongside the drop_user call above, I do not see any events at all for platform:fcm. In fact the only instances of such an event are for platform:gcm, which may be coming from this different codepath that emits a similarly-named event.
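The comparison I'm making can be sketched roughly as follows. The event tuples here are made-up placeholders standing in for the grafana data, not real measurements, and `platforms_missing_drops` is a hypothetical helper name.

```python
from collections import Counter


def platforms_missing_drops(events):
    """Return platforms that report recipient_gone but never unregistered.

    `events` is an iterable of (platform, reason) tuples.
    """
    counts = Counter(events)
    platforms = {platform for platform, _ in counts}
    return sorted(
        p for p in platforms
        if counts[(p, "recipient_gone")] > 0
        and counts[(p, "unregistered")] == 0
    )


# Illustrative placeholder data only.
sample = [
    ("fcm", "recipient_gone"),
    ("fcm", "recipient_gone"),
    ("gcm", "unregistered"),
]
suspect = platforms_missing_drops(sample)
```

With placeholder data shaped like what I see in grafana, `fcm` is the platform flagged: it reports gone recipients but no matching drops.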

I also believe that the current appservices push component would fail if its uaid record were to be discarded by the server, since I can't find any codepaths that would recover from such a state. But we haven't observed any devices that seem to be in such a state in the wild.

So I'm wondering if the drop_user logic linked above is working correctly, or whether it might be failing to trigger in practice. The observed behaviour of mobile push clients in the wild suggests some instances where the autopush server believes a subscription is valid but the FxA server does not, and a failure to drop subscriptions on 404/410 could explain that.

rfk commented 3 years ago

> I also believe that the current appservices push component would fail if its uaid record were to be discarded by the server, since I can't find any codepaths that would recover from such a state. But we haven't observed any devices that seem to be in such a state in the wild.

Update: https://github.com/mozilla-services/autopush/issues/1445 seems to show evidence of what might be devices in such a state in the wild.

jrconlin commented 3 years ago

Looking at the autopush python code, it appears that we do not drop them.

jrconlin commented 3 years ago

It's worth noting that the newer rust version does drop these records.