Open rfk opened 4 years ago
> I also believe that the current appservices push component would fail if its uaid record were to be discarded by the server, since I can't find any codepaths that would recover from such a state. But we haven't observed any devices that seem to be in such a state in the wild.
Update: https://github.com/mozilla-services/autopush/issues/1445 seems to show evidence of what might be devices in such a state in the wild.
Looking at the autopush python code, it appears that we do not drop them.
It's worth noting that the newer rust version does drop these records.
Based on this comment, it's my understanding that when the FCM server responds with a `404` or `410` status code, the intended behavior of the autopush server is to drop the corresponding uaid record and all its subscriptions. The logic for doing so lives in `_router_fail_err` here: https://github.com/mozilla-services/autopush/blob/a459c882ec63ba5368f9c3b0648c084177b3a2ac/autopush/web/base.py#L336-L346
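To make the intended behavior concrete, here is a minimal sketch of the "drop on `404`/`410`" rule as I understand it. This is not the autopush implementation: `FakeStore`, `handle_bridge_response`, and the plain-string metric labels are hypothetical names invented for illustration, and the real logic lives in the `_router_fail_err` code linked above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

# Status codes that, per the description above, mean the recipient is gone.
GONE_STATUSES = {404, 410}


@dataclass
class FakeStore:
    """Hypothetical in-memory stand-in for the server's uaid storage."""
    subscriptions: Dict[str, Set[str]] = field(default_factory=dict)

    def drop_user(self, uaid: str) -> bool:
        # Remove the uaid record together with all of its subscriptions.
        return self.subscriptions.pop(uaid, None) is not None


def handle_bridge_response(store: FakeStore, metrics: List[str],
                           uaid: str, status: int) -> bool:
    """Return True if the uaid record was dropped (illustrative only)."""
    if status not in GONE_STATUSES:
        return False
    # Assumed label for the "recipient gone" error event.
    metrics.append("recipient_gone")
    if store.drop_user(uaid):
        # Assumed label for the event emitted alongside the drop.
        metrics.append("unregistered")
        return True
    return False
```

If the server follows this rule, every "recipient gone" error for a known uaid should be accompanied by a matching "unregistered" event, which is the property checked against the metrics later in this issue.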
It's not clear whether this logic is triggering correctly.
Based on FxA server logs, we're definitely seeing `404` and/or `410` responses when trying to send push messages to mobile clients, since FxA logs a specific "subscription expired" event in this case.

I also took a look in grafana for events of type `autopush.notification.bridge.error[reason:recipient_gone]`, which would correspond to the `FCMNotFoundError` error type: https://github.com/mozilla-services/autopush/blob/2f08e883ec0b6bee3e485a2be6587fe55fc1e025/autopush/router/fcm_v1.py#L177-L183

I am able to see a small but steady rate of such errors. So I think it's clear that such errors are in fact happening.
However...
If I look in grafana for events of type `autopush.notification.bridge.error[reason:unregistered]` as would be emitted alongside the `drop_user` call above, I do not see any events at all for `platform:fcm`. In fact the only instances of such an event are for `platform:gcm`, which may be coming from this different codepath that emits a similarly-named event.

I also believe that the current appservices push component would fail if its uaid record were to be discarded by the server, since I can't find any codepaths that would recover from such a state. But we haven't observed any devices that seem to be in such a state in the wild.
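The metrics comparison above can be expressed as a small check. This is only a sketch under assumptions: the event records are hypothetical dictionaries with `reason` and `platform` keys (not an actual grafana or autopush export format), and it assumes, as a simplification, that each `recipient_gone` error should produce one `unregistered` event on the same platform.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def unmatched_drops(events: Iterable[Mapping[str, str]]) -> Dict[str, int]:
    """Per platform, count recipient_gone errors with no matching unregistered event."""
    counts = Counter((e["reason"], e["platform"]) for e in events)
    platforms = sorted({platform for _, platform in counts})
    result = {}
    for platform in platforms:
        gone = counts[("recipient_gone", platform)]
        dropped = counts[("unregistered", platform)]
        # A positive gap suggests the drop logic did not run for some errors.
        if gone > dropped:
            result[platform] = gone - dropped
    return result
```

With the pattern described above (a steady rate of `recipient_gone` for `fcm` and `unregistered` events only for `gcm`), this check would report a gap for `fcm` and nothing for `gcm`.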
So I'm wondering if the `drop_user` logic linked above is working correctly, or whether it might be failing to trigger in practice. The observed behaviour of mobile push clients in the wild suggests some instances where the autopush server believes a subscription is valid but the FxA server does not, and a failure to drop subscriptions on `404`/`410` could explain that.