mozilla-services / autopush

Python Web Push Server used by Mozilla
https://autopush.readthedocs.io/
Mozilla Public License 2.0
215 stars 34 forks source link

Re-subscribing to a chid that was recently unsubscribed gives code 410 errno 103 #1445

Open jonalmeida opened 3 years ago

jonalmeida commented 3 years ago

In Android Components, we recently landed a change that attempts to fix an expired FxA push subscription when the FxA server sets the flag subscriptionExpired.

The fix is to perform an "unsubscribe" and then a "subscribe" with the same chid. However, we're seeing logs where the subscribe fails with a code 410 errno 103 ("Expired URL endpoint") after a successful unsubscribe:

11-19 13:50:00.254 18810 18837 I ConstellationObserver: onDevicesUpdate triggered.
11-19 13:50:00.254 18810 18837 I ConstellationObserver: Proceeding to renew registration
11-19 13:50:00.254 18810 18837 I ConstellationObserver: We have been notified that our push subscription has expired; re-subscribing.
11-19 13:50:00.826 18810 18871 I ConstellationObserver: Un-subscribing successful: true
11-19 13:50:00.826 18810 18871 I ConstellationObserver: Subscribe call should give you a new endpoint.
11-19 13:50:00.827 18810 18837 I ConstellationObserver: Incrementing verifier
11-19 13:50:00.827 18810 18837 I ConstellationObserver: Verifier state before: timestamp=1605814852629, count=3
11-19 13:50:00.828 18810 18837 I ConstellationObserver: Verifier state after: timestamp=1605814852629, count=4
11-19 13:50:00.828 18810 18837 I FxaDeviceConstellation: Notifying observer about constellation updates.
11-19 13:50:00.828 18810 18837 I FxaDeviceConstellation: Executing: polling for device commands
11-19 13:50:00.947 18810 18871 W ConstellationObserver: Re-subscribing failed; FxA push events will not be received.
11-19 13:50:00.947 18810 18871 W ConstellationObserver: mozilla.appservices.push.CommunicationError: Error(Communication Error: "Unhandled client error 410 : \"{\\\"code\\\": 410, \\\"errno\\\": 103, \\\"error\\\": \\\"\\\", \\\"more_info\\\": \\\"http://autopush.readthedocs.io/en/latest/http.html#error-codes\\\", \\\"message\\\": \\\"Request did not validate UAID not found\\\"}\"")
11-19 13:50:00.947 18810 18871 W ConstellationObserver:     at mozilla.appservices.push.RustError.intoException(RustError.kt:13)
11-19 13:50:00.947 18810 18871 W ConstellationObserver:     at mozilla.appservices.push.PushManager.subscribe(PushManager.kt:13)
11-19 13:50:00.947 18810 18871 W ConstellationObserver:     at mozilla.components.feature.push.RustPushConnection.subscribe(Connection.kt:4)
11-19 13:50:00.947 18810 18871 W ConstellationObserver:     at mozilla.components.feature.push.AutoPushFeature$subscribe$$inlined$ifInitialized$lambda$2.invokeSuspend(AutoPushFeature.kt:5)
11-19 13:50:00.947 18810 18871 W ConstellationObserver:     at mozilla.components.feature.push.AutoPushFeature$subscribe$$inlined$ifInitialized$lambda$2.invoke(Unknown Source:10)
11-19 13:50:00.947 18810 18871 W ConstellationObserver:     at mozilla.components.feature.push.ext.CoroutineScopeKt$launchAndTry$2.invokeSuspend(CoroutineScope.kt:5)
11-19 13:50:00.947 18810 18871 W ConstellationObserver:     at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:3)
11-19 13:50:00.947 18810 18871 W ConstellationObserver:     at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:15)
11-19 13:50:00.947 18810 18871 W ConstellationObserver:     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
11-19 13:50:00.947 18810 18871 W ConstellationObserver:     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
11-19 13:50:00.947 18810 18871 W ConstellationObserver:     at java.lang.Thread.run(Thread.java:923)
11-19 13:50:00.994 18810 18837 I FxaDeviceConstellation: Successfully executed: polling for device commands

This is the code that is executed to show the logs above.

A breakdown of the logs:

Since we have received a result from the unsubscribe that it was successful, the subscribe call to the same chid should work as well provide us with a new endpoint.

Sentry (Nightly): https://sentry.prod.mozaws.net/operations/firefox-nightly/issues/10175063/

jonalmeida commented 3 years ago

From a user, it seems like we're hitting this in non-FxA applications as well: https://github.com/mozilla-mobile/fenix/issues/15028#issuecomment-736834347

rfk commented 3 years ago

I wonder if this is connected a little bit to the behaviour in https://github.com/mozilla-services/autopush/issues/1444, in which I wrote:

I also believe that the current appservices push component would fail if its uaid record were to be discarded by the server, since I can't find any codepaths that would recover from such a state. But we haven't observed any devices that seem to be in such a state in the wild.

Perhaps this is now evidence of "devices that seem to be in such a state in the wild", since "UAID not found" is exactly the sort of error message I'd expect for a client whose uaid record had been discarded by the server.

mhammond commented 3 years ago

FWIW, https://sql.telemetry.mozilla.org/queries/73067#183005 is showing send tab success nose-diving on Android - which prompted me to see how it works for me, and logcat shows what Ryan was getting at:

12-02 14:24:57.224 9884 9926 I ConstellationObserver: onDevicesUpdate triggered. 12-02 14:24:57.226 9884 9926 I ConstellationObserver: Proceeding to renew registration 12-02 14:24:57.226 9884 9926 I ConstellationObserver: We have been notified that our push subscription has expired; re-subscribing. ... 12-02 14:24:58.843 9884 9942 W ConstellationObserver: Un-subscribing failed; subscribe may return the same endpoint 12-02 14:24:58.843 9884 9942 W ConstellationObserver: java.lang.IllegalStateException: Un-subscribing with the native client failed. ... 12-02 14:24:59.057 9884 9942 W ConstellationObserver: Re-subscribing failed; FxA push events will not be received. 12-02 14:24:59.057 9884 9942 W ConstellationObserver: mozilla.appservices.push.CommunicationError: Error(Communication Error: "Unhandled client error 410 : \"{\\"code\\": 410, \\"errno\\": 103, \\"error\\": \\"\\", \\"more_info\\": \\"http://autopush.readthedocs.io/en/latest/http.html#error-codes\\", \\"message\\": \\"Request did not validate UAID not found\\"}\"")

jonalmeida commented 3 years ago

https://sql.telemetry.mozilla.org/queries/73067#183005 is showing send tab success nose-diving on Android

If this started happening in the last week or so, it's probably because on Android we see the subscriptionExpired flag on the FxA account and are trying to recover by unsubscribing first and then subscribing again to the same scope.

I'm guessing the send tab metrics may have been reporting it as a success but the messages never reached the device and silently dropped.

rfk commented 3 years ago

I'm guessing the send tab metrics may have been reporting it as a success

In theory, if the subscriptionExpired flag is set on the FxA server when you try to send a tab to that device, the FxA metrics will report that as a failure rather than a success.

rfk commented 3 years ago

"UAID not found" is exactly the sort of error message I'd expect for a client whose uaid record had been discarded by the server.

On the other hand, if the uaid record had been discarded by the server, I would expect the call to unsubscribe to fail; it's weird that calling unsubscribe succeeds but then calling subscribe immediately after it fails.

jrconlin commented 3 years ago

Yes, it's why I gave the bug the weight I did. There's going to be a bit of investigation to see exactly what's going on here and I'm not sure how easily I can replicate what's happened. The database appears to have gotten into a strange state.

jonalmeida commented 3 years ago

If it's helpful, we're tracking the exceptions from the unsubscribe here in nightly here: https://sentry.prod.mozaws.net/operations/firefox-nightly/issues/10175063/

rfk commented 3 years ago

On the other hand, if the uaid record had been discarded by the server, I would expect the call to unsubscribe to fail; it's weird that calling unsubscribe succeeds but then calling subscribe immediately after it fails.

The sentry breadcrumbs tell us that it is in fact the unsubscribe call that's failing, which makes more sense.

rfk commented 3 years ago

I believe it's the responsibility of the rust code to handle the "missing uaid" case better, and have filed https://github.com/mozilla/application-services/issues/3751 to follow up.

jrconlin commented 3 years ago

Doing some research on this, I am not able to reproduce it locally. One kind of dumb thought is that if the CHID isn't included and the client sends a DELETE https://.../v1/fcm/.../registration/{uaid} then the user's entire record is wiped and a 200 returned. Sending a subsequent, additional channel subscription request would fail because the record is no longer present. Not sure how that could happen, unless there's a builder function, since DELETE https://.../v1/fcm/.../registration/{uaid}/subscription/ would return a 404, but it's a thought.

In any case, I see that the number of subscription errors seems a lot lower recently.

jonalmeida commented 3 years ago

In any case, I see that the number of subscription errors seems a lot lower recently.

@jrconlin I left a comment on that Sentry exception that I'll copy here:

This exception will go back down to zero since we've backed out the change that invokes the bad state on the service side. We still need to find an appropriate fix to reintroduce the change or fix the FxA push subscription another way.

We reverted our changes that found the error during the holiday break since we didn't have a solution.

We do still have devices that are stuck in this state as well.

jrconlin commented 3 years ago

Yeah, I get that. FWIW, I was working with someone outside trying to resolve an issue and was seeing something kind of odd. I cataloged it here: https://github.com/mozilla-mobile/android-components/issues/9426 but I wonder if it might be related? Basically, i wonder if firefox might be getting an old FCM id for some weird reason, which would get rejected by FCM on the server side, leading to the 410. Kinda gets back to the whole "FCM is a weird black box" and I'd love to get away from it, but that's just not happening anytime soon.

TitanNano commented 3 years ago

I also have a device that is stuck with a 410 for every subscription. Let me know if you want to run any tests.

wurstsemmel commented 2 years ago

I am contributing here, as many related issues are closed with reference to this issue.

Samsung Galaxy S9 / Android 10 / OneUI 2.5

Please let me know if anything else is needed from my side or if I should file a new issue.

jrconlin commented 2 years ago

@wurstsemmel Thanks for this, but I think it would be better posted on the fenix repo.

This repo is the back-end service that sends the notifications to desktop and mobile devices. It would also be useful for the folk there if you could include the version of Firefox Android that you're using. ⋮ > Settings > About Firefox

Thanks!

yaakovfeldman commented 2 years ago

Is this the same bug https://github.com/mozilla-mobile/fenix/issues/19152 My device logs (which I have just attached to that issue) have the same message response has no uaid.

I am using Firebase Messaging Web library and had been repeatedly subscribing and toggling notification settings for testing, if that's relevant.

Happy to participate in any logging/debugging

mhammond commented 2 years ago

I don't think this is the same issue - if I'm reading the client code correctly, the check for a uaid happens after we've checked for a 401. A recent Fenix Nightly has some improvements here though (look for the about page listing an "AS" version of 86.2.0 or later)

jrconlin commented 2 years ago

I'd have to dig in to the code a bit. I can't think of a reason off hand that you should get a non-error registration response that doesn't have a UAID in it, unless there might be something that's causing the UAID value to be unreadable by the client (e.g. dashes can sometimes be a problem, if I recall. The code tries to return the same value format that it got, but something could be tripping it up.)