Race condition in service worker update

dianabarsan commented 3 years ago

Describe the bug Service worker update could happen outside of our changes listener callback: https://github.com/medic/cht-core/blob/master/webapp/src/ts/services/update-service-worker.service.ts#L24 if the browser requests the service worker resource by itself (for example, immediately after a refresh). If a service worker update is triggered before the new service worker is activated, activation is delayed. The client doesn't get a popup to reload and the new service worker is stuck in waiting.

To Reproduce Very hard to reproduce. I noticed it as an infrequent (1 in 50) failure of one of the new service worker e2e tests. To figure out what was going on, I implemented a broadcast channel between the service worker and the app, and had the service worker broadcast when it started an install, after it cached a resource, after a successful install and after activation. I also added some debug logs when the app receives the service-worker-meta change and on the state change listener in update-service-worker.service.

Logs when race condition is not triggered

"Replication started after 1 seconds since previous attempt"
GOT SW META doc change "{"id":"service-worker-meta","changes":[...],"doc":{ ... },"seq":40}"
"registration.active = [object ServiceWorker]"
"registration.installing = null"
"registration.waiting = null"
"Event: onupdatefound"
"State change: installing"
"Broadcast Received { "title":"installing" }"
"Replication succeeded after 0 seconds"
"Broadcast Received {"title":"Request for http://localhost:4988/medic/login?_sw-precache=<...> returned a response with status 200"}"
"Broadcast Received {"title":"caching"}"
"Broadcast Received {"title":"install success"}"
"Broadcast Received {"title":"skip waiting"}"
"Service worker state changed to installed!"
"Service worker state changed to activating!"
"Broadcast Received {"title":"activate"}"
"Broadcast Received {"title":"activated"}"
"New service worker activated"

Logs when race condition is triggered

"Broadcast Received {"title":"installing"}"
"Replication started after 0 seconds since previous attempt"
"Broadcast Received {"title":"Request for http://localhost:4988/medic/login?_sw-precache=afae50e46928b7da5810bc0113bc0f68 returned a response with status 200"}"
"Broadcast Received {"title":"caching"}"
"Broadcast Received {"title":"install success"}"
"Broadcast Received {"title":"skip waiting"}"
GOT SW META doc change  "{"id":"service-worker-meta","changes":[ ... ],"doc":{ ... },"seq":43}"
"Event: onupdatefound"
"registration.active = [object ServiceWorker]"
"registration.installing = null"
"registration.waiting = [object ServiceWorker]"
"Replication succeeded after 0 seconds", 
... 
SILENCE
....

When the upgrade "works", the steps are:

service worker is updated API side
replication
got a change for sw_meta_doc
we add a listener to onupdatefound that will show the modal after the service worker is activated
we force a service worker update
browser requests service-worker.js
installs the new service worker
new service worker is waiting
activates the new service worker
user is notified with popup

When the upgrade doesn't "work":

service worker is updated API side
browser requests service-worker.js (because of refresh or anything else)
installs the new service woker
new service worker is in waiting state
replication
got a change for sw_meta_doc
we add a listener to onupdatefound that will show the modal after the service worker is activated
we force a service worker update
there is nothing to update, because we have the latest service worker version, and it's waiting to be activated.
onupdatefound is never triggered
client never gets the popup
new service worker is waiting

For service worker lifecycle, refer to: https://developers.google.com/web/fundamentals/primers/service-workers/lifecycle#waiting The e2e test that failed waited for 15 seconds and then refreshed the page, expecting to see updated cached content, and failed because the previous content was served.

Expected behavior We should prompt the user when a new service worker is registered and activated. Otherwise, we risk the app running old code over new settings. We should probably treat the case of a new service worker already waiting separately in the UpdateServiceWorker service.

Environment

Instance: local
Browser: Chrome
App: webapp
Version: 3.14

Additional context

latin-panda commented 2 years ago

Adding some reproduction steps using CHT Android, that could be related with this issue:

Close the app.
Make a change in CHT Core, for example, add a element with some text in the Message module's HTML.
Wait until CHT Core finishes building and deploy.
Open the app.
A loading spiner appears and disappear fast.
Manually sync.
Do nothing else until sync finishes successfully
Observe that after minutes (I waited like 5min) the update modal didn't appear, and the UI changes aren't visible yet.
Close the app
Open the app again.
Observe that UI changes are visible now.

Note: Step # 1 is important to reproduce the issue, the app should be closed before changing CHT Core and open the app only after changes are deployed.

Expected behaviour: After sync finishes successfully in step 7, the update modal should appear in step 8, so user can click on "reload" and get the latest changes.

Tested in: CHT Core: 3.14.2 (running local in Mac) CHT Android: 0.11.0 Webview Phone: Samsung S10+ with Android 12.

I get the same issue when testing in: CHT Android: latest master as of 25th March 2022. Phones: Nokia 8 with Android 9 and Samsung S10+ with Android 12.

garethbowen commented 2 months ago

@dianabarsan I think we should prioritise fixing this to help stabalise the build (recent failures documented in #8936 ). Is there a quick fix like adding retries or something?

dianabarsan commented 2 months ago

I agree. I'm not sure about a quick fix, I think we have to rethink service worker upgrades somehow.

medic / cht-core

Race condition in service worker update #7375