medic / cht-core

The CHT Core Framework makes it faster to build responsive, offline-first digital health apps that equip health workers to provide better care in their communities. It is a central resource of the Community Health Toolkit.
https://communityhealthtoolkit.org
GNU Affero General Public License v3.0
467 stars 217 forks source link

Race condition in service worker update #7375

Closed dianabarsan closed 2 days ago

dianabarsan commented 3 years ago

Describe the bug Service worker update could happen outside of our changes listener callback: https://github.com/medic/cht-core/blob/master/webapp/src/ts/services/update-service-worker.service.ts#L24 if the browser requests the service worker resource by itself (for example, immediately after a refresh). If a service worker update is triggered before the new service worker is activated, activation is delayed. The client doesn't get a popup to reload and the new service worker is stuck in waiting.

To Reproduce Very hard to reproduce. I noticed it as an infrequent (1 in 50) failure of one of the new service worker e2e tests. To figure out what was going on, I implemented a broadcast channel between the service worker and the app, and had the service worker broadcast when it started an install, after it cached a resource, after a successful install and after activation. I also added some debug logs when the app receives the service-worker-meta change and on the state change listener in update-service-worker.service.

Logs when race condition is not triggered

"Replication started after 1 seconds since previous attempt"
GOT SW META doc change "{"id":"service-worker-meta","changes":[...],"doc":{ ... },"seq":40}"
"registration.active = [object ServiceWorker]"
"registration.installing = null"
"registration.waiting = null"
"Event: onupdatefound"
"State change: installing"
"Broadcast Received { "title":"installing" }"
"Replication succeeded after 0 seconds"
"Broadcast Received {"title":"Request for http://localhost:4988/medic/login?_sw-precache=<...> returned a response with status 200"}"
"Broadcast Received {"title":"caching"}"
"Broadcast Received {"title":"install success"}"
"Broadcast Received {"title":"skip waiting"}"
"Service worker state changed to installed!"
"Service worker state changed to activating!"
"Broadcast Received {"title":"activate"}"
"Broadcast Received {"title":"activated"}"
"New service worker activated"

Logs when race condition is triggered

"Broadcast Received {"title":"installing"}"
"Replication started after 0 seconds since previous attempt"
"Broadcast Received {"title":"Request for http://localhost:4988/medic/login?_sw-precache=afae50e46928b7da5810bc0113bc0f68 returned a response with status 200"}"
"Broadcast Received {"title":"caching"}"
"Broadcast Received {"title":"install success"}"
"Broadcast Received {"title":"skip waiting"}"
GOT SW META doc change  "{"id":"service-worker-meta","changes":[ ... ],"doc":{ ... },"seq":43}"
"Event: onupdatefound"
"registration.active = [object ServiceWorker]"
"registration.installing = null"
"registration.waiting = [object ServiceWorker]"
"Replication succeeded after 0 seconds", 
... 
SILENCE
....

When the upgrade "works", the steps are:

  1. service worker is updated API side
  2. replication
  3. got a change for sw_meta_doc
  4. we add a listener to onupdatefound that will show the modal after the service worker is activated
  5. we force a service worker update
  6. browser requests service-worker.js
  7. installs the new service worker
  8. new service worker is waiting
  9. activates the new service worker
  10. user is notified with popup

When the upgrade doesn't "work":

  1. service worker is updated API side
  2. browser requests service-worker.js (because of refresh or anything else)
  3. installs the new service woker
  4. new service worker is in waiting state
  5. replication
  6. got a change for sw_meta_doc
  7. we add a listener to onupdatefound that will show the modal after the service worker is activated
  8. we force a service worker update
  9. there is nothing to update, because we have the latest service worker version, and it's waiting to be activated.
  10. onupdatefound is never triggered
  11. client never gets the popup
  12. new service worker is waiting

For service worker lifecycle, refer to: https://developers.google.com/web/fundamentals/primers/service-workers/lifecycle#waiting The e2e test that failed waited for 15 seconds and then refreshed the page, expecting to see updated cached content, and failed because the previous content was served.

Expected behavior We should prompt the user when a new service worker is registered and activated. Otherwise, we risk the app running old code over new settings. We should probably treat the case of a new service worker already waiting separately in the UpdateServiceWorker service.

Environment

Additional context

latin-panda commented 2 years ago

Adding some reproduction steps using CHT Android, that could be related with this issue:

  1. Close the app.
  2. Make a change in CHT Core, for example, add a element with some text in the Message module's HTML.
  3. Wait until CHT Core finishes building and deploy.
  4. Open the app.
  5. A loading spiner appears and disappear fast.
  6. Manually sync.
  7. Do nothing else until sync finishes successfully
  8. Observe that after minutes (I waited like 5min) the update modal didn't appear, and the UI changes aren't visible yet.
  9. Close the app
  10. Open the app again.
  11. Observe that UI changes are visible now.

Note: Step # 1 is important to reproduce the issue, the app should be closed before changing CHT Core and open the app only after changes are deployed.

Expected behaviour: After sync finishes successfully in step 7, the update modal should appear in step 8, so user can click on "reload" and get the latest changes.

Tested in: CHT Core: 3.14.2 (running local in Mac) CHT Android: 0.11.0 Webview Phone: Samsung S10+ with Android 12.

I get the same issue when testing in: CHT Android: latest master as of 25th March 2022. Phones: Nokia 8 with Android 9 and Samsung S10+ with Android 12.

garethbowen commented 2 months ago

@dianabarsan I think we should prioritise fixing this to help stabalise the build (recent failures documented in #8936 ). Is there a quick fix like adding retries or something?

dianabarsan commented 2 months ago

I agree. I'm not sure about a quick fix, I think we have to rethink service worker upgrades somehow.