Open josephliccini opened 5 years ago
I think the Edge behaviour is correct here. I wouldn't expect the activate handler to run more than once.
But if the activate event did not complete, isn't it possible you will proceed with unexpected partial state? That was the reason why we chose to reinstall in firefox. I think just re-running activate event like chrome might be better to avoid using partial state with the old worker, though.
I agree with @wanderview -- the issue is the unexpected partial state.
For a concrete example, let's consider the Workbox precaching activation flow.
During installation, the assets are downloaded into a 'temp' cache.
During activation, the assets are transferred from the 'temp' cache into the 'active' cache.
A workbox service worker only serves assets from the 'active' cache.
If activation does not fully complete, but we move the worker into the active slot, some assets will be un-servable from the 'active' cache, because they did not complete moving from the 'temp' cache.
It would mean that all 'activate' handlers need to be built in a way that is compatible with a partial, non-deterministic completion point. This is possible, but needs to be clearly documented in the spec
This was discussed in #447, #659, #776. The design decision was that "fail to activate" could never happen.
I'm curious how you saw the behavior in Chrome. Chrome's intent is to implement that decision. In most cases, Chrome will not re-run the activate event handler.
BUT. We did run into one problem. During browser shutdown, Chrome closes each tab one-by-one in quick succession. We were seeing activation get triggered right at shutdown, and of course not finishing. The end result was a large % of service workers were activated who never properly finished their activate event handler, which seemed quite unfortunate.
Our hacky solution (since we don't know whether we're in shutdown or not at the layer we trigger activation in) was to delay activation by about 1 sec if it's triggered due to tab close. Then we try to activate. I think we also added an condition that if activation failed due to shutdown (we can sometimes detect it at that point), that the activation attempt is forgotten and it'd run it on the next worker startup.
The better solution is probably to detect shutdown before trying to activate, but we haven't done that.
EDIT: Most discussion was on #659.
Yes, the issue is around browser shutdown. That is where I have been seeing the unexpected partial state in Edge.
It's this behavior in Chrome I was addressing in the original issue:
I think we also added an condition that if activation failed due to shutdown (we can sometimes detect it at that point), that the activation attempt is forgotten and it'd run it on the next worker startup.
Perhaps the activate handlers are not being re-run, but I am running into this behavior you described (or activation is being properly delayed through your 1s timeout)
Maybe this behavior and the timeout should be standard? Otherwise it seems browser shutdown during activation is going to cause issues for all apps using a service worker, depending on a predictable activated state.
I think probably Chrome should learn when it's in shutdown and not run activate in that case. I'd like us to remove that hack, especially since you noticed this.
The spec could likely say something like the browser may delay running activate if it expects it will not be able to complete, e.g., due to browser shutdown.
Other than browser shutdown, not finishing activate is a bit of an edge case and we can continue with the decision in #115, #447, #659, #776, unless we have new information to overturn that.
JFYI, regarding to the 1s timeout hack, I think it now happens on all activation unfortunately for addressing https://crbug.com/879069 with taking care of the browser termination case.
@mattto I agree with you that terminated activation is an edge case. I think it's a tricky one because the browser is unable to really run code to respond to a terminated activation because the process becomes unavailable.
What is the next step for pitching this for inclusion in the spec?
👋 from the Workbox team.
We just ran into what looks to be someone running into inconsistent cache states due to Firefox failing activation (for reasons unknown) causing the service worker to re-install. Workbox currently should play nicely with Chrome's current behavior around a failed activation (re-firing the activate
event) and now that we're aware that it's an issue, we can also adapt to Firefox's current behavior (starting from scratch with a new install
event).
What I'm concerned about is that I'm not sure how we can deal with Edge's current behavior of not re-firing either activate
or install
(followed by activate
), especially since it sounds like that's the "correct" behavior that might be standardized. If Edge's behavior were implemented everywhere, would we just have to check for temporary cache entries that need to be cleaned up each time the service worker starts up?
Well backing up a bit, this issue is only about activate
and not install
. install
definitely has a notion of failing. If install
failed to finish successfully, that worker will be discarded instead of ever becoming active. activate
has no (in the spec) notion of failing. Once activate
starts, that worker is already the active worker and is committed to reaching the activated
state and start receiving events.
If Edge's behavior were implemented everywhere, would we just have to check for temporary cache entries that need to be cleaned up each time the service worker starts up?
Yes, I think that's right. The Edge behavior is implemented by Chrome too (at least, that's the intent), and is the currently specified behavior. It's only when activation was interrupted by browser shutdown that Chrome might re-run the event handler. So I think sites already needed to deal with the Edge behavior.
My plan is to teach Chrome to not start the activate handler if browser shutdown is what triggered activation (i.e., by closing the last tab that was using the incumbent worker). There is not much spec work required, except perhaps to add some text that browsers might delay activate
if they don't expect to be able to finish the event, e.g, because it's during browser shutdown.
It's only when activation was interrupted by browser shutdown that Chrome might re-run the event handler.
What if Chrome's service worker process/thread crashes during activation, due to something outside the control of the JavaScript code being executed during the activate
handler(s)? That's what I (think) I'm seeing with Firefox and in that scenario, as described above, Firefox ends up discarding the service worker and going back to install
on the next navigation.
We're talking about edge cases here, but those are the most fun to code around, and from the Workbox side of things, we'd like to try to accommodate as many failure scenarios as is viable.
What if Chrome's service worker process/thread crashes during activation, due to something outside the control of the JavaScript code being executed during the activate handler(s)?
Good question! Chrome's algorithm is roughly: 1) Browser asks the renderer to dispatch the activate event handler to the service worker. 2) Browser waits for the renderer to report back that event handler finished. This includes detecting that the renderer died or the worker thread died, if the connection is gone the browser knows the event is done. Also the timeout mechanism can forcefully abort the event handler and declare it finished. 3) Once it finished, the browser writes "activated" to disk.
As an exception in step 3 if the browser detects it's shutting down, it doesn't write "activated" to disk and will retry the next time the worker is used in some browser session.
This mostly matches the spec, except the spec doesn't account for crashing/shutdown between dispatching the activate event and running Update Worker State to activated in https://w3c.github.io/ServiceWorker/#activation-algorithm.
I'd like to change Chrome to remove the shutdown exception and instead only run the activate event if we're not in shutdown already. I could shuffle where we write "activated" to disk but either way there can be an edge case due to this cross-process model. If the browser process writes "activated" once it asks the renderer to dispatch the event, then a browser crash/shutdown occurs, the worker can become "activated" without activate ever firing from the POV of the developer. If the browser process writes "activated" after the event finishes, and a browser crash/shutdown occurs before that, the worker can get an activate event again in the next browser session.
I guess in summary there's no hard guarantees about activate due to the design that "there's no such thing as failing to activate".
All good to know. Thanks!
(On the Workbox side of things, we've got a plan in https://github.com/GoogleChrome/workbox/issues/1793 to move any substantive work out of activate
.)
Does it make sense for the spec to have a note about this?
Something like:
Note: Make sure to design your activation handlers to do only non-essential work (like cleanup). This is because activation handlers may not always run to completion, especially in cases of browser termination during activation.
There's probably a better way to say it, but just having a note like this in the spec would help developers and library authors when designing their service worker activation handlers.
That's a great idea. Agreed.
Finally got around to adding this note to the spec. Opened a PR #1451
Regarding to https://github.com/w3c/ServiceWorker/issues/1372#issuecomment-448446407, I'm wondering if we should explicitly allow retrying activate event when unexpected failure of activate event (e.g. browser shutdown, renderer crashes). So far, we don't have a safe place to manipulate a chunk of persistent data atomically if we allow activate event termination. Any JS-observable exceptions seem to be out of the scope, but more transaction-like activate event could be more useful and easy to use. What do you think?
Pre TPAC thoughts:
We could have some attribute on the event set to true if its a re-run. Might make it slightly easier for sites to handle?
Re-run with retrying
/similar attribute sounds good.
We should also consider adding something like BackgroundSync's SyncEvent's lastChance
attribute as it's unlikely an implementation would want to perform an infinite number of retries and it might help non-bad-actor ServiceWorkers report in that they're having trouble and/or have them skip inessential steps. Bad-actors would presumably just be mitigated by only re-dispatching "activate" when the ServiceWorker would be eligible for having an event dispatched at it normally and maintaining existing lifetime limits based on that.
Firefox needs to change its behavior here as part of addressing a shutdown data race that's similar to what @mfalken described in https://github.com/w3c/ServiceWorker/issues/1372#issuecomment-444719783. Is there interest in moving forward with explicitly standardizing re-run logic as well as potentially exposing retrying
and lastChance
?
For the time being Firefox/Gecko is planning to move to:
@mfalken does Chrome ever 're-run' in an observable way? As in, is the 'activate' event listener called twice for a single service worker?
Sorry for the delay. Yes I think it happens if an event is dispatched during shutdown, we will dispatch the event again on the next browser session that uses the service worker.
So is activate
callback to say:
Hi!
Across browsers, we are seeing different behaviors for a terminated activation.
For Chrome: If a worker is terminated during activation, it is placed back into the 'activating' state on next browser load. The activation handlers are re-run.
For FireFox: If a worker is terminated during activation, the worker is discarded, and re-installation is required.
For Microsoft Edge: If a worker is terminated during activation, the worker (on next browser load) is placed into the 'activated' state. The activation handlers are not re-run.
From the spec: https://w3c.github.io/ServiceWorker/#activation-algorithm
Chrome's behavior seems to be what I'd expect as a developer, but I don't think that it's well-reflected in the spec.
Thoughts?
Thanks!