Clarification on what happens during a terminated activation

josephliccini commented 5 years ago

Hi!

Across browsers, we are seeing different behaviors for a terminated activation.

For Chrome: If a worker is terminated during activation, it is placed back into the 'activating' state on next browser load. The activation handlers are re-run.

For FireFox: If a worker is terminated during activation, the worker is discarded, and re-installation is required.

For Microsoft Edge: If a worker is terminated during activation, the worker (on next browser load) is placed into the 'activated' state. The activation handlers are not re-run.

From the spec: https://w3c.github.io/ServiceWorker/#activation-algorithm

Note: Once an active worker is activating, neither a runtime script error nor a force termination of the active worker prevents the active worker from getting activated.

Chrome's behavior seems to be what I'd expect as a developer, but I don't think that it's well-reflected in the spec.

Thoughts?

Thanks!

jakearchibald commented 5 years ago

I think the Edge behaviour is correct here. I wouldn't expect the activate handler to run more than once.

wanderview commented 5 years ago

But if the activate event did not complete, isn't it possible you will proceed with unexpected partial state? That was the reason why we chose to reinstall in firefox. I think just re-running activate event like chrome might be better to avoid using partial state with the old worker, though.

josephliccini commented 5 years ago

I agree with @wanderview -- the issue is the unexpected partial state.

For a concrete example, let's consider the Workbox precaching activation flow.

During installation, the assets are downloaded into a 'temp' cache.

During activation, the assets are transferred from the 'temp' cache into the 'active' cache.

A workbox service worker only serves assets from the 'active' cache.

If activation does not fully complete, but we move the worker into the active slot, some assets will be un-servable from the 'active' cache, because they did not complete moving from the 'temp' cache.

It would mean that all 'activate' handlers need to be built in a way that is compatible with a partial, non-deterministic completion point. This is possible, but needs to be clearly documented in the spec

mfalken commented 5 years ago

This was discussed in #447, #659, #776. The design decision was that "fail to activate" could never happen.

I'm curious how you saw the behavior in Chrome. Chrome's intent is to implement that decision. In most cases, Chrome will not re-run the activate event handler.

BUT. We did run into one problem. During browser shutdown, Chrome closes each tab one-by-one in quick succession. We were seeing activation get triggered right at shutdown, and of course not finishing. The end result was a large % of service workers were activated who never properly finished their activate event handler, which seemed quite unfortunate.

Our hacky solution (since we don't know whether we're in shutdown or not at the layer we trigger activation in) was to delay activation by about 1 sec if it's triggered due to tab close. Then we try to activate. I think we also added an condition that if activation failed due to shutdown (we can sometimes detect it at that point), that the activation attempt is forgotten and it'd run it on the next worker startup.

The better solution is probably to detect shutdown before trying to activate, but we haven't done that.

EDIT: Most discussion was on #659.

josephliccini commented 5 years ago

Yes, the issue is around browser shutdown. That is where I have been seeing the unexpected partial state in Edge.

It's this behavior in Chrome I was addressing in the original issue:

I think we also added an condition that if activation failed due to shutdown (we can sometimes detect it at that point), that the activation attempt is forgotten and it'd run it on the next worker startup.

Perhaps the activate handlers are not being re-run, but I am running into this behavior you described (or activation is being properly delayed through your 1s timeout)

Maybe this behavior and the timeout should be standard? Otherwise it seems browser shutdown during activation is going to cause issues for all apps using a service worker, depending on a predictable activated state.

mfalken commented 5 years ago

I think probably Chrome should learn when it's in shutdown and not run activate in that case. I'd like us to remove that hack, especially since you noticed this.

The spec could likely say something like the browser may delay running activate if it expects it will not be able to complete, e.g., due to browser shutdown.

Other than browser shutdown, not finishing activate is a bit of an edge case and we can continue with the decision in #115, #447, #659, #776, unless we have new information to overturn that.

makotoshimazu commented 5 years ago

JFYI, regarding to the 1s timeout hack, I think it now happens on all activation unfortunately for addressing https://crbug.com/879069 with taking care of the browser termination case.

josephliccini commented 5 years ago

@mattto I agree with you that terminated activation is an edge case. I think it's a tricky one because the browser is unable to really run code to respond to a terminated activation because the process becomes unavailable.

What is the next step for pitching this for inclusion in the spec?

jeffposnick commented 5 years ago

👋 from the Workbox team.

We just ran into what looks to be someone running into inconsistent cache states due to Firefox failing activation (for reasons unknown) causing the service worker to re-install. Workbox currently should play nicely with Chrome's current behavior around a failed activation (re-firing the activate event) and now that we're aware that it's an issue, we can also adapt to Firefox's current behavior (starting from scratch with a new install event).

What I'm concerned about is that I'm not sure how we can deal with Edge's current behavior of not re-firing either activate or install (followed by activate), especially since it sounds like that's the "correct" behavior that might be standardized. If Edge's behavior were implemented everywhere, would we just have to check for temporary cache entries that need to be cleaned up each time the service worker starts up?

mfalken commented 5 years ago

Well backing up a bit, this issue is only about activate and not install. install definitely has a notion of failing. If install failed to finish successfully, that worker will be discarded instead of ever becoming active. activate has no (in the spec) notion of failing. Once activate starts, that worker is already the active worker and is committed to reaching the activated state and start receiving events.

If Edge's behavior were implemented everywhere, would we just have to check for temporary cache entries that need to be cleaned up each time the service worker starts up?

Yes, I think that's right. The Edge behavior is implemented by Chrome too (at least, that's the intent), and is the currently specified behavior. It's only when activation was interrupted by browser shutdown that Chrome might re-run the event handler. So I think sites already needed to deal with the Edge behavior.

My plan is to teach Chrome to not start the activate handler if browser shutdown is what triggered activation (i.e., by closing the last tab that was using the incumbent worker). There is not much spec work required, except perhaps to add some text that browsers might delay activate if they don't expect to be able to finish the event, e.g, because it's during browser shutdown.

jeffposnick commented 5 years ago

It's only when activation was interrupted by browser shutdown that Chrome might re-run the event handler.

What if Chrome's service worker process/thread crashes during activation, due to something outside the control of the JavaScript code being executed during the activate handler(s)? That's what I (think) I'm seeing with Firefox and in that scenario, as described above, Firefox ends up discarding the service worker and going back to install on the next navigation.

We're talking about edge cases here, but those are the most fun to code around, and from the Workbox side of things, we'd like to try to accommodate as many failure scenarios as is viable.

mfalken commented 5 years ago

What if Chrome's service worker process/thread crashes during activation, due to something outside the control of the JavaScript code being executed during the activate handler(s)?

Good question! Chrome's algorithm is roughly: 1) Browser asks the renderer to dispatch the activate event handler to the service worker. 2) Browser waits for the renderer to report back that event handler finished. This includes detecting that the renderer died or the worker thread died, if the connection is gone the browser knows the event is done. Also the timeout mechanism can forcefully abort the event handler and declare it finished. 3) Once it finished, the browser writes "activated" to disk.

As an exception in step 3 if the browser detects it's shutting down, it doesn't write "activated" to disk and will retry the next time the worker is used in some browser session.

This mostly matches the spec, except the spec doesn't account for crashing/shutdown between dispatching the activate event and running Update Worker State to activated in https://w3c.github.io/ServiceWorker/#activation-algorithm.

I'd like to change Chrome to remove the shutdown exception and instead only run the activate event if we're not in shutdown already. I could shuffle where we write "activated" to disk but either way there can be an edge case due to this cross-process model. If the browser process writes "activated" once it asks the renderer to dispatch the event, then a browser crash/shutdown occurs, the worker can become "activated" without activate ever firing from the POV of the developer. If the browser process writes "activated" after the event finishes, and a browser crash/shutdown occurs before that, the worker can get an activate event again in the next browser session.

mfalken commented 5 years ago

I guess in summary there's no hard guarantees about activate due to the design that "there's no such thing as failing to activate".

jeffposnick commented 5 years ago

All good to know. Thanks!

(On the Workbox side of things, we've got a plan in https://github.com/GoogleChrome/workbox/issues/1793 to move any substantive work out of activate.)

josephliccini commented 5 years ago

Does it make sense for the spec to have a note about this?

Something like:

Note: Make sure to design your activation handlers to do only non-essential work (like cleanup). This is because activation handlers may not always run to completion, especially in cases of browser termination during activation.

There's probably a better way to say it, but just having a note like this in the spec would help developers and library authors when designing their service worker activation handlers.

mfalken commented 5 years ago

That's a great idea. Agreed.

josephliccini commented 5 years ago

Finally got around to adding this note to the spec. Opened a PR #1451

makotoshimazu commented 5 years ago

Regarding to https://github.com/w3c/ServiceWorker/issues/1372#issuecomment-448446407, I'm wondering if we should explicitly allow retrying activate event when unexpected failure of activate event (e.g. browser shutdown, renderer crashes). So far, we don't have a safe place to manipulate a chunk of persistent data atomically if we allow activate event termination. Any JS-observable exceptions seem to be out of the scope, but more transaction-like activate event could be more useful and easy to use. What do you think?

jakearchibald commented 5 years ago

Pre TPAC thoughts:

We need to decide what behaviour we want here.
We decided that activate rejection shouldn't roll back to the old service worker, since caches have now been updated.
However, if activate doesn't run to completion, maybe we should rerun it?

wanderview commented 5 years ago

We could have some attribute on the event set to true if its a re-run. Might make it slightly easier for sites to handle?

asutherland commented 5 years ago

Re-run with retrying/similar attribute sounds good.

We should also consider adding something like BackgroundSync's SyncEvent's lastChance attribute as it's unlikely an implementation would want to perform an infinite number of retries and it might help non-bad-actor ServiceWorkers report in that they're having trouble and/or have them skip inessential steps. Bad-actors would presumably just be mitigated by only re-dispatching "activate" when the ServiceWorker would be eligible for having an event dispatched at it normally and maintaining existing lifetime limits based on that.

asutherland commented 3 years ago

Firefox needs to change its behavior here as part of addressing a shutdown data race that's similar to what @mfalken described in https://github.com/w3c/ServiceWorker/issues/1372#issuecomment-444719783. Is there interest in moving forward with explicitly standardizing re-run logic as well as potentially exposing retrying and lastChance?

asutherland commented 3 years ago

For the time being Firefox/Gecko is planning to move to:

Inhibit the "activate" event during shutdown, leaving the ServiceWorker in what amounts to its "activating" state where (relevant) functional events are delayed until the ServiceWorker becomes active.
The next time a functional event is dispatched at the ServiceWorker, the "activate" will be dispatched. As per the above, the (relevant) functional events will be enqueued until activation completes.
- The general intent here is to conform to the behavior discussed around background fetch where a ServiceWorker will not be woken up if there isn't a controlled document involved.
- For the quite common pinned-tab use-case this means that activation will happen when the browser next starts.
No retries at this time, although I'll probably structure the metadata so this is possible. (Since the choice is to have the activation happen at startup and thereby block page load, it would be undesirable to open the door to the page becoming permanently broken due to a very long-running activation that keeps being interrupted by the user restarting the browser in the hopes it will make the page load.)

jakearchibald commented 3 years ago

@mfalken does Chrome ever 're-run' in an observable way? As in, is the 'activate' event listener called twice for a single service worker?

mfalken commented 3 years ago

Sorry for the delay. Yes I think it happens if an event is dispatched during shutdown, we will dispatch the event again on the next browser session that uses the service worker.

mangelozzi commented 2 years ago

So is activate callback to say:

The service worker is now activated, is there any code you want to run? If it raises an error then theres no consequence to the service worker still being active. (This is what I think the thread is leaning towards).
Or is it.. Now is your change to activate the service worker, if its fails the service worker will be discarded?

w3c / ServiceWorker

Clarification on what happens during a terminated activation #1372