Define behaviour of commands in terms of unexpected navigations, content process crashes etc.

gsnedders commented 5 years ago

Currently, various implementations have pretty useless behaviour in face of some errors (e.g., both GeckoDriver and ChromeDriver seem to return null for execute script if the content process crashes). We should better define what implementations should do, and probably define something distinguishable from potentially correct behaviour (given null from that and null from return null; are the same!).

There's a few cases here:

An implementation like ChromeDriver where the WebDriver end point communicates over another IPC channel to the browser itself (over CDP in that case), where there are multiple possible failure modes:
- the CDP connection being closed (and being unable to reconnect)
- the browser stopping responding to CDP requests (and being unable to reconnect and remake the request, if it's idempotent!)
- the browser responding to CDP requests but the content process not responding to the browser ("Aw, snap!")
An implementation where the browser is itself the end point (I'm not sure anyone does this any more?):
- the end point stops responding because the browser itself has crashed
- the content process isn't responding
An intermediary node losing its connection to the end point.

These are all general connectivity problems where the local end can still communicate with the first remote end, so we should define something here, as we still have a working WebDriver connection.

AutomatedTester commented 5 years ago

So there are a few issues we need to be aware of, which can be overcome by the remote end.

What happens if the "comms" thread crashes but the browser is still alive?
What happens when the main thread dies taking out the entire browser?

For the most part this leads to us making sure the process the end point is managing is alive or dead and giving a meaningful message. Ideally we don't want to define process management because that feels wrong (though I can't articulate why clearly so can be swayed).

This work would also probably need to be done for each of the commands as a crash can happen at any point.

Is this something that we could leave undefined for webdriver but focus on the work in WPT et al. for now and see if there is commonality after that work?

gsnedders commented 5 years ago

Ideally we don't want to define process management because that feels wrong (though I can't articulate why clearly so can be swayed).

Strongly agree. I don't think we want to tie anything to actual processes or anything. The failure modes I think we really want to expose, at most, are:

This window is unresponsive/not alive.
This session is unresponsive/not alive.

I think the specification should be clear about behaviour in all of the cases in the original post, but probably not by actually exhaustively listing all of them.

jugglinmike commented 5 years ago

Geckodriver 0.24.0 now returns "unknown error" if the crash occurs during Execute Async Script. Chromedriver 73.0.3683.68 still returns "success" with data null. I'll include a Bash script below in case that's helpful for researching this.

Is this something that we could leave undefined for webdriver but focus on the work in WPT et al. for now and see if there is commonality after that work?

Here's a workaround for one specific case of this in WPT's tooling:

https://github.com/web-platform-tests/wpt/pull/16157

`webdriver-crash.sh`

```bash #!/bin/bash set -e port=2323 webdriver_bin=$1 if [ ! -x "${webdriver_bin}" ]; then echo Usage: $0 PATH_TO_WEBDRIVER_EXECUTABLE >&2 exit 1 fi $webdriver_bin --port=${port} > /dev/null & webdriver_pid=$! function cleanup { if [ -n "${session_id}" ]; then curl \ -X DELETE \ --silent \ http://localhost:${port}/session/${session_id} > /dev/null fi if [ -n "${webdriver_pid}" ]; then kill -9 ${webdriver_pid} fi } trap cleanup EXIT sleep 1 session_id=$( curl \ --data '{"desiredCapabilities":{}}' \ --silent \ http://localhost:${port}/session | \ python -m json.tool | \ grep sessionId | \ sed -e 's/^.*://' -e 's/.*"$[^"]\+$".*/\1/' ) curl \ --verbose \ --data '{"script": "", "args": []}' \ http://localhost:${port}/session/${session_id}/execute/async & browser_pid=$(ps aux | \ grep -E 'chrom.*enable-automatio[n]|firefox.*marionett[e]' | \ grep -v renderer | \ awk '{print $2}') echo Browser process running with identifier $browser_pid. if kill -9 $browser_pid; then echo Process successfully terminated. else echo Unable to terminate process. fi ```

whimboo commented 3 years ago

As @AutomatedTester suggested on Matrix this issue might fit with our recent discussion at TPAC around unexpected navigations during command execution. See the full minutes here:

https://www.w3.org/2020/10/27-webdriver-minutes.html#t05

@shs96c given that you wanted to follow-up, do you think that's right? If yes, we can slightly modify the summary of this issue.

whimboo commented 3 years ago

I would like to see some action on this particular issue given that it somewhat interferes us when doing our site-isolation work. As discussed in the TPAC meeting a couple of commands should survive such a situation, and I assume that these are those commands that do not directly interact with the DOM like commands related to cookies, contexts (except switch to (parent) frame), timeouts, and session.

For commands that error out (like for actions) would we need a new error type, or are we trying to use a matching existent error if available, eg. for element interaction commands no such element? And if no such error exists we would fallback to a new one, eg. navigation/crash during Execute Script.

Note that with a page load strategy of None nearly any command that interacts with the DOM could actually fail when called right after a navigation has been initiated. The exact time when the DOM of the current page gets swapped with one of the target page is unknown and racy. In case of Marionette/geckodriver we can perfectly see it failing when waiting for an element, and an implicit timeout set.

Before defining the behavior for all the commands maybe we could get clarification first on the basic (error) handling as described above? I'm happy to contribute the required wdspec tests.

CC'ing @jgraham, @mjzffr, @burg, @JohnChen0

w3c / webdriver

Define behaviour of commands in terms of unexpected navigations, content process crashes etc. #1308