microsoft / playwright

Playwright is a framework for Web Testing and Automation. It allows testing Chromium, Firefox and WebKit with a single API.
https://playwright.dev
Apache License 2.0
66.92k stars 3.67k forks source link

[Feature]: Graceful shutdown of webservers #33377

Open SimonSimCity opened 1 week ago

SimonSimCity commented 1 week ago

🚀 Feature Request

I want a graceful shutdown for webservers (the process should be sent a SIGINT signal).

Example

No response

Motivation

For my e2e testing, which are full system tests, I want to spin up the frontend, backend and a database. For the database I want to use a command like docker run --rm -a stdout -a stderr --name ecorize-e2e-db -e POSTGRES_PASSWORD=mysecretpassword -p 5432:5432 postgres.

The current approach of killing the process has the downside that the clean-up action cannot be executed (https://docs.docker.com/reference/cli/docker/container/run/#rm).

I'm currently using concurrently to realize this. The downside there is that it's harder for me to set up different environment variables per process.

Skn0tt commented 1 week ago

Greetings from one Simon to another. This is a wonderful idea, thanks for opening the feature request! Implementing in https://github.com/microsoft/playwright/pull/33379.

mxschmitt commented 1 week ago

Short investigation: Looks like as of today we launch it in a new process group and send SIGKILL to the whole process Group. This can be reflected on that screenshot:

Image

We don't know yet why Docker doesn't shut the containers down with that signal. We'd love to make it work for Docker including docker, docker-compose and docker compose. We might need to add an opt-in flag in order to not break existing scenarios. Maybe also a new manually set default for create-playwright.

cc @Skn0tt

SimonSimCity commented 1 week ago

We don't know yet why Docker doesn't shut the containers down with that signal.

I don't know whether I understand correctly what you mean here. SIGKILL is a command which does not allow a process to do any work (there seem to be minor exceptions - https://unix.stackexchange.com/questions/5642/what-if-kill-9-does-not-work) - at least no non-kernel code.

The --rm flag on my docker command is one example of such a cleanup process which cannot run once the process or one of its parent received SIGKILL.

If you don't want to kill the process immediately but allow it to shut down gracefully, you might take a look at other kill-related signals first: https://unix.stackexchange.com/a/251267/19157

Docker for example when stopping a container first issues SIGTERM and waits for the process to exit. If it doesn't, it by default sends SIGKILL after 10 sec: https://docs.docker.com/reference/cli/docker/container/stop/

Skn0tt commented 1 week ago

Some more context from the team call for my own notes:

docker compose cleans up properly after itself when kill9'd. But when Playwright SIGKILLs it, the containers aren't cleaned up. Let's find out what's the difference there - maybe docker compose behaves differently when it's in a TTY vs child process?

Sending a SIGINT before SIGKILL is fine, but we need to make the timeout configurable.

Skn0tt commented 1 week ago

I've looked into how Docker behaves when the process is killed. Once the docker run --rm process is stopped, there two kinds of cleanup that should have happened:

  1. the spawned container shouldn't be running anymore. this is standard docker run behaviour.
  2. anonymous volumes created by the container should be removed. this should only happen because of --rm.

In my testing, i'm seeing 1. happen regardless of wether the client process is shutdown forcefully or with grace. It seems this is implemented inside the Docker daemon, so it works even though the client process isn't around after SIGKILL.

The 2. works on graceful shutdown, but not when it's forced. I'm assuming that volume removal is implemented in the client.


My takeaway from this is that we should give the webServer processes time to shut down gracefully, so that cleanup can happen even for processes without a daemon.

There's no easy way for us to detect if a process respects SIGINT though. My open PR sends a SIGINT, and if the process doesn't exit within a timeout that defaults to 500ms, it sends the usual SIGKILL.

500ms is short, but should be enough for most local dev servers I believe. For dev servers that don't respect SIGINT, it means that the shutdown is only 500ms delayed, not more.

Skn0tt commented 1 week ago

docker compose cleans up properly after itself when kill9'd. But when Playwright SIGKILLs it, the containers aren't cleaned up. Let's find out what's the difference there - maybe docker compose behaves differently when it's in a TTY vs child process?

docker compose is another special case, because it spawns a docker-compose child process under the hood. This is an implementation detail of docker compose, but it has an interesting effect:

When you kill -9 <pid> the parent process, the child process detects that it orphaned and initiates a graceful cleanup. But Playwright kills the entire process group (kill -9 -<pid>). So the child process also gets killed, and there's nothing left to perform the cleanup.

Skn0tt commented 1 week ago

To illustrate the above, try the following:

In examples/todomvc, add a webserver with docker compose up and place the following in docker-compose.yml:

services:
  db:
    image: postgres
    environment:
      POSTGRES_PASSWORD: test

Now run npx playwright test integration:19 --project chromium and check docker ps. You'll see a dangling database container. Clean it up manually with docker compose down. Now remove the - in processLauncher to kill only the top process, not the entire process group:

https://github.com/microsoft/playwright/blob/26a7dd4dd468bd15e0aa51dd5f190b7b72d8df4c/packages/playwright-core/src/utils/processLauncher.ts#L241

Repeat the test, and you'll see that the container got cleaned up correctly.

SimonSimCity commented 6 days ago

I found another resource confirming that SIGKILL is handled by the kennel, and therefore the application cannot run any code anymore (except for finishing a kernel code segment) https://unix.stackexchange.com/a/485657/21160

To the kill -9 <PID> I read that children will be sent SIGHUP (https://stackoverflow.com/a/21869303/517914). In the previous link, there's also a section saying that orphaned children are automatically adopted by PID 1.

Postgres will stop when receiving SIGHUP. https://www.postgresql.org/docs/7.0/signals.htm

I don't know by heart which process is a child of a docker compose command, but if it's the PID 1 inside the container, it would be interesting to know what happens when running a command which ignores those signals - like busybox/sh: https://stackoverflow.com/a/60537655/517914

That said, it's probably best to call a SIGTERM, on the application playwright is spawning, wait for the program to exit, and call SIGKILL including child processes if it doesn't do it in time.

SIGINT doesn't seem to be a good choice here, as it's meant for the program to interrupt the current procedure and wait for further instructions, which might not be what you want. Interactive shells might remain open. https://unix.stackexchange.com/a/251267/19157 Bash seems to stop scripts and loops on SIGINT, but seems to ignore it otherwise: https://www.gnu.org/software/bash/manual/html_node/Signals.html