pantsbuild / pants

The Pants Build System
https://www.pantsbuild.org
Apache License 2.0
3.33k stars 638 forks source link

Implement graceful shutdown on Process invocations of tools #16878

Open huonw opened 2 years ago

huonw commented 2 years ago

[Orig title: './pants fmt ::' with black leaves orphaned multiprocessing processes when retried due to filesystem changes: - Benjy]

Describe the bug

We use pants on a moderate Python mono-repo. It's large enough that ./pants fmt :: takes a bit of time to run black. If we touch a file during that time (even without making changes, just updating the mtime), black retries (yay), but seems to leave a handful of multiprocessing background workers around forever, orphaned and never exiting. These processes build up and can exhaust machine resource limits.

(This is exacerbated by #16727, since that issue means that a repo may have black running in the background constantly, and thus makes it incredibly likely that some of those invocations will be interrupted.)

Reproducer: https://gist.github.com/huonw/4718b6d146634a66fe9c5e906d76e501

git clone git@gist.github.com:4718b6d146634a66fe9c5e906d76e501.git
cd 4718b6d146634a66fe9c5e906d76e501
./run-it.sh

That script creates a whole lot of files so that black runs for a while, and then touches a file while the ./pants fmt :: invocation runs. It runs pgrep -fl pants to show the pants processes running before/after. I ran pkill -fl pants first, and the output was something like:

generating a bunch of files so that black runs for a while
18:35:21.74 [INFO] Initializing scheduler...
18:35:21.85 [INFO] Scheduler initialized.
18:35:25.08 [WARN] Completed: Format with Black - black made changes.
  files/f001.py

+ black made changes.

all clean, just pantsd
16572 pantsd [/private/tmp/4718b6d146634a66fe9c5e906d76e501] PANTS_BIN_NAME=./pants PANTS_DAEMON_ENTRYPOINT=pants.pantsd.pants_daemon:launch_new_pantsd_instance PYTHONPATH=/Users/huon/.cache/pants/setup/bootstrap-Darwin-arm64/pants.WxzanY/install/bin:/Users/huon/.pyenv/versions/3.9.10/lib/python39.zip:/Users/huon/.pyenv/versions/3.9.10/lib/python3.9:/Users/huon/.pyenv/versions/3.9.10/lib/python3.9/lib-dynload:/Users/huon/.cache/pants/setup/bootstrap-Darwin-arm64/2.14.0rc0_py39/lib/python3.9/site-packages ptr_munge= main_stack=
18:35:28.69 [WARN] Completed: Format with Black - black made changes.
  files/f482.py
  files/f483.py
...
  files/f479.py
  files/f480.py
  files/f481.py

+ black made changes.

processes hanging around
16572 pantsd [/private/tmp/4718b6d146634a66fe9c5e906d76e501] PANTS_BIN_NAME=./pants PANTS_DAEMON_ENTRYPOINT=pants.pantsd.pants_daemon:launch_new_pantsd_instance PYTHONPATH=/Users/huon/.cache/pants/setup/bootstrap-Darwin-arm64/pants.WxzanY/install/bin:/Users/huon/.pyenv/versions/3.9.10/lib/python39.zip:/Users/huon/.pyenv/versions/3.9.10/lib/python3.9:/Users/huon/.pyenv/versions/3.9.10/lib/python3.9/lib-dynload:/Users/huon/.cache/pants/setup/bootstrap-Darwin-arm64/2.14.0rc0_py39/lib/python3.9/site-packages ptr_munge= main_stack=
17420 /Users/huon/.cache/pants/named_caches/pex_root/venvs/s/ee77ea11/venv/bin/python3.9 -E -s -c from multiprocessing.resource_tracker import main;main(14)
17423 /Users/huon/.cache/pants/named_caches/pex_root/venvs/s/ee77ea11/venv/bin/python3.9 -E -s -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=18, pipe_handle=38) --multiprocessing-fork
17424 /Users/huon/.cache/pants/named_caches/pex_root/venvs/s/ee77ea11/venv/bin/python3.9 -E -s -c from multiprocessing.resource_tracker import main;main(14)
17425 /Users/huon/.cache/pants/named_caches/pex_root/venvs/s/ee77ea11/venv/bin/python3.9 -E -s -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=18, pipe_handle=43) --multiprocessing-fork
17426 /Users/huon/.cache/pants/named_caches/pex_root/venvs/s/ee77ea11/venv/bin/python3.9 -E -s -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=18, pipe_handle=48) --multiprocessing-fork
17427 /Users/huon/.cache/pants/named_caches/pex_root/venvs/s/ee77ea11/venv/bin/python3.9 -E -s -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=18, pipe_handle=53) --multiprocessing-fork
17428 /Users/huon/.cache/pants/named_caches/pex_root/venvs/s/ee77ea11/venv/bin/python3.9 -E -s -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=18, pipe_handle=38) --multiprocessing-fork
17429 /Users/huon/.cache/pants/named_caches/pex_root/venvs/s/ee77ea11/venv/bin/python3.9 -E -s -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=18, pipe_handle=43) --multiprocessing-fork

give them a bit of time to finish...
16572 pantsd [/private/tmp/4718b6d146634a66fe9c5e906d76e501] PANTS_BIN_NAME=./pants PANTS_DAEMON_ENTRYPOINT=pants.pantsd.pants_daemon:launch_new_pantsd_instance PYTHONPATH=/Users/huon/.cache/pants/setup/bootstrap-Darwin-arm64/pants.WxzanY/install/bin:/Users/huon/.pyenv/versions/3.9.10/lib/python39.zip:/Users/huon/.pyenv/versions/3.9.10/lib/python3.9:/Users/huon/.pyenv/versions/3.9.10/lib/python3.9/lib-dynload:/Users/huon/.cache/pants/setup/bootstrap-Darwin-arm64/2.14.0rc0_py39/lib/python3.9/site-packages ptr_munge= main_stack=
17420 /Users/huon/.cache/pants/named_caches/pex_root/venvs/s/ee77ea11/venv/bin/python3.9 -E -s -c from multiprocessing.resource_tracker import main;main(14)
17423 /Users/huon/.cache/pants/named_caches/pex_root/venvs/s/ee77ea11/venv/bin/python3.9 -E -s -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=18, pipe_handle=38) --multiprocessing-fork
17424 /Users/huon/.cache/pants/named_caches/pex_root/venvs/s/ee77ea11/venv/bin/python3.9 -E -s -c from multiprocessing.resource_tracker import main;main(14)
17425 /Users/huon/.cache/pants/named_caches/pex_root/venvs/s/ee77ea11/venv/bin/python3.9 -E -s -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=18, pipe_handle=43) --multiprocessing-fork
17426 /Users/huon/.cache/pants/named_caches/pex_root/venvs/s/ee77ea11/venv/bin/python3.9 -E -s -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=18, pipe_handle=48) --multiprocessing-fork
17427 /Users/huon/.cache/pants/named_caches/pex_root/venvs/s/ee77ea11/venv/bin/python3.9 -E -s -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=18, pipe_handle=53) --multiprocessing-fork
17428 /Users/huon/.cache/pants/named_caches/pex_root/venvs/s/ee77ea11/venv/bin/python3.9 -E -s -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=18, pipe_handle=38) --multiprocessing-fork
17429 /Users/huon/.cache/pants/named_caches/pex_root/venvs/s/ee77ea11/venv/bin/python3.9 -E -s -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=18, pipe_handle=43) --multiprocessing-fork

Note how the 3 separate lists of processes (starting with 16572 pantsd ...) change:

  1. before running anything: it's just pantsd
  2. directly after the ./pants fmt :: with retry: there's a bunch of extra processes associated with a named cache
  3. 5 seconds later: they're still there

If the touch files/f000.py is commented out, all of the process lists are the same as the first (just pantsd).

Pants version 2.14.0rc0

OS macOS

Additional info

Thanks for pants!

stuhood commented 2 years ago

Mm... sorry for the trouble.

This is because our local process sandbox doesn't use graceful shutdown for processes, and instead kills them immediately with SIGKILL.

We do actually have an implementation of graceful shutdown: https://github.com/pantsbuild/pants/blob/db07a12dd55dd5d9bda5518ac91961720582f7c6/src/rust/engine/process_execution/src/children.rs#L12-L22 ... but it is only used for interactive processes currently.

Eric-Arellano commented 2 years ago

So it sounds like we should use that code with local process sandbox cleanup, too?

stuhood commented 2 years ago

So it sounds like we should use that code with local process sandbox cleanup, too?

Yea.

mbmccoy commented 11 months ago

I've got another use case for graceful shutdown of test targets in particular. I stand up database fixtures using docker containers during test runs (using pytest). Under normal circumstances, these are cleaned up when the tests exit, but when the tests restart due to file system changes, it can leave lingering docker containers running.

Pytest naturally handles SIGINT and runs fixture cleanup by default, so sending SIGINT should resolve my issue with out any additional code changes.

An additional useful feature request would allow adjustable time between the SIGINT and SIGKILL signals. Docker containers can take a while to tear down, so configuring the timeout in the BUILD files may be necessary to make this really work for my use case.