Closed bastimeyer closed 2 years ago
Pinging @vsajip, since he doesn't have the repo on his watch list and probably wasn't notified. https://github.com/takluyver/pynsist/issues/243#issuecomment-1192168244
Whoops, not sure why I wasn't watching - I've rectified that. Thanks for the tip.
It's a bit tricky to see what's going on. With the w64.exe
that is part of distlib
0.3.5, I can create two launcher test executables from the same source script - one with a shebang referencing c:/python38/python.exe
and the other with a shebang referencing the interpreter from a venv created from c:/python38/python.exe
. The test executable using the venv's shebang appears not to respond to a subprocess.Popen.kill()
, but the other test executable does respond correctly and shuts down. In the failing case, the test executable itself is closed, but the Python interpreter it spawned isn't. For the moment, mystified :confused:
OK ... it looks as if in the venv case, the Python interpreter is re-invoking the script using the base interpreter - I will put together a test case for this.
Do you experience the same with your GUI executables, @cbrnr ?
I will check. What are the steps to reproduce this problem?
Thanks. Just launch a GUI executable (built using a recent pip
which has vendored distlib
0.3.5) using e.g. subprocess.Popen()
and then try to kill it using subprocess.Popen.kill()
.
I seem to have got closer to finding the cause of this problem. I have set up a test repository with some GitHub test steps. The summary is:
subprocess.Popen.kill()
is called on them.The same launchers (from distlib
0.3.5) were used for all tests. Here is an example test run.
What changed is that from Python 3.7.2 onwards,
venv
on Windows no longer copies the original binaries, but creates redirector scripts named python.exe and pythonw.exe instead.
So, it seems that the pythonw.exe
redirector script isn't behaving as expected.
I'd appreciate a second opinion from @eryksun and/or @mbikovitsky, as they were involved in the recent changes to the launcher code during work to fix this pip
issue. It may be that this issue needs to be raised against the CPython project itself, rather than here.
I think there are two related issues here.
The first one is in the distlib
launcher:
clear_app_starting_state
after launching the child process but before assigning it to the job object.clear_app_starting_state
there is a call to WaitForInputIdle
without a timeout.pythonw.exe
inside the virtualenv that is responsible for finding and launching the actual Python interpreter.WaitForInputIdle
call in the parent never returns, the child is never assigned to a job, and so when the parent is killed the child lives on.That last part is a little bit of speculation on my part, but I have some circumstantial evidence:
distlib
launcher is killed, the child process does not belong to a job. This can be observed by launching the tests under WinDbg with windbg.exe -g -o python run_launcher_tests.py -w
, which will break when the distlib
launcher is terminated. Then, looking at the process tree in Process Hacker or Process Explorer, we can see that the venvlauncher is not part of a job.distlib
launcher is killed, it's blocked inside WaitForInputIdle
. Again, this can be observed in the same way using WinDbg. Looking around the call site, this WaitForInputIdle
is exactly the one inside clear_app_starting_state
.clear_app_starting_state
performs more work in order to clear the "idle" state than the similar code in the venvlauncher. In the comments there's an empirical explanation of why this is necessary. This might explain why WaitForInputIdle
never returns.If my reasoning is correct, then it will also explain why launching the Python interpreter directly, without going through the venvlauncher, works (run_launcher_tests.py -s "" -w
). In that case the WaitForInputIdle
returns when the Python test script shows the message box (the message box is probably enough to trigger the "idle" state).
I think the immediate fix is to move the call to AssignProcessToJobObject
immediately after the process creation. Unfortunately I can't easily test this right now, so until then it's just a theory.
The other issue, perhaps more suitable for the CPython project, is the divergence between the distlib
and virtualenv launchers. They perform different tasks, but face similar problems: proxying the "idle" state, forwarding standard I/O, etc. Perhaps there should be some collaboration to avoid duplication of work (and bugs).
In that case the
WaitForInputIdle
returns when the Python test script shows the message box (the message box is probably enough to trigger the "idle" state).
That suggests that expected behaviour (with respect to WaitForInputIdle
) depends on the specifics of the script being run. Arguably, pythonw.exe
should clear the idle state reliably, independently of what script it's running. Would you agree?
The other issue, perhaps more suitable for the CPython project, is the divergence between the distlib and virtualenv launchers.
Sure, the simple_launcher
was adapted from my work on the Python launcher (py.exe
/ pyw.exe
), and this was well before venvlauncher
was introduced. There has been some divergence, over the years, but normally (as you might expect for a volunteer project) some reported issue is the trigger for looking at divergences and deciding what to do about them.
Arguably,
pythonw.exe
should clear the idle state reliably, independently of what script it's running. Would you agree?
No, I think the current behaviour of the interpreter is correct. If the interpreter were to always clear the "idle state" on startup, it might introduce subtle bugs in existing code. For instance, consider the following setup:
a.py
creates a message box on startup and waits for the user to click "OK".b.py
launches script a.py
, calls WaitForInputIdle
to know when it's ready to accept input, searches for the message box it created, and programmatically clicks the "OK" button.This works right now because WaitForInputIdle
returns when a.py
has created the message box and is ready to accept input. If the Python interpreter were to clear the idle state itself, we would have a race condition: b.py
would start examining the child process while it's still initializing.
In fact, this is the exact use case that the documentation for WaitForInputIdle
describes.
That suggests that expected behaviour (with respect to
WaitForInputIdle
) depends on the specifics of the script being run.
Yep.
This means that both the distlib
and venv launchers should proxy this state. I think the current solution in clear_app_starting_state
is fine, except that maybe a timeout should be added to the wait. If WaitForInputIdle
times out, don't set the idle state. And the same thing should be done in the venv launcher as well.
I think the immediate fix is to move the call to
AssignProcessToJobObject
immediately after the process creation.
This worked in an initial test on my system. Good call :clap:
it might introduce subtle bugs in existing code. For instance, consider the following setup
Good point.
I've rebuilt all the launchers with the change that @mbikovitsky suggested, and the above commit adds them to distlib
. Local tests seem to work. Please try with the updated launchers and once I receive confirmation that the change fixes the problem, I'll see about cutting a release. Reopening for now, will close when the release is done.
Thanks, looks like it's working fine now. :+1:
I've added 52de225
to my reproduction repo:
0.3.5
: keeps running :x:
https://github.com/bastimeyer/distlib-bug/runs/7662240282?check_suite_focus=true#step:7:152de225
: gets killed :heavy_check_mark:
https://github.com/bastimeyer/distlib-bug/runs/7662240528?check_suite_focus=true#step:7:1I've also done a production-build test and built my python application using that distlib commit and killing the process of the "...w.exe" executable now works.
Sure, the
simple_launcher
was adapted from my work on the Python launcher (py.exe
/pyw.exe
), and this was well beforevenvlauncher
was introduced. There has been some divergence, over the years, but normally (as you might expect for a volunteer project) some reported issue is the trigger for looking at divergences and deciding what to do about them.
Python 3.11 uses a new implementation of the py launcher that was developed by Steve Dower. It's implemented in "PC/launcher2.c". It's mostly complete, but still ironing out corner cases and niche features (e.g. the "/usr/bin/env" search is just getting implemented).
The venv launchers are still implemented in "PC/launcher.c". It would be nice if the script launcher could bypass the environment launcher. That requires coordinated development, or maybe implementing a common launcher for scripts and virtual environments.
Python 3.11 uses a new implementation of the py launcher that was developed by Steve Dower. It's implemented in "PC/launcher2.c"
Oh, right - I had no idea. Do you have any information on the reason for the re-implementation? I've seen no discussion on discuss.python.org
...
I just took a quick look at PC/launcher2.c
- it doesn't seem to implement any of the functionality for proxying idle state, etc. I'll try to start a discussion on discuss.python.org
.
Thanks, looks like it's working fine now.
Thanks for the quick feedback. I think I will implement one more pending change - that timeout in the WaitForInputIdle
that @mbikovitsky mentioned. When that's ready, I'll ask your indulgence regarding re-testing the updated launchers.
I propose to change clear_app_starting_state()
to the following:
#define INPUT_IDLE_TIMEOUT 1000
static void
clear_app_starting_state() {
MSG msg;
HWND hwnd;
DWORD rc;
PostMessageW(0, 0, 0, 0);
GetMessageW(&msg, 0, 0, 0);
/*
* Proxy the child's input idle event.
* See https://github.com/pypa/distlib/issues/175#issuecomment-1204452974
*/
rc = WaitForInputIdle(child_process_info.hProcess, INPUT_IDLE_TIMEOUT);
if (rc == 0) {
/*
* Signal the process input idle event by creating a window and pumping
* sent messages. The window class isn't important, so just use the
* system "STATIC" class.
*/
hwnd = CreateWindowExW(0, L"STATIC", L"PyLauncher", 0, 0, 0, 0, 0,
HWND_MESSAGE, NULL, NULL, NULL);
/* Process all sent messages and signal input idle. */
PeekMessageW(&msg, hwnd, 0, 0, 0);
DestroyWindow(hwnd);
}
}
Does this seem reasonable? I tried it like this and local tests seem to work. I'm also assuming that one second is reasonable as a timeout.
There's no need to pass a finite timeout to WaitForInputIdle()
. The child could take an arbitrary amount of time before it creates a window and starts processing window messages. If the child never goes idle, then the wait will return WAIT_FAILED
, and the launcher will move on to proxy the exit status. It just has to ensure that the child is assigned to the job object beforehand.
If the child never goes idle, then the wait will return
WAIT_FAILED
If the timeout is INFINITE
, when does the wait return? Only when the child exits?
If the timeout is
INFINITE
, when does the wait return? Only when the child exits?
If the child exits, that's the "child never goes idle" case, which returns WAIT_FAILED
. Otherwise, WaitForInputIdle(hProcess, INFINITE)
returns when the input idle event of the child process is signaled. In that case (i.e. rc == 0
), the script launcher signals its own input idle event by creating a hidden window and peeking messages. That way anything that calls WaitForInputIdle()
on the launcher is indirectly checking the input idle state of the child process.
If no finite timeout needs to be passed to WaitForInputIdle()
, then I guess the current launchers in the distlib
repo are good to go, as long as there are no other adverse reports.
I've made an additional change to the launchers, implementing the improved control key/message handling as suggested by @eryksun. Please check and confirm that the updated launchers in this commit still work in your use cases.
Please check and confirm that the updated launchers in this commit still work in your use cases.
Built my python app with distlib @ 37df85a61ead2ea2dc48d0e06f7bfe2f209a982c (current master), and everything's working as expected.
Thanks for the feedback. That's good to know - I'll aim to cut a release before too long.
I've made an additional change to the launchers, implementing the improved control key/message handling as suggested by @eryksun. Please check and confirm that the updated launchers in this commit still work in your use cases.
Note that this change only affects console scripts. The console script launcher doesn't load "user32.dll" (e.g. it uses no shell32 functions such as SHGetFolderPathW
), so when the session is ending for logoff or shutdown, the system sends it CTRL_LOGOFF_EVENT
or CTRL_SHUTDOWN_EVENT
. Also, when the attached console (classic or a tab in Terminal) is closing, the system sends CTRL_CLOSE_EVENT
.
The GUI script launcher loads "user32.dll" (for WaitForInputIdle
, CreateWindowExW
, DestroyWindow
, PeekMessageW
, GetMessageW
, and PostMessageW
), so the system doesn't send it CTRL_LOGOFF_EVENT
or CTRL_SHUTDOWN_EVENT
. To support gracefully exiting of Python scripts at logoff or shutdown, the GUI launcher would have to create a hidden window, pump messages, and handle WM_QUERYENDSESSION
and WM_ENDSESSION
. When the session is ending, remove Python from the job object and exit. The need to create a hidden window means the WaitForInputIdle()
call would need a reasonable timeout such as 10 seconds. After WaitForInputIdle()
returns due to success or timeout, create the hidden window to watch for the session ending. This has the secondary effect of signalling the launcher's input idle event.
I'm a little short of time at the moment. Let's see if not doing the above leads to any reported problems. If anyone wants to work up a PR for the above comment, please feel free.
@vsajip Wanna close this out, now that 0.3.6 has been released?
Sure.
Describe the bug
After the launcher files were updated in 8d2acb51bae32a0237f2d2419b0636bce807b6f7 (0.3.5), processes spawned from the "w" launchers can't be killed anymore. I am facing this issue when trying to spawn and kill the processes from a NodeJS context. Calling
childprocess.kill()
via NodeJS'schild_process
module returns success, but the python process keeps running.This is a major problem for me, because my NodeJS application (GUI application via NW.js / Electron) is running a python CLI application built with the distlib launcher executables, and it needs to use the "w" launchers, because otherwise a console window pops up, which is bad for the user experience.
0.3.3 is the most recent release which is working fine.
0.3.4 had the immediate
3221225477
/0xC0000005
exit code error, and I already had to ignore this version when building standalone installers of my python application. Fortunately, pip had reverted back to distlib 0.3.3 until recently, so installing via pip was no problem either, but now it has merged the distlib 0.3.5 release, which is once again bad, and the issue needs to be fixed. Related (old 0.3.4 issue): https://github.com/takluyver/pynsist/issues/243#issuecomment-1192054549Apparently, these are the recent launcher changes: https://bitbucket.org/vinay.sajip/simple_launcher/commits/
To Reproduce
I've created a reproduction repo (already did this for 0.3.4). Please see the build config and build+run scripts for the 0.3.5 issue: https://github.com/bastimeyer/distlib-bug/tree/53f520efec1a6173b7ca037d26c89a6dec8ca633
The resulting GH action logs: https://github.com/bastimeyer/distlib-bug/actions/runs/2718462897
0.3.5-t64 (success): https://github.com/bastimeyer/distlib-bug/runs/7467911880?check_suite_focus=true#step:7:1 0.3.5-w64 (failure): https://github.com/bastimeyer/distlib-bug/runs/7467912062?check_suite_focus=true#step:7:1
0.3.3-t64 (success): https://github.com/bastimeyer/distlib-bug/runs/7467911132?check_suite_focus=true#step:7:1 0.3.3-w64 (success): https://github.com/bastimeyer/distlib-bug/runs/7467911332?check_suite_focus=true#step:7:1
Environment