0.3.5: processes of the "w" Windows launcher executables can't be killed anymore

bastimeyer commented 2 years ago

Describe the bug

After the launcher files were updated in 8d2acb51bae32a0237f2d2419b0636bce807b6f7 (0.3.5), processes spawned from the "w" launchers can't be killed anymore. I am facing this issue when trying to spawn and kill the processes from a NodeJS context. Calling childprocess.kill() via NodeJS's child_process module returns success, but the python process keeps running.

This is a major problem for me, because my NodeJS application (GUI application via NW.js / Electron) is running a python CLI application built with the distlib launcher executables, and it needs to use the "w" launchers, because otherwise a console window pops up, which is bad for the user experience.

0.3.3 is the most recent release which is working fine.

0.3.4 had the immediate 3221225477 / 0xC0000005 exit code error, and I already had to ignore this version when building standalone installers of my python application. Fortunately, pip had reverted back to distlib 0.3.3 until recently, so installing via pip was no problem either, but now it has merged the distlib 0.3.5 release, which is once again bad, and the issue needs to be fixed. Related (old 0.3.4 issue): https://github.com/takluyver/pynsist/issues/243#issuecomment-1192054549

Apparently, these are the recent launcher changes: https://bitbucket.org/vinay.sajip/simple_launcher/commits/

To Reproduce

I've created a reproduction repo (already did this for 0.3.4). Please see the build config and build+run scripts for the 0.3.5 issue: https://github.com/bastimeyer/distlib-bug/tree/53f520efec1a6173b7ca037d26c89a6dec8ca633

The resulting GH action logs: https://github.com/bastimeyer/distlib-bug/actions/runs/2718462897

0.3.5-t64 (success): https://github.com/bastimeyer/distlib-bug/runs/7467911880?check_suite_focus=true#step:7:1 0.3.5-w64 (failure): https://github.com/bastimeyer/distlib-bug/runs/7467912062?check_suite_focus=true#step:7:1

0.3.3-t64 (success): https://github.com/bastimeyer/distlib-bug/runs/7467911132?check_suite_focus=true#step:7:1 0.3.3-w64 (success): https://github.com/bastimeyer/distlib-bug/runs/7467911332?check_suite_focus=true#step:7:1

Environment

OS, including version: Windows (any)
Version of this library: 0.3.5

bastimeyer commented 2 years ago

Pinging @vsajip, since he doesn't have the repo on his watch list and probably wasn't notified. https://github.com/takluyver/pynsist/issues/243#issuecomment-1192168244

vsajip commented 2 years ago

Whoops, not sure why I wasn't watching - I've rectified that. Thanks for the tip.

vsajip commented 2 years ago

It's a bit tricky to see what's going on. With the w64.exe that is part of distlib 0.3.5, I can create two launcher test executables from the same source script - one with a shebang referencing c:/python38/python.exe and the other with a shebang referencing the interpreter from a venv created from c:/python38/python.exe. The test executable using the venv's shebang appears not to respond to a subprocess.Popen.kill(), but the other test executable does respond correctly and shuts down. In the failing case, the test executable itself is closed, but the Python interpreter it spawned isn't. For the moment, mystified :confused:

vsajip commented 2 years ago

OK ... it looks as if in the venv case, the Python interpreter is re-invoking the script using the base interpreter - I will put together a test case for this.

vsajip commented 2 years ago

Do you experience the same with your GUI executables, @cbrnr ?

cbrnr commented 2 years ago

I will check. What are the steps to reproduce this problem?

vsajip commented 2 years ago

Thanks. Just launch a GUI executable (built using a recent pip which has vendored distlib 0.3.5) using e.g. subprocess.Popen() and then try to kill it using subprocess.Popen.kill().

vsajip commented 2 years ago

I seem to have got closer to finding the cause of this problem. I have set up a test repository with some GitHub test steps. The summary is:

On Python 3.6, all launchers behave as expected.
On Python 3.7 and later, windowed executables don't terminate when subprocess.Popen.kill() is called on them.

The same launchers (from distlib 0.3.5) were used for all tests. Here is an example test run.

What changed is that from Python 3.7.2 onwards,

venv on Windows no longer copies the original binaries, but creates redirector scripts named python.exe and pythonw.exe instead.

So, it seems that the pythonw.exe redirector script isn't behaving as expected.

I'd appreciate a second opinion from @eryksun and/or @mbikovitsky, as they were involved in the recent changes to the launcher code during work to fix this pip issue. It may be that this issue needs to be raised against the CPython project itself, rather than here.

mbikovitsky commented 2 years ago

I think there are two related issues here.

The first one is in the distlib launcher:

The "windowed" version of the launcher calls clear_app_starting_state after launching the child process but before assigning it to the job object.
Inside of clear_app_starting_state there is a call to WaitForInputIdle without a timeout.
In the case of running inside a virtualenv, the child process we're waiting for is Python's venvlauncher. That's the pythonw.exe inside the virtualenv that is responsible for finding and launching the actual Python interpreter.
This launcher, AFAICT, never becomes "idle". Therefore, the WaitForInputIdle call in the parent never returns, the child is never assigned to a job, and so when the parent is killed the child lives on.

That last part is a little bit of speculation on my part, but I have some circumstantial evidence:

When the distlib launcher is killed, the child process does not belong to a job. This can be observed by launching the tests under WinDbg with windbg.exe -g -o python run_launcher_tests.py -w, which will break when the distlib launcher is terminated. Then, looking at the process tree in Process Hacker or Process Explorer, we can see that the venvlauncher is not part of a job.
When the distlib launcher is killed, it's blocked inside WaitForInputIdle. Again, this can be observed in the same way using WinDbg. Looking around the call site, this WaitForInputIdle is exactly the one inside clear_app_starting_state.
The clear_app_starting_state performs more work in order to clear the "idle" state than the similar code in the venvlauncher. In the comments there's an empirical explanation of why this is necessary. This might explain why WaitForInputIdle never returns.

If my reasoning is correct, then it will also explain why launching the Python interpreter directly, without going through the venvlauncher, works (run_launcher_tests.py -s "" -w). In that case the WaitForInputIdle returns when the Python test script shows the message box (the message box is probably enough to trigger the "idle" state).

I think the immediate fix is to move the call to AssignProcessToJobObject immediately after the process creation. Unfortunately I can't easily test this right now, so until then it's just a theory.

The other issue, perhaps more suitable for the CPython project, is the divergence between the distlib and virtualenv launchers. They perform different tasks, but face similar problems: proxying the "idle" state, forwarding standard I/O, etc. Perhaps there should be some collaboration to avoid duplication of work (and bugs).

vsajip commented 2 years ago

In that case the WaitForInputIdle returns when the Python test script shows the message box (the message box is probably enough to trigger the "idle" state).

That suggests that expected behaviour (with respect to WaitForInputIdle) depends on the specifics of the script being run. Arguably, pythonw.exe should clear the idle state reliably, independently of what script it's running. Would you agree?

The other issue, perhaps more suitable for the CPython project, is the divergence between the distlib and virtualenv launchers.

Sure, the simple_launcher was adapted from my work on the Python launcher (py.exe / pyw.exe), and this was well before venvlauncher was introduced. There has been some divergence, over the years, but normally (as you might expect for a volunteer project) some reported issue is the trigger for looking at divergences and deciding what to do about them.

mbikovitsky commented 2 years ago

Arguably, pythonw.exe should clear the idle state reliably, independently of what script it's running. Would you agree?

No, I think the current behaviour of the interpreter is correct. If the interpreter were to always clear the "idle state" on startup, it might introduce subtle bugs in existing code. For instance, consider the following setup:

Script a.py creates a message box on startup and waits for the user to click "OK".
Script b.py launches script a.py, calls WaitForInputIdle to know when it's ready to accept input, searches for the message box it created, and programmatically clicks the "OK" button.

This works right now because WaitForInputIdle returns when a.py has created the message box and is ready to accept input. If the Python interpreter were to clear the idle state itself, we would have a race condition: b.py would start examining the child process while it's still initializing.

In fact, this is the exact use case that the documentation for WaitForInputIdle describes.

That suggests that expected behaviour (with respect to WaitForInputIdle) depends on the specifics of the script being run.

Yep.

This means that both the distlib and venv launchers should proxy this state. I think the current solution in clear_app_starting_state is fine, except that maybe a timeout should be added to the wait. If WaitForInputIdle times out, don't set the idle state. And the same thing should be done in the venv launcher as well.

vsajip commented 2 years ago

I think the immediate fix is to move the call to AssignProcessToJobObject immediately after the process creation.

This worked in an initial test on my system. Good call :clap:

vsajip commented 2 years ago

it might introduce subtle bugs in existing code. For instance, consider the following setup

Good point.

vsajip commented 2 years ago

I've rebuilt all the launchers with the change that @mbikovitsky suggested, and the above commit adds them to distlib. Local tests seem to work. Please try with the updated launchers and once I receive confirmation that the change fixes the problem, I'll see about cutting a release. Reopening for now, will close when the release is done.

bastimeyer commented 2 years ago

Thanks, looks like it's working fine now. :+1:

I've added 52de225 to my reproduction repo:

0.3.5: keeps running :x: https://github.com/bastimeyer/distlib-bug/runs/7662240282?check_suite_focus=true#step:7:1
52de225: gets killed :heavy_check_mark: https://github.com/bastimeyer/distlib-bug/runs/7662240528?check_suite_focus=true#step:7:1

I've also done a production-build test and built my python application using that distlib commit and killing the process of the "...w.exe" executable now works.

eryksun commented 2 years ago

Sure, the simple_launcher was adapted from my work on the Python launcher (py.exe / pyw.exe), and this was well before venvlauncher was introduced. There has been some divergence, over the years, but normally (as you might expect for a volunteer project) some reported issue is the trigger for looking at divergences and deciding what to do about them.

Python 3.11 uses a new implementation of the py launcher that was developed by Steve Dower. It's implemented in "PC/launcher2.c". It's mostly complete, but still ironing out corner cases and niche features (e.g. the "/usr/bin/env" search is just getting implemented).

The venv launchers are still implemented in "PC/launcher.c". It would be nice if the script launcher could bypass the environment launcher. That requires coordinated development, or maybe implementing a common launcher for scripts and virtual environments.

vsajip commented 2 years ago

Python 3.11 uses a new implementation of the py launcher that was developed by Steve Dower. It's implemented in "PC/launcher2.c"

Oh, right - I had no idea. Do you have any information on the reason for the re-implementation? I've seen no discussion on discuss.python.org ...

vsajip commented 2 years ago

I just took a quick look at PC/launcher2.c - it doesn't seem to implement any of the functionality for proxying idle state, etc. I'll try to start a discussion on discuss.python.org.

vsajip commented 2 years ago

Thanks, looks like it's working fine now.

Thanks for the quick feedback. I think I will implement one more pending change - that timeout in the WaitForInputIdle that @mbikovitsky mentioned. When that's ready, I'll ask your indulgence regarding re-testing the updated launchers.

vsajip commented 2 years ago

I propose to change clear_app_starting_state() to the following:

#define INPUT_IDLE_TIMEOUT 1000

static void
clear_app_starting_state() {
    MSG msg;
    HWND hwnd;
    DWORD rc;

    PostMessageW(0, 0, 0, 0);
    GetMessageW(&msg, 0, 0, 0);
    /*
     * Proxy the child's input idle event.
     * See https://github.com/pypa/distlib/issues/175#issuecomment-1204452974
     */
    rc = WaitForInputIdle(child_process_info.hProcess, INPUT_IDLE_TIMEOUT);
    if (rc == 0) {
        /*
         * Signal the process input idle event by creating a window and pumping
         * sent messages. The window class isn't important, so just use the
         * system "STATIC" class.
         */
        hwnd = CreateWindowExW(0, L"STATIC", L"PyLauncher", 0, 0, 0, 0, 0,
                               HWND_MESSAGE, NULL, NULL, NULL);
        /* Process all sent messages and signal input idle. */
        PeekMessageW(&msg, hwnd, 0, 0, 0);
        DestroyWindow(hwnd);
    }
}

Does this seem reasonable? I tried it like this and local tests seem to work. I'm also assuming that one second is reasonable as a timeout.

eryksun commented 2 years ago

There's no need to pass a finite timeout to WaitForInputIdle(). The child could take an arbitrary amount of time before it creates a window and starts processing window messages. If the child never goes idle, then the wait will return WAIT_FAILED, and the launcher will move on to proxy the exit status. It just has to ensure that the child is assigned to the job object beforehand.

vsajip commented 2 years ago

If the child never goes idle, then the wait will return WAIT_FAILED

If the timeout is INFINITE, when does the wait return? Only when the child exits?

eryksun commented 2 years ago

If the timeout is INFINITE, when does the wait return? Only when the child exits?

If the child exits, that's the "child never goes idle" case, which returns WAIT_FAILED. Otherwise, WaitForInputIdle(hProcess, INFINITE) returns when the input idle event of the child process is signaled. In that case (i.e. rc == 0), the script launcher signals its own input idle event by creating a hidden window and peeking messages. That way anything that calls WaitForInputIdle() on the launcher is indirectly checking the input idle state of the child process.

vsajip commented 2 years ago

If no finite timeout needs to be passed to WaitForInputIdle(), then I guess the current launchers in the distlib repo are good to go, as long as there are no other adverse reports.

vsajip commented 2 years ago

I've made an additional change to the launchers, implementing the improved control key/message handling as suggested by @eryksun. Please check and confirm that the updated launchers in this commit still work in your use cases.

bastimeyer commented 2 years ago

Please check and confirm that the updated launchers in this commit still work in your use cases.

Built my python app with distlib @ 37df85a61ead2ea2dc48d0e06f7bfe2f209a982c (current master), and everything's working as expected.

vsajip commented 2 years ago

Thanks for the feedback. That's good to know - I'll aim to cut a release before too long.

eryksun commented 2 years ago

I've made an additional change to the launchers, implementing the improved control key/message handling as suggested by @eryksun. Please check and confirm that the updated launchers in this commit still work in your use cases.

Note that this change only affects console scripts. The console script launcher doesn't load "user32.dll" (e.g. it uses no shell32 functions such as SHGetFolderPathW), so when the session is ending for logoff or shutdown, the system sends it CTRL_LOGOFF_EVENT or CTRL_SHUTDOWN_EVENT. Also, when the attached console (classic or a tab in Terminal) is closing, the system sends CTRL_CLOSE_EVENT.

The GUI script launcher loads "user32.dll" (for WaitForInputIdle, CreateWindowExW, DestroyWindow, PeekMessageW, GetMessageW, and PostMessageW), so the system doesn't send it CTRL_LOGOFF_EVENT or CTRL_SHUTDOWN_EVENT. To support gracefully exiting of Python scripts at logoff or shutdown, the GUI launcher would have to create a hidden window, pump messages, and handle WM_QUERYENDSESSION and WM_ENDSESSION. When the session is ending, remove Python from the job object and exit. The need to create a hidden window means the WaitForInputIdle() call would need a reasonable timeout such as 10 seconds. After WaitForInputIdle() returns due to success or timeout, create the hidden window to watch for the session ending. This has the secondary effect of signalling the launcher's input idle event.

vsajip commented 2 years ago

I'm a little short of time at the moment. Let's see if not doing the above leads to any reported problems. If anyone wants to work up a PR for the above comment, please feel free.

pradyunsg commented 2 years ago

@vsajip Wanna close this out, now that 0.3.6 has been released?

vsajip commented 2 years ago

Sure.

pypa / distlib

0.3.5: processes of the "w" Windows launcher executables can't be killed anymore #175