takluyver / pynsist

Build Windows installers for Python applications
https://pynsist.readthedocs.io/
Other
882 stars 119 forks source link

Built application executables broken since distlib 0.3.4 #243

Closed bastimeyer closed 1 year ago

bastimeyer commented 2 years ago

Distlib 0.3.4 has been released in December and there have been some changes in regards to the executables which pynsist uses to build the application executables.

The result is that the executables are broken when trying to run them from a non-command-line context (eg. a GUI application) and they immediately exit with code 3221225477 (0xC0000005). This does not happen when executing them from bash or powershell, for some reason. This was working perfectly fine previously when building pynsist installers with distlib==0.3.3 installed.

Pip 21.3.1 currently bundles distlib 0.3.3 and executables built via pip when installing wheels are therefore working. I am not sure though if this will change once pip upgrades. In this case, the issue is caused by distlib itself.

This is how pynsist builds (concatenates) the wrappers: https://github.com/takluyver/pynsist/blob/master/nsist/commands.py

Original issue, in case more context is required to understand the issue: https://github.com/streamlink/streamlink/issues/4307

takluyver commented 2 years ago

I see one commit which changed the source code for the launcher executables (https://github.com/pypa/distlib/commit/e9923d5af745c68affdb2c1f45d1b3f05138e245 ) and three others which seem to have changed the binaries without changing the source code. Maybe rebuilding the code with different options? :confused:

Pynsist uses the relative interpreter path feature, which I don't think pip does, so that might expose different problems.

@vsajip any immediate ideas what change between distlib 0.3.3 and 0.3.4 might lead to the GUI exe launchers failing with exit code 0xC0000005?

vsajip commented 2 years ago

I can't think of anything specific - the only recent changes were in response to the handle duplication issue. There was quite a bit of to and fro with comments by Eryk Sun about the best way to approach it. The basic change for GUI applications was to ignore errors in handle duplication when in GUI mode. I didn't change any build options as far as I can remember, but I suppose it is possible that the executables copied into distlib were somehow the "wrong" versions. It's a manual step so an error is always possible.

As far as I know, error code 0xC0000005 is an access violation, so I'm not entirely sure how that could have been caused by the launcher changes - have you (@takluyver) been able to reproduce this error with different scripts? Or could it be something the streamlink script is doing which, together with launcher changes, is causing the issue for them?

bastimeyer commented 2 years ago

Or could it be something the streamlink script is doing which, together with launcher changes, is causing the issue for them?

The only difference between the builds was the distlib version, 0.3.4 and 0.3.3. Every other component of the build and runtime environment was identical. There's also nothing special with the build config of Streamlink's installer. You could probably concat the executables manually, similar to what pynsist does automatically.

This means that the only thing that's different is the part of the distlib binary pynsist uses to concat the application executable, which is either w32.exe or t32.exe, and the 0.3.4 versions are not working correctly.

As said, both of the 0.3.4-based executables work perfectly fine when running them from BASH or Powershell, but they don't work when they are executed from a GUI application and they terminate immediately. I have no idea why this is the case.

vsajip commented 2 years ago

Well, I wasn't suggesting that streamlink stuff changed, merely that in conjunction with the newer launcher behaviour it could be a possible cause. It needs further investigation - on the pip issue I asked for confirmation that the distlib 0.3.4 launchers resolved that issue, and that was confirmed. Of course that doesn't mean that there isn't still a remaining bug ... but

the 0.3.4-based executables work perfectly fine when running them from BASH or Powershell, but they don't work when they are executed from a GUI application and they terminate immediately

is puzzling. Can you (@bastimeyer) perhaps provide a minimal GUI application which launches a minimal GUI application with launcher which demonstrates the failure?

takluyver commented 2 years ago

Yup, it's a good idea to try an example without streamlink. The pyglet, pygame and pyqt5 examples in this repo are all fairly self contained and easy to build. I don't have a convenient Windows environment to test them, though I can build an installer for them if anyone needs that to test.

bastimeyer commented 2 years ago

I've built a repro-repo here: https://github.com/bastimeyer/distlib-bug/tree/3d0c3c8bff1580d4dcc1b4d8d56e03542612452b https://github.com/bastimeyer/distlib-bug/runs/4928128903?check_suite_focus=true#step:7:1

Should I close this issue here and open a new one on pypa/distlib?

What the reproduction repo does is

  1. download python windows embed 32+64, and extract them
  2. install distlib 0.3.3 and build simple executables similar to what pynsist does
  3. install distlib 0.3.4 and build simple executables similar to what pynsist does
  4. execute each file via a bash script and via a nodejs script

bash always works, but nodeJS doesn't when executables are built with distlib 0.3.4

vsajip commented 2 years ago

nodeJS isn't a GUI application, is it? If not, then the problem is not specific to the launcher being invoked from a GUI process?

bastimeyer commented 2 years ago

I thought it was related to GUI processes at first because I could only see the issue when launching it from my GUI application, but apparently the bug is caused by something else. The GUI is Chromium+NodeJS based, so that's why I figured trying to execute it from NodeJS alone, and it manages to reproduce it, hence my simple reproduction repo with the Github CI runner that executes the files from bash and a simple nodejs script. The detached: true option that I had set in the script is irrelevant, btw.

vsajip commented 2 years ago

Thanks for putting the repro-repo together. I will look at this when I can, but I won't be able to spend much time on it until after March 20, due to existing commitments.

vsajip commented 2 years ago

Hmmm - I can reproduce the error with spawning from Node, but not with e.g. spawning as a subprocess from Python. From what I can tell, the launcher doesn't even seem to get control when called from Node. To establish what is going on, in the launcher's main() I effectively print "Started." to the console before doing everything else as before.

I made slight changes to your build.py (so I can test without having to install distlib each time) and added another pyrun.py, so that I can test spawning the launchers from Python. I also added a file get_latest.cmd which copies debug builds of the launchers to td32.exe, wd32.exe, td64.exe and wd64.exe and builds test executables from them. My versions of those scripts are in this Gist, and also in this attached ZIP file: distlib-bug.zip which also contains the debug-build executables and the test executables created from them.

When I try to spawn a test executable using Node, I get a failure, as you do:

~\Projects\distlib-bug (master) > node run.js test-t64.exe a b c
Running via NodeJS
Exec: "test-t64.exe"
Args: ["a","b","c"]
Closed: code=3221225477, signal=null

But when I try to use Python to spawn the same test executable, all is well:

~\Projects\distlib-bug (master) > python3 pyrun.py test-t64.exe a b c
Started.
3.8.10 (tags/v3.8.10:3d8993a, May  3 2021, 11:48:03) [MSC v.1928 64 bit (AMD64)]
['C:\\Users\\Vinay\\Projects\\distlib-bug\\test-t64.exe', 'a', 'b', 'c']
C:\Users\Vinay\Projects\distlib-bug\python64\python.exe
Program returned 0

Notice the "Started." message, which is generated by this code in the console executables:

int main(int argc, char* argv[])
{
#if defined(LOGGING_ENABLED)
    debug_log(L"Started.");
#endif
    return process(argc, argv);
}

where I have only added the code in the #define (and, elsewhere, the debug_log() function, also in the #define).

So, it would appear that there is something about the way that Node (and your original GUI program) spawn the child process that is causing the observed behaviour. Please try spawning first the test executables in the attached ZIP, and then the ones you built, using "python pyrun.py ..." rather than "node run.js ..." and report back your results. Of course, this doesn't explain why the distlib 0.3.3 executables cause no problems. But it certainly seems like the launcher doesn't even get control when trying to spawn from Node, but it's not obvious why.

vsajip commented 2 years ago

It could be something to do with how you're opening/passing streams to the child process. With your run.js as is:

~\Projects\distlib-bug (master) > node run.js test-t64.exe a b c
Running via NodeJS
Exec: "test-t64.exe"
Args: ["a","b","c"]
Closed: code=3221225477, signal=null

as before, but if you change the spawn options:

const inst = spawn( exec, args, {
    stdio: 'ignore', // added this
    detached: false
});

then you get

~\Projects\distlib-bug (master) > node run.js test-t64.exe a b c
Running via NodeJS                                              
Exec: "test-t64.exe"                                            
Args: ["a","b","c"]                                             
Closed: code=0, signal=null                                     

So, no 0xC0000005 error. Although no output is shown, that could be because of how Node pipes it.

Note that between distlib 0.3.3 and 0,3.4, processing of file handles has changed, which could explain why 0.3.3 and 0.3.4 behave differently. However, the Python pyrun.py example shows that the launcher is working as expected in situations beyond just being invoked from Bash and Powershell.

bastimeyer commented 2 years ago

So, it would appear that there is something about the way that Node spawn the child process that is causing the observed behaviour.

Note that between distlib 0.3.3 and 0,3.4, processing of file handles has changed, which could explain why 0.3.3 and 0.3.4 behave differently. It could be something to do with how you're opening/passing streams to the child process.

Well, this is the reason why 0.3.4 stopped working, but this isn't to be blamed on Node. Their APIs and implementations haven't been changed. A change in distlib shouldn't introduce breaking changes when spawning processes via NodeJS.

You can see NodeJS's implementation (their internal JS API layer and via libuv) and how it spawns child processes here:

stdio: 'ignore'

This prevents the default stream IOs from being set up initially, otherwise it does just that and immediately reads/flushes the stream data.

vsajip commented 2 years ago

but this isn't to be blamed on Node

I'm not blaming anyone or anything, just trying to understand what might be happening. The launcher change that was made was to fix another issue. It's not entirely clear to me whether invoking the subprocess in some other way from Node might resolve your issue.

One more point: in the case where you get the 0xC0000005 error, the launcher executable isn't even started - so it's not like we can do anything in the launcher to solve that problem.

You can see NodeJS's implementation (their internal JS API layer and via libuv)

Well, I don't expect to look into that.

stdio: 'ignore'

This prevents the default stream IOs from being set up initially

Well yes, I wasn't suggesting anything other than that the 0xC0000005 error seems to be occurring within Node, since it goes away with no change to the launcher. The changed code in the launcher doesn't even get to execute, as far as I can tell, when the error you reported occurs.

Did you try with the pyrun.py script and the other bits in the .zip file? If so, what did you find?

I would experiment with using the Node APIs differently (e.g. different stdio options, sychronous variant, etc.) to try to find something that works for you.

takluyver commented 2 years ago

Looking at the NodeJS docs for spawn, it appears that Node's default options for stdio are equivalent to passing stdin=PIPE, stdout=PIPE, stderr=PIPE to subprocess in Python. It might be worth trying that in Python.

@bastimeyer's JS code also passes detached: true - maybe corresponding to creationflags=DETACHED_PROCESS? I don't know the Windows side of things well enough to know exactly what's going on. The NodeJS docs say:

On Windows, setting options.detached to true makes it possible for the child process to continue running after the parent exits. The child will have its own console window. Once enabled for a child process, it cannot be disabled.

bastimeyer commented 2 years ago

whether invoking the subprocess in some other way from Node might resolve your issue

That's what I meant with NodeJS is not to be blamed. It's not up to my or any other NodeJS app to change their implementation with how they run python app executables built with distlib's launchers. Sure, I can experiment with a different configuration/setup, but that's only for trying to figure out the exact reason why this is failing.

Did you try with the pyrun.py script and the other bits in the .zip file? If so, what did you find?

No, I did not. The pyrun script shouldn't matter here, as it's working according to what you said. And the other executables are only built with a short debug log message at the beginning, right?

I've extended the build script of my repro repo and it now builds launchers with all distlib commits from between 0.3.3 and 0.3.4, and the built launchers start to fail with distlib's commit d2d9c12 for some reason:

Log (see the build log in the step above): https://github.com/bastimeyer/distlib-bug/runs/4965789521?check_suite_focus=true#step:7:74 Diff of the build setup changes: https://github.com/bastimeyer/distlib-bug/compare/49a8770..4237198

There's a bug in the build script and it doesn't properly install via pip. Let me fix this real quick. Sorry...

Fixed it. https://github.com/bastimeyer/distlib-bug/runs/4966204374?check_suite_focus=true#step:7:586 https://github.com/bastimeyer/distlib-bug/compare/49a8770..28ef45d

The bad commit is this one: https://github.com/pypa/distlib/commit/277cc4eb2eebbca4742004d1fcc6c08e1c70c3b9 Unfortunately, binary data was committed here without source code changes.


detached: true

I already mentioned that both true and false make no difference. I set it to true initially because that's what my application does.

bastimeyer commented 2 years ago

I fixed my build script. See my edit above, please.

takluyver commented 2 years ago

Sorry, I missed that you'd already tried without setting detached. I was just figuring that if we could replicate the error with Python as the parent process instead of Node, that might help to pin down exactly what conditions make it fail.

vsajip commented 2 years ago

Can run.js be changed so that all output from the child process is immediately printed to the console? That would establish whether the "Started." message isn't output at all, or that it is output but the output is currently buffered in Node somewhere, and doesn't appear for that reason.

vsajip commented 2 years ago

it appears that Node's default options for stdio are equivalent to passing stdin=PIPE, stdout=PIPE, stderr=PIPE to subprocess in Python. It might be worth trying that in Python.

I made a small change to pyrun.py, replacing the p.wait() with

    p = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE, stdin=subprocess.PIPE)
    stdout, stderr = p.communicate()
    print('child stdout:\n%s' % stdout.decode('utf-8'))
    if stderr:
        print('child stderr:\n%s' % stderr.decode('utf-8'))

The result is the same, i.e. no error:

~\Projects\distlib-bug (master) > python3 pyrun.py test-t64.exe a b c 
child stdout:
3.8.10 (tags/v3.8.10:3d8993a, May  3 2021, 11:48:03) [MSC v.1928 64 bit (AMD64)]
['C:\\Users\\Vinay\\Projects\\distlib-bug\\test-t64.exe', 'a', 'b', 'c']
C:\Users\Vinay\Projects\distlib-bug\python64\python.exe

Program returned 0
bastimeyer commented 2 years ago

Can run.js be changed so that all output from the child process is immediately printed to the console?

It already does, because the spawn event is fired upon launch success and the internal IO streams are already created via the default options.stdio config. The pipe() calls then simply drain those stream instances by piping them into the Node process's own output streams. https://nodejs.org/api/child_process.html#event-spawn

execFile can also be used, but that's just a different public API. This won't change anything. Let me create a run-script for it anyway. https://nodejs.org/api/child_process.html#child_processexecfilefile-args-options-callback


Please see the build/run log here: https://github.com/bastimeyer/distlib-bug/runs/4970446370?check_suite_focus=true#step:7:877 https://github.com/bastimeyer/distlib-bug/compare/28ef45d..4c5e665

As I said earlier and as you can see from the log, distlib's 277cc4eb2eebbca4742004d1fcc6c08e1c70c3b9 is the bad commit: https://github.com/pypa/distlib/commit/277cc4eb2eebbca4742004d1fcc6c08e1c70c3b9 but without any source code changes in this commit and the following one https://github.com/pypa/distlib/commit/7fae7d07c1b7144d1774c9b578e4f119b844ae0f or even the previous one https://github.com/pypa/distlib/commit/16e32ca69cf5382320fa62a62694426c502e0d2c nobody except you will be able to figure this out.

Why were there commits with only binary data anyway?

vsajip commented 2 years ago

but without any source code changes in this commit and the following one

Well, the launcher sources are in a separate project, you can see the commit history here:

https://bitbucket.org/vinay.sajip/simple_launcher/commits/

I could be more disciplined about copying the launcher sources over every time I generate new executables for distlib, sure. But there was a lot of to-ing and fro-ing about various options for fixing that pip issue (and the issue has a lot more detail) and I suppose it got missed. However, the sources are public in the simple_launcher project.

vsajip commented 2 years ago

[so that] all output from the child process is immediately printed to the console?

It already does

Then that suggests that the launcher executable isn't even being started, doesn't it? If it were, then "Started." should have been printed on the terminal.

[I thought I posted this comment just now, but it hasn't appeared, so I typed it in again].

bastimeyer commented 2 years ago

It looks like pip 22 has unfortunately just upgraded to distlib 0.3.4. https://github.com/pypa/pip/tree/22.0/src/pip/_vendor/distlib

carlkl commented 2 years ago

@vsajip, https://github.com/takluyver/pynsist/issues/243#issuecomment-1019996583

As far as I know, error code 0xC0000005 is an access violation, so I'm not entirely sure how that could have been caused by the launcher changes - have you (@takluyver) been able to reproduce this error with different scripts? Or could it be something the streamlink script is doing which, together with launcher changes, is causing the issue for them?

please consider trying out CREATE_BREAKAWAY_FROM_JOB or CREATE_SUSPENDED | CREATE_BREAKAWAY_FROM_JOB in CreateProcessW in launcher.c from distlib to mitigate 0xC0000005. I had exactly the same problem years ago with launchers based on https://github.com/pfmoore/shimmy and at least in my case it was helpful.

vsajip commented 1 year ago

Note that distlib 0.3.5 has been released with updated launcher executables, and pip has re-vendored, though there may not yet be a release. You might want to try with those launchers and see if this issue can be put to bed.

bastimeyer commented 1 year ago

Thanks for the hint, @vsajip. I can confirm that distlib==0.3.5 is working fine now.

bastimeyer commented 1 year ago

I should've not closed this issue yet, but since it doesn't really belong on this issue tracker here, I won't reopen it.

distlib==0.3.5 introduces a new problem and the processes run from the w{32,64}.exe executables can't be killed anymore from NodeJS. When launching the t{32,64}.exe executables, a new console window gets opened, which is bad, hence the need of the w executables.

@vsajip Do you want me to open a new issue on the distlib issue tracker with a reproduction?

vsajip commented 1 year ago

Yes, please do.