python / cpython

The Python programming language
https://www.python.org
Other
63.14k stars 30.23k forks source link

The platform module can cause crashes in Windows due to slow WMI calls #125315

Open runn opened 1 week ago

runn commented 1 week ago

Crash report

What happened?

When running on a virtual machine where WMI calls seems to have very variable performance the WMI C module can cause Python to crash.

If you have (or can simulate slow WMI calls) then simple python code should (non-deterministically) reproduce the problem. The reason it's non-deterministic is because It's a thread race with a shared resource on the stack. WMI thread in CPython can end up with a pointer to a now invalid stack frame.

I can cause the problem by repeatedly calling platform.machine() and platform.win32_ver() in a loop of about 100 iterations on a machine with slow WMI calls.

import platform

for i in range(100):
    platform.win32_ver()
    platform.machine()

On the affected machines this will sometimes cause the whole process to die with error that relates to the stack being trashed, such as 0xC0000409 where the stack canary has been overwritten.

From a crash dump (that i cannot share) I debugged this issue by taking the WMI module and running on its own. I notice in the code that there is a timeout that seems to have been created because the WMI calls themselves can be quite slow, especially in the case of permission problems where WMIs own timeout is quite long.

https://github.com/python/cpython/blob/main/PC/_wmimodule.cpp#L282

The problem is that this timeout can cause the function that the platform modules uses to finish before the thread running the WMI code does. This is a bit of a problem because the thread is using a pointer to a struct that is allocated on the stack that is about to go away.

https://github.com/python/cpython/blob/main/PC/_wmimodule.cpp#L241

That struct has handles to a bunch of things that the WMI thread wants to use or clean up, including references to the ends of a pipe for which WriteFile calls are used.

In some situations python hangs, sometimes windows terminates it because it detected a stack overflow, sometimes it works, sometimes the timeout is fine, but it all depends on where the thread doing the WMI work was at the time the calling function terminates.

I can stop this problem by monkey patching the WMI calls in the platform module (it has alternative code paths that work ok). I can also stop it by removing the simulated timeout in the WMI module.

The problem is that lots of tools use the platform module - i first discovered this whilst using poetry when poetry install would just terminate, but it can affect anything that uses the platform module to make WMI calls on a machine with slow WMI.

There is no reasonable workaround on the virtual machines I use because they are managed by an organisation (as is the python install on those machines).

CPython versions tested on:

3.12

Operating systems tested on:

Windows

Output from running 'python -VV' on the command line:

No response

runn commented 1 week ago

I believe this is the commit that adds the timeout that exposes the problem with the stack allocated struct, though i think there might be a few code paths in here where problems will happen if _wmi_exec_query_impl exits before the thread is finished

https://github.com/python/cpython/commit/5a0137ca34deb6e1e2e890a52cb4b22d645c166b

related to this issue https://github.com/python/cpython/pull/112658

zooba commented 4 days ago

Acknowledged. We should fix this, I haven't had a chance to look into it yet. At a guess, we need to copy something from that shared struct into a local in the thread before we start work.

runn commented 4 days ago

Thanks!

I think the query string pointer could be dealt with (roughly) by just allocating the BSTR before COM init happens. At least there's a thread local copy quite quickly that'll likely beat the race.

If you copy the handle refs local to the thread there's a chance the calling thread will close them first then CloseHandle might cause InvalidHandle to be raised. Maybe no massive issue outside of structured exception handling? DuplicateHandle is a pain to use here.

In some ways I worry about using WMI. I know it's the canonical location for things like windows version information but WMI has caused me all sorts of trouble over the years so I'm a little cautious. Occasionally some file gets corrupted and WMI stops working. Occasionally it's very slow. It's dependent on permission, and the timeout there is long (and not configurable) - which i think caused the original waitforsingleobject calls with a timeouts to be added.

WMI is just a very complex beast under the hood i suppose: https://learn.microsoft.com/en-us/windows/win32/wmisdk/wmi-architecture

The fact that we basically stop using the WMI code in 3.13 the first time a timeout happens sort of adds to the argument that we should consider avoiding it altogether.

For various reasons I cannot contribute a PR even though i'd love to, so please accept my apologies for doing complaining and no contributing.

zooba commented 1 day ago

allocating the BSTR before COM init happens

Unfortunately, I'm pretty sure this is cheating 😆 COM init is needed to initialise the memory allocator used to allocate the BSTR.

The fact that we basically stop using the WMI code in 3.13 the first time a timeout happens sort of adds to the argument that we should consider avoiding it altogether.

If you know of another way to accurately get the OS version for display (not compatibility) purposes (bearing in mind that we've tried all the other obvious and non-obvious ones), I'd love to hear about it. The next best one seems to be to run cmd.exe and parse its ver command, which also breaks in similar ways, probably on more correctly configured systems than WMI will break on. Due to how Windows is built these days, we can't read the version info from any system files anymore.

There supposedly is not meant to be any permission restriction on reading the basic OS info for the current machine. Granted, plenty of other info requires more permissions, and I guess it's possible to forcibly override those sensible defaults. I personally haven't found a broken machine - all the timeouts were contributed by others who had, but couldn't explain why they were failing.

runn commented 4 hours ago

I'm likely quite out of date on all of this having not written any COM professionally for a good while, so I'm probably missing something totally obvious, but I thought you could use SysAllocString before CoInit because the allocator is available before CoInit is called.

https://learn.microsoft.com/en-us/windows/win32/api/objbase/nf-objbase-coinitialize#remarks

If i write a little test and call SysAllocString without CoInit then i get a proper BSTR back with the length, the embedded wchar and the pointer to it. SysFreeString and SysStringLen all work fine.

I always thought we needed coinit for any object creation, but there's a bunch of oleauto stuff that's ok. Though the docs for SysAllocString makes no claims either way. Like i say, it's been a while!

Anyway - i guess that's all besides the point. I just meant allocate it early and it's probably not even an especially useful comment in the first place :)

Even worse, I have no better alternatives outside of what the platform module already does in the case that the timeout is triggered in the WMI call. Microsoft really seems to push the idea that no one should depend on the version and check for compatibility instead, which isn't what is wanted here I suppose.

I suspect the WMI failures happen in a corp environment where machines have all sorts of invasive stuff installed that intercept Win32 or COM calls. I can only reproduce this sporadically. Love me a Heisenbug :)