python / cpython

The Python programming language
https://www.python.org
Other
63.19k stars 30.26k forks source link

test_ctypes.test_load_dll_with_flags Windows failure #84395

Closed aeros closed 4 years ago

aeros commented 4 years ago
BPO 40214
Nosy @pfmoore, @tjguk, @zware, @eryksun, @zooba, @miss-islington, @tirkarthi, @aeros
PRs
  • python/cpython#19404
  • python/cpython#19413
  • python/cpython#19449
  • python/cpython#19500
  • python/cpython#19652
  • python/cpython#19656
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = None closed_at = created_at = labels = ['type-bug', '3.8', '3.9', 'ctypes', 'tests', 'OS-windows'] title = 'test_ctypes.test_load_dll_with_flags Windows failure' updated_at = user = 'https://github.com/aeros' ``` bugs.python.org fields: ```python activity = actor = 'miss-islington' assignee = 'none' closed = True closed_date = closer = 'steve.dower' components = ['Tests', 'Windows', 'ctypes'] creation = creator = 'aeros' dependencies = [] files = [] hgrepos = [] issue_num = 40214 keywords = ['patch'] message_count = 27.0 messages = ['365887', '365890', '365892', '365900', '365902', '365910', '365919', '365953', '365954', '365955', '365970', '365971', '365993', '365999', '366003', '366044', '366045', '366307', '366316', '366319', '366320', '366322', '366324', '366327', '367013', '367014', '369661'] nosy_count = 8.0 nosy_names = ['paul.moore', 'tim.golden', 'zach.ware', 'eryksun', 'steve.dower', 'miss-islington', 'xtreak', 'aeros'] pr_nums = ['19404', '19413', '19449', '19500', '19652', '19656'] priority = 'normal' resolution = 'fixed' stage = 'resolved' status = 'closed' superseder = None type = 'behavior' url = 'https://bugs.python.org/issue40214' versions = ['Python 3.8', 'Python 3.9'] ```

    aeros commented 4 years ago

    In several recent PRs, test_ctypes.test_load_dll_with_flags is failing for the Azure Pipelines "Windows PR tests win32" and "Windows PR tests win64" with the following error message:

    ======================================================================
    ERROR: test_load_dll_with_flags (ctypes.test.test_loading.LoaderTest) [WinDLL('_sqlite3.dll', winmode=0)]
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "d:\a\1\s\lib\ctypes\test\test_loading.py", line 140, in should_pass
        subprocess.check_output(
      File "d:\a\1\s\lib\subprocess.py", line 419, in check_output
        return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
      File "d:\a\1\s\lib\subprocess.py", line 533, in run
        raise CalledProcessError(retcode, process.args,
    subprocess.CalledProcessError: Command '['d:\\a\\1\\s\\PCbuild\\win32\\python.exe', '-c', "from ctypes import *; import nt;WinDLL('_sqlite3.dll', winmode=0)"]' returned non-zero exit status 1.
    
    ======================================================================
    ERROR: test_load_dll_with_flags (ctypes.test.test_loading.LoaderTest) [WinDLL('_sqlite3.dll', winmode=0)]
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "d:\a\1\s\lib\ctypes\test\test_loading.py", line 140, in should_pass
        subprocess.check_output(
      File "d:\a\1\s\lib\subprocess.py", line 419, in check_output
        return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
      File "d:\a\1\s\lib\subprocess.py", line 533, in run
        raise CalledProcessError(retcode, process.args,
    subprocess.CalledProcessError: Command '['d:\\a\\1\\s\\PCbuild\\win32\\python.exe', '-c', "from ctypes import *; import nt;WinDLL('_sqlite3.dll', winmode=0)"]' returned non-zero exit status 1.

    Since this only started occurring recently in several unrelated PRs, I suspect it was most likely an intermittent regression introduced in master. Here are the PRs I have seen the same exact error occur in so far:

    https://github.com/python/cpython/pull/18239 https://github.com/python/cpython/pull/19403 https://github.com/python/cpython/pull/19402 https://github.com/python/cpython/pull/19399

    I was unable to reproduce it locally on my secondary boot of Windows 10.

    zware commented 4 years ago

    My best guess at the moment is that something changed in the underlying Windows image, particularly given the comment above the failing line that "insecure load flags should succeed" (though that could easily be a red herring). Until someone who actually does understand what's going on here (Steve? :)) can deal with it properly, I have a PR to disable the problematic part of the test that I'll merge shortly.

    zware commented 4 years ago

    New changeset f407e209c1e35b64835f73e7e7ca23e33817e9fe by Zachary Ware in branch 'master': bpo-40214: Temporarily disable a ctypes test (GH-19404) https://github.com/python/cpython/commit/f407e209c1e35b64835f73e7e7ca23e33817e9fe

    aeros commented 4 years ago

    Zachary Ware wrote:

    My best guess at the moment is that something changed in the underlying Windows image, particularly given the comment above the failing line that "insecure load flags should succeed" (though that could easily be a red herring).

    Also, now that I think about it, the the test failure has only specifically occurred on Azure Pipelines and not on the GitHub Actions Windows PR tests (as far as I have seen), which does further increase the likelihood that it's an issue with the Windows image being used in Azure for the PR tests.

    Regardless of the cause though, thanks for applying a temporary fix/skip in the meantime, these types of issues can be troublesome for CPython's workflow when it occurs on required CI checks. Hopefully Steve Dower or someone similarly knowledgeable in the area can definitively figure out the cause.

    zooba commented 4 years ago

    Maybe something else was installed that put an incompatible _sqlite3.dll on PATH? Thereby proving the inherent risk of using unsafe DLL load settings :)

    If we're not already overriding PATH for the subprocess in this test (should only need "%SystemRoot%;%SystemRoot%\System32"), we may have to. Though if someone is installing an incompatible _sqlite3.dll into a system directory then we're in a bit of trouble.

    Disabling this part of the test as inherently untestable because it's an inherently unsafe operation is fine by me :)

    eryksun commented 4 years ago

    Maybe something else was installed that put an incompatible _sqlite3.dll on PATH? Thereby proving the inherent risk of using unsafe DLL load settings

    should_pass() sets the working directory of the test process to the tmp directory. The loader normally checks the working directory before PATH. But maybe the system itself is configured to disallow loading DLLs from the working directory. There's a registry setting for that, but it's little known and rarely used because it's disruptive in general.

    zware commented 4 years ago

    Assuming I implemented my checks correctly (see PR19413), I think I've just debunked both of our leading theories.

    Results here: https://dev.azure.com/Python/cpython/_build/results?buildId=60764&view=logs&j=d554cd63-f8f4-5b2d-871b-33e4ea76e915&t=5a14d0eb-dbd4-5b80-f5d0-7909f950a1cc

    zooba commented 4 years ago

    It's probably the sqlite3.dll dependency that's failing, not the _sqlite3.dll (originally _sqlite3.pyd) one.

    The test is verifying that dependent DLLs are loaded correctly. So I assume in this case the _sqlite3.dll is being loaded, but it's finding the wrong sqlite3.dll.

    Since I could, I updated Zach's PR to check for other sqlite3.dll files. The new build should be at https://dev.azure.com/Python/cpython/_build/results?buildId=60785&view=logs&j=c83831cd-3752-5cc7-2f01-8276919eb334 in a few minutes.

    zooba commented 4 years ago

    It might also be a "sqlite3_d.dll". Updated build is https://dev.azure.com/Python/cpython/_build/results?buildId=60787&view=results

    zooba commented 4 years ago

    Found it (and it's kind-of us):

    Checking C:\Program Files\Amazon\AWSCLIV2\ Found at C:\Program Files\Amazon\AWSCLIV2\sqlite3.dll Found at C:\Program Files\Amazon\AWSCLIV2\_sqlite3.pyd

    But I'm not sure why that is getting loaded earlier than the current directory. Is that the behaviour we went for here?

    (FWIW, we build and test a release build, not a debug build, which is why we're not looking for sqlite3_d.dll... but perhaps we should be using a debug build? That might be slower, but the extra validation is probably worthwhile...)

    aeros commented 4 years ago

    Steve Dower wrote:

    (FWIW, we build and test a release build, not a debug build, which is why we're not looking for sqlite3_d.dll... but perhaps we should be using a debug build? That might be slower, but the extra validation is probably worthwhile...)

    In general, I think a debug build instead of a release would make more sense for the PR tests. Do you have a general estimate or rough idea as to how much slower it would be in comparison?

    tirkarthi commented 4 years ago

    Is this going to be backported? It seems backports also use the same build and have this error. Sample 3.8 backport build that seems to be related to this issue : https://dev.azure.com/Python/cpython/_build/results?buildId=60753&view=logs&j=d554cd63-f8f4-5b2d-871b-33e4ea76e915&t=5a14d0eb-dbd4-5b80-f5d0-7909f950a1cc&l=1570

    zware commented 4 years ago

    Feel free to backport PR19404 as needed, but mark versions here appropriately to make sure the *real* fix makes it where it needs to go.

    zooba commented 4 years ago

    Do you have a general estimate or rough idea as to how much slower it would be in comparison?

    It's one sample point, but compare https://buildbot.python.org/all/#/builders/129/builds/708 to https://github.com/python/cpython/runs/571497886

    Compile time: 3:32 (release) -> 1:10 (debug) Test time: 12:28 (release) -> 15:31 (debug)

    Though the test timing vary wildly, as some tests cause more contention than others and they run in a random order.

    aeros commented 4 years ago

    Steve Dower wrote:

    It's one sample point, but compare https://buildbot.python.org/all/#/builders/129/builds/708 to https://github.com/python/cpython/runs/571497886

    FWIW, I'd be +1 in favor for using the debug build then. A few additional minutes would be well worth having more thorough PR tests IMO. If it becomes an issue, we can always revert it back, but it seems that there's little to no harm in trying it out.

    vstinner commented 4 years ago

    Python 3.8 is also affected. Example: https://github.com/python/cpython/pull/19444#issuecomment-611499825

    I backported the change to skip the test until it's fixed: PR 19449.

    miss-islington commented 4 years ago

    New changeset c83f003ee5398e9c27a0c634329420003d607d46 by Miss Islington (bot) in branch '3.8': bpo-40214: Temporarily disable a ctypes test (GH-19404) https://github.com/python/cpython/commit/c83f003ee5398e9c27a0c634329420003d607d46

    eryksun commented 4 years ago

    But I'm not sure why that is getting loaded earlier than the current directory. Is that the behaviour we went for here?

    I don't understand what's going on here if %PATH% is interfering. The current directory (%CD%) should be checked before %PATH% with the standard DLL search order for desktop applcations. That's what I've observed in practice and how it's documented 1:

    SafeDllSearchMode enabled:
    
    1. %__APPDIR__%
    2. %SystemRoot%\System32
    3. %SystemRoot%\System
    4. %SystemRoot%
    5. %__CD__%
    6. %PATH%
    
    SafeDllSearchMode disabled:
    
    1. %__APPDIR__%
    2. %__CD__%
    3. %SystemRoot%\System32
    4. %SystemRoot%\System
    5. %SystemRoot%
    6. %PATH%
    zooba commented 4 years ago

    I don't understand what's going on here if %PATH% is interfering.

    Yeah, as shown by the fact my fix didn't fix it :)

    I'm going to try adding some more diagnostics around this to find out exactly where it's failing. I suspect it's a dependency, rather than the direct file.

    zooba commented 4 years ago

    So I collected more info, and it seems the search order we're getting for _sqlite3.dll (.pyd) is:

    And then we never look in CWD. I wonder if it's being excluded because it's under %TEMP%?

    I can't reproduce it either with a long PATH or with it in TEMP, so there may be a policy setting involved. Trying again with the test not using TEMP (it shouldn't be using the real TEMP anyway...)

    zooba commented 4 years ago

    Okay, I've wasted enough time on this. Let's just disable the test permanently and say that the winmode=0 behaviour is system-specific and is not tested.

    eryksun commented 4 years ago

    it seems the search order we're getting for _sqlite3.dll (.pyd) is:

    • app directory
    • SQL server install directory (?)
    • Windows/System32
    • Windows
    • %PATH% (which is truncated at 2190 chars)

    Thank you. It finally makes sense. The parent process is calling SetDllDirectoryW 1, which also replaces the current working directory in the DLL search path of child processes (inherited via the PEB ProcessParameters->DllPath) 2:

    Note: The standard search order of the process will also be 
    affected by calling the SetDllDirectory function in the parent
    process before start of the current process.

    In that case, all we have to do is restore the original search path by calling SetDllDirectoryW(NULL). For example:

        import ctypes
        kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
        if not kernel32.SetDllDirectoryW(None):
            raise ctypes.WinError(ctypes.get_last_error())
    zooba commented 4 years ago

    You mean the CI agent is doing it? Because there's nowhere in Python that should be doing it.

    I'd rather not blanket change this option in the test suite (except to turn it on :) ), so that code snippet probably needs to shrink down into a one-liner that can go with the "-c" just for these tests.

    eryksun commented 4 years ago

    You mean the CI agent is doing it? Because there's nowhere in Python that should be doing it.

    The ProcessParameters->DllPath value is inherited until a process in the tree reverts to the original search path via SetDllDirectoryW(NULL), so it can be any ancestor process, not just the immediate parent process.

    I'd rather not blanket change this option in the test suite (except to turn it on :) ), so that code snippet probably needs to shrink down into a one-liner that can go with the "-c" just for these tests.

    The test could call GetDllDirectoryW 1 to save and restore the previous value. Unfortunately, GetDllDirectoryW returns 0 for both the default case (NULL) and when the working directory is disabled by setting an empty string (""). It would have to assume one or the other.

    Otherwise if you just want a one-liner: "import ctypes; ctypes.windll.kernel32.SetDllDirectoryW(None)".

    zooba commented 4 years ago

    New changeset 9b498939009f49b8c772c89e8fc80efbfd8afcb5 by Steve Dower in branch 'master': bpo-40214: Fix ctypes WinDLL test with insecure flags (GH-19652) https://github.com/python/cpython/commit/9b498939009f49b8c772c89e8fc80efbfd8afcb5

    zooba commented 4 years ago

    Looks like that fix has done it. Thanks, Eryk!

    miss-islington commented 4 years ago

    New changeset 0cc7becde0bfe896fd23b5cb14fedfb8f2066fca by Miss Islington (bot) in branch '3.8': bpo-40214: Fix ctypes WinDLL test with insecure flags (GH-19652) https://github.com/python/cpython/commit/0cc7becde0bfe896fd23b5cb14fedfb8f2066fca