python / cpython

The Python programming language
https://www.python.org
Other
63.37k stars 30.33k forks source link

CPython DLL initialization routine failed from PYC cache file #86695

Closed 8d2ceb9f-caae-40fc-87bd-93204db558b1 closed 3 years ago

8d2ceb9f-caae-40fc-87bd-93204db558b1 commented 3 years ago
BPO 42529
Nosy @pfmoore, @tjguk, @zware, @eryksun, @zooba, @WildCard65, @Thrameos
Files
  • Capture.PNG: Attachment of zipimport exception
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = None closed_at = created_at = labels = ['type-bug', '3.9', 'OS-windows'] title = 'CPython DLL initialization routine failed from PYC cache file' updated_at = user = 'https://github.com/Thrameos' ``` bugs.python.org fields: ```python activity = actor = 'Thrameos' assignee = 'none' closed = True closed_date = closer = 'Thrameos' components = ['Windows'] creation = creator = 'Thrameos' dependencies = [] files = ['49665'] hgrepos = [] issue_num = 42529 keywords = [] message_count = 36.0 messages = ['382275', '382277', '382846', '382854', '382855', '383025', '383158', '383176', '383234', '383251', '383254', '383255', '383256', '383305', '383325', '383333', '383337', '383339', '383355', '383356', '383358', '383363', '383373', '383374', '383375', '383389', '383390', '383393', '383394', '383395', '383416', '383418', '383420', '383421', '383429', '383432'] nosy_count = 7.0 nosy_names = ['paul.moore', 'tim.golden', 'zach.ware', 'eryksun', 'steve.dower', 'WildCard65', 'Thrameos'] pr_nums = [] priority = 'normal' resolution = 'third party' stage = 'resolved' status = 'closed' superseder = None type = 'behavior' url = 'https://bugs.python.org/issue42529' versions = ['Python 3.9'] ```

    8d2ceb9f-caae-40fc-87bd-93204db558b1 commented 3 years ago

    While trying to use JPype on Windows Python 3.9.0, we are running into an bizarre issue with loading the internal module which is written in C. When running a python script the first time the internal module loads correctly. However, the second time that script is run the internal module reports "A dynamic link library (DLL) initialization routine failed." If you then erase the pyc cache file that is importing the internal module then it works again. This only occurs on Windows and was not present using the same source prior versions of Python 3.8.

    We investigate the byte codes from both version and they are doing the same series of actions so the problem appears to be calling the same opcode to execute an import.

    I make sure all required symbols were found in the libraries and only only copy of the internal DLL was the same with and without loading from a pyc.

    It may be a change in the requirements of module initialization, but I don't know how to proceed. There was one deprecation warning but correcting that did not alter the outcome.

    It appears that the execute path for importing a CPython module takes a different path when the script was imported from something compiled on the fly as opposed to loading from a pyc.

     import: 'jpype'
     Traceback (most recent call last):
        File "D:\bld\jpype1_1605785280189\test_tmp\run_test.py", line 2, in <module>
          import jpype
        File "D:\bld\jpype1_1605785280189\_test_env\lib\site-packages\jpype\__init__.py", line 18, in <module>
          import _jpype
     ImportError: DLL load failed while importing _jpype: A dynamic link library (DLL) initialization routine failed.
    zooba commented 3 years ago

    There really shouldn't be any difference between those two paths, apart from skipping the compile step in the second case.

    It sounds like you're pretty well set up for testing, so would you be able to try the test while running under either the Visual Studio debugger (easiest way is to add an "input()" to your code, run to there, then Debug/Attach and press enter to let it continue) or WinDbg if you prefer that? That will give much more information about which DLL is actually failing, and possibly why.

    8d2ceb9f-caae-40fc-87bd-93204db558b1 commented 3 years ago

    I managed to get the debugger attached (unfortunately I am not a windows programmer so I don't use these tools).

    It appears when loading from a pyc, it is attempting to open the directory as a Zip file which is throwing an exception resulting in a failure to load the internal _jpype module.

    Unfortunately this is outside my area as nothing in jpype nor _jpype is calling the zipimport module. So it must be something internal to Python. It is a major issue as I can't release to anaconda on windows until I can resolve the problem. Can you direct me how to proceed?

    Details:

    zipimport.ZipImportError Message=not a Zip file StackTrace: \<frozen zipimport>:88 in __init__

    Replicating the problem is simple. Download jpype from git, use "python setup.py develop" to create the module, run python -c "import jpype", and repeat python -c "import jpype"

    Screen shot of the visual studio error is shown.

    zooba commented 3 years ago

    Replicating also requires installing Java (I got as far as the build failing), which is going to have to wait until I have time to spin up a throwaway machine :)

    That exception is a normal part of the import process and should be being handled. It's not the same one that eventually terminates the program.

    I'd continue past that one and keep continuing until you hit one that looks like "DLL initialization routine failed". That *should* come out of the DLL that is actually failing, which Python isn't able to identify, and will give us hints toward the actual root cause.

    8d2ceb9f-caae-40fc-87bd-93204db558b1 commented 3 years ago

    Oddly that was the only exception that I got. When I hit continue it proceeded without issue and the dll was loaded correctly. However, when I try without the debugger attached the error reappears. So this is one of those Schrodinger errors. I know the error is there because I see it when we run on the CI and on a local machine, but all attempts to isolate it say that is should not exist. There can't really be a path that is different when compiling then loading from a pyc unless this is some type of uninitialized variable issues where we just happened to have been unfortunate enough to hit it.

    8d2ceb9f-caae-40fc-87bd-93204db558b1 commented 3 years ago

    Any progress on this item? I am seeing additional reports of this error in the conda stream tracker and elsewhere. As it only occurs when the debugger is not hooked up I don't have much traction to make progress myself.

    zooba commented 3 years ago

    Sorry, I haven't had a chance to set up a test machine with all the requirements.

    It's almost certainly something in jpype, to be clear. Most likely it loads a DLL that hasn't been loaded yet, but does it under conditions where it won't load. Nothing we build as part of CPython should do anything interesting on load, at least nothing we can change/fix. So jpype will probably need an update to import something manually.

    The only thing I can think of that changed along these lines in 3.9.0 is that we no longer LoadLibrary("api-ms-win-core-path-l1-1-0.dll") at startup (because we can assume that it's present now). That shouldn't matter at all unless jpype is doing some really weird stuff, as it's a system library with no code. But maybe it'll give someone a hint who knows the module better?

    8d2ceb9f-caae-40fc-87bd-93204db558b1 commented 3 years ago

    I am fairly sure this is a Python "bug" in the sense that there was some change in undocumented change in requirements and the distutils pattern for building a module no longer reflects that requirement. That said very likely JPype is the only module to be affected and thus I will have to manually adjust to account for the new requirement.

    Unfortunately as far as developers, I am it so fixing it (with your assistance) is going to have to fall on me. So lets do a run down of how this all working so you can point me where to look.

    1) JPype gets built as an ordinary setup.py CPython module. So it is possibly a bug in the build pattern of the customizers in JPype that was exposed by Python 3.9. I am just going to cut and paste so that it is easy to follow along.

    jpype/setup.py

    ...
    from setuptools import Extension
    ...
    import setupext
    ...
    jpypeLib = Extension(name='_jpype', **setupext.platform.Platform(
        include_dirs=[Path('native', 'common', 'include'),
                      Path('native', 'python', 'include'),
                      Path('native', 'embedded', 'include')],
        sources=[Path('native', 'common', '*.cpp'),
                 Path('native', 'python', '*.cpp'),
                 Path('native', 'embedded', '*.cpp')], platform=platform,
    ))

    We have two sets of customization in setup.py. Both are included from a module called jpype/setupext/

    The key one is the jpype/setupext/platform.py which defines the compiler flags. There are two sections of interest...

    jpype/setupext/platform.py contains these modifications for win32.
    (So there is the Advapi32, not sure why it appears there because this section is all before my time as this was started in 2001 and I joined in 2018)

        static = True
        if platform == 'win32':
            distutils.log.info("Add windows settings")
            platform_specific['libraries'] = ['Advapi32']
            platform_specific['define_macros'] = [('WIN32', 1)]
            if sys.version > '3':
                platform_specific['extra_compile_args'] = [
                    '/Zi', '/EHsc', '/std:c++14']
            else:
                platform_specific['extra_compile_args'] = ['/Zi', '/EHsc']
            platform_specific['extra_link_args'] = ['/DEBUG']
            jni_md_platform = 'win32'

    The second section is currently commented out. Though it is relevant because JPype is exotic in one way. It is a module which is loaded in three ways. When imported from Python it is an ordinary library (1) which will later pull in Java which will then load library a second time (2) to bind Java native methods. It can also be used to launch Python if Java starts the session (3). In that case, it needs libpython.dll to be bound to module so that the Java equivalent to LoadLibrary can work. When it does Java first manually loads libpython.dll then loads the module and calls the hook to get Python started.

    # This code is used to include python library in the build when starting Python from
    # within Java.  It will be used in the future, but is not currently required.
    #    if static and sysconfig.get_config_var('BLDLIBRARY') is not None:
    #        platform_specific['extra_link_args'].append(sysconfig.get_config_var('BLDLIBRARY'))

    The actual buildext has a few minor patches so that Java libraries can run through the normal process. But nothing seems like a good candidate

    We have one section tweeking some of the build options.

        def initialize_options(self, *args):
            """omit -Wstrict-prototypes from CFLAGS since its only valid for C code."""
            self.android = False
            self.makefile = False
            self.jar = False
            import distutils.sysconfig
            cfg_vars = distutils.sysconfig.get_config_vars()
            replacement = {
                '-Wstrict-prototypes': '',
                '-Wimplicit-function-declaration': '',
            }
            tracing = self.distribution.enable_tracing
    
            # Arguments to remove so we set debugging and optimization level
            remove_args = ['-O0', '-O1', '-O2', '-O3', '-g']
    
            for k, v in cfg_vars.items():
                if not isinstance(v, str):
                    continue
                if not k == "OPT" and not "FLAGS" in k:
                    continue
    
                args = v.split()
                for r in remove_args:
                    args = list(filter((r).__ne__, args))
    
                cfg_vars[k] = " ".join(args)
            super().initialize_options()

    Then later we interrupt the build process for Java.

        def build_extension(self, ext):
            if ext.language == "java":
                return self.build_java_ext(ext)
            if self.jar:
                return
            print("Call build ext")
            return super().build_extension(ext)

    2) Next we have the module start. I am guessing this is not the source of the error because adding a printf shows we never got to this point.

    jpype/native/pyjp_module.cpp

    PyMODINIT_FUNC PyInit__jpype()
    {
            JP_PY_TRY("PyInit__jpype");
            JPContext_global = new JPContext();
            // This is required for python versions prior to 3.7.
            // It is called by the python initialization starting from 3.7,
            // but is safe to call afterwards.
            PyEval_InitThreads();
    
            // Initialize the module (depends on python version)
            PyObject* module = PyModule_Create(&moduledef);
            Py_INCREF(module);
            PyJPModule = module;
            PyModule_AddStringConstant(module, "__version__", "1.2.1_dev0");
             ... initialize a bunch of resources
         }
    

    3) After this point we will start getting exotic. JPype plugs into Python in with less that above board methods. In order to operate it needs to derive a bunch of Python basic types in ways that the API does not allow. As Python does not support true multiple inheritance of concrete types and I need to add slots that should not exist, it will use a custom memory allocator to offset the addresses of object pointers such that a hidden Java slot appears on types type, object, long, double, exception, etc. To do this it will forward declare some of the private Python symbols having to do with heap types and stacktrace frames, etc. There is also one monkey patch for the "function" type in which it temporarily disables the "final" field so that we create a derived class that should not exist for Java methods, then restores the field immediately after.

    This is relevant because if the symbols it is pulling in are missing for some reason then the dll will fail to load. However, given it will load when it isn't compiled and not load when compiled there is not much chance the symbol table should be different. Unless windows is really different from other architectures, this is a long shot.

    3) Okay so on to module loading

    import _jpype

    This statement works when the debugger is attached, when called from a .py file but fails when called from a .pyc file. I have looked at it all the way down to the byte code. There appears to be nothing in JPype that is going into that process. This just jumps straight into Python core code.

    4) Post load a whole lot of stuff happens. We patch in the new types, and then use LoadLibrary to pull in jvm.dll. As we work on many JVM versions and we won't know which is requested, JPype is not actually compiled against the jvm. Instead it manually loads all its symbols using GetProcAddress in jpype/native/jp_platform.cpp.

    So that is all there is (dirty laundary and all). Does anything look suspect?

    zooba commented 3 years ago

    Monkeypatching native types looks suspect as anything, but probably not the cause here. Good luck :)

    So I checked some of our code around dynamic loads and it's got to be caused by the loader in our LoadLibrary call, not when we call your module init function. That rules out basically anything you're doing once the module is loaded.

    The only thing I can think of that may be missing in the .pyc case is encodings. They *should* be being imported automatically on load, but perhaps not as we're constantly trying to improve startup time. You could try adding "import codecs" and then codecs.lookup some codecs.

    You might also try running Python in verbose mode ("python -v ...") and comparing the output between the two cases. That could show up what extra things are being skipped in the failing case.

    8d2ceb9f-caae-40fc-87bd-93204db558b1 commented 3 years ago

    Thanks, I will see if I can get additional diagnostics today. Otherwise I will have to start recompiling Python with diagnostic hooks. I have never had to rebuild python on windows so it may take a while to figure out how to make progress.

    The monkey patch is less scary than it looks. Since I only need these slots of specific basic types that I designate (not user selected), the extra types all extend without altering the underlaying type, and the class layout is controlled by a meta class so that only Java can create these type objects, it is reasonably robust. However, yes changes in Python can break it. It is basically an attempt to create a mixin for concrete classes without the benefit of a dictoffset type slot. It took about a month do develop something that safely added the slot without overrunning, leaking or creating other memory issues (and played nice when Python added its own dict slot).

    Unfortunately, as I do many 100s on JSlot tests during method resolution using a dict based approach was unacceptably slow. I would love if there was a formal method to create multiple inheritance mixins for Python basic types, but that is another issue.

    /**
     * Allocate a new Python object with a slot for Java.
     *
     * We need extra space to store our values, but because there
     * is no way to do so without disturbing the object layout.
     * Fortunately, Python already handles this for dict and weakref.
     * Python aligns the ends of the structure and increases the
     * base type size to add additional slots to a standard object.
     *
     * We will use the same trick to add an additional slot for Java
     * after the end of the object outside of where Python is looking.
     * As the memory is aligned this is safe to do.  We will use
     * the alloc and finalize slot to recognize which objects have this
     * extra slot appended.
     */
    PyObject* PyJPValue_alloc(PyTypeObject* type, Py_ssize_t nitems )
    {
        // Modification from Python to add size elements
        const size_t size = _PyObject_VAR_SIZE(type, nitems + 1) + sizeof (JPValue);
        PyObject *obj = (PyType_IS_GC(type)) ? _PyObject_GC_Malloc(size)
                : (PyObject *) PyObject_MALLOC(size);
        if (obj == NULL)
            return PyErr_NoMemory(); // GCOVR_EXCL_LINE
        memset(obj, 0, size);
    
        Py_ssize_t refcnt = ((PyObject*) type)->ob_refcnt;
        if (type->tp_itemsize == 0)
            PyObject_Init(obj, type);
        else
            PyObject_InitVar((PyVarObject *) obj, type, nitems);
    
        // This line is required to deal with Python bug (GH-11661)
        // Some versions of Python fail to increment the reference counter of
        // heap types properly.
        if (refcnt == ((PyObject*) type)->ob_refcnt)
            Py_INCREF(type);  // GCOVR_EXCL_LINE
    
        if (PyType_IS_GC(type))
        {
            PyObject_GC_Track(obj);
        }
        return obj;
    }
    
    bool PyJPValue_hasJavaSlot(PyTypeObject* type)
    {
        if (type == NULL
                || type->tp_alloc != (allocfunc) PyJPValue_alloc
                || type->tp_finalize != (destructor) PyJPValue_finalize)
            return false;  // GCOVR_EXCL_LINE
        return true;
    }
    
    // Check for a Java slot (must work on all object types)
    Py_ssize_t PyJPValue_getJavaSlotOffset(PyObject* self)
    {
        PyTypeObject *type = Py_TYPE(self);
        if (type == NULL
                || type->tp_alloc != (allocfunc) PyJPValue_alloc
                || type->tp_finalize != (destructor) PyJPValue_finalize)
            return 0;
        Py_ssize_t offset;
        Py_ssize_t sz = Py_SIZE(self);
        // I have no clue what negative sizes mean
        if (sz < 0)
            sz = -sz;
        if (type->tp_itemsize == 0)
            offset = _PyObject_VAR_SIZE(type, 1);
        else
            offset = _PyObject_VAR_SIZE(type, sz + 1);
        return offset;
    }
    
    JPValue* PyJPValue_getJavaSlot(PyObject* self)
    {
        Py_ssize_t offset = PyJPValue_getJavaSlotOffset(self);
        if (offset == 0)
            return NULL;
        JPValue* value = (JPValue*) (((char*) self) + offset);
        if (value->getClass() == NULL)
            return NULL;
        return value;
    }
    
    void PyJPValue_free(void* obj)
    {
        // Normally finalize is not run on simple classes.
        PyTypeObject *type = Py_TYPE(obj);
        if (type->tp_finalize != NULL)
            type->tp_finalize((PyObject*) obj);
        if (type->tp_flags & Py_TPFLAGS_HAVE_GC)
            PyObject_GC_Del(obj);
        else
            PyObject_Free(obj);  // GCOVR_EXCL_LINE
    }
    
    void PyJPValue_finalize(void* obj)
    {
        JPValue* value = PyJPValue_getJavaSlot((PyObject*) obj);
        ... destroy the objects hidden slot if it exists
    }
    
    8d2ceb9f-caae-40fc-87bd-93204db558b1 commented 3 years ago

    Without the pyc everything goes fine...

    ...
    _bootlocale' # <_frozen_importlib_external.SourceFileLoader object at 0x00000222E9FD5A30>
    import 'site' # <_frozen_importlib_external.SourceFileLoader object at 0x00000222E9F88F40>
    Python 3.9.0 (tags/v3.9.0:9cf6752, Oct  5 2020, 15:34:40) [MSC v.1927 64 bit (AMD64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    # code object from c:\users\nelson85\documents\devel\open\jpype\jpype\__init__.py
    # created 'c:\\users\\nelson85\\documents\\devel\\open\\jpype\\jpype\\__pycache__\\__init__.cpython-39.pyc'
    # extension module '_jpype' loaded from 'c:\\users\\nelson85\\documents\\devel\\open\\jpype\\_jpype.cp39-win_amd64.pyd'
    # extension module '_jpype' executed from 'c:\\users\\nelson85\\documents\\devel\\open\\jpype\\_jpype.cp39-win_amd64.pyd'
    import '_jpype' # <_frozen_importlib_external.ExtensionFileLoader object at 0x00000222EA4C6250>
    # c:\users\nelson85\documents\devel\open\jpype\jpype\__pycache__\_jinit.cpython-39.pyc matches c:\users\nelson85\documents\devel\open\jpype\jpype\_jinit.py
    # code o
    ...

    With the pyc file, failure...

    ...
    _bootlocale' # <_frozen_importlib_external.SourceFileLoader object at 0x0000015278445A30>
    import 'site' # <_frozen_importlib_external.SourceFileLoader object at 0x00000152783F8F40>
    Python 3.9.0 (tags/v3.9.0:9cf6752, Oct  5 2020, 15:34:40) [MSC v.1927 64 bit (AMD64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    # c:\users\nelson85\documents\devel\open\jpype\jpype\__pycache__\__init__.cpython-39.pyc matches c:\users\nelson85\documents\devel\open\jpype\jpype\__init__.py
    # code object from 'c:\\users\\nelson85\\documents\\devel\\open\\jpype\\jpype\\__pycache__\\__init__.cpython-39.pyc'
    Traceback (most recent call last):
      File "C:\Users\nelson85\Documents\test.py", line 1, in <module>
        import jpype
      File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
      File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
      File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
      File "<frozen importlib._bootstrap_external>", line 790, in exec_module
      File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
      File "c:\users\nelson85\documents\devel\open\jpype\jpype\__init__.py", line 18, in <module>
        import _jpype
      File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
      File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
      File "<frozen importlib._bootstrap>", line 666, in _load_unlocked
      File "<frozen importlib._bootstrap>", line 565, in module_from_spec
      File "<frozen importlib._bootstrap_external>", line 1108, in create_module
      File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
    ImportError: DLL load failed while importing _jpype: A dynamic link library (DLL) initialization routine failed.
    # clear builtins._
    # clear sys.path
    # clear sys.argv
    # clear sys.ps1
    # clear sys.ps2
    # clear sys.last_type
    # clear sys.last_value
    # clear sys.last_traceback
    # destroy jpype
    # clear sys.path_hooks
    # clear sys.path_importer_cache
    # clear sys.meta_path
    # clear sys.__interactivehook__
    # restore sys.stdin
    # restore sys.stdout
    # restore sys.stderr
    ...

    I see the codecs all get loaded earlier in the trace, so that can't be the issue. There were no differences (other than memory addresses) up to this point. So it must be going wrong at bootstrap_external:1108.

    So guessing from the message we failed in.

        def exec_module(self, module):
            """Initialize an extension module"""
            _bootstrap._call_with_frames_removed(_imp.exec_dynamic, module)
            _bootstrap._verbose_message('extension module {!r} executed from {!r}',
                             self.name, self.path)

    After that point it gets really fuzzy. I would guess we are going to import.c#_imp_create_dynamic_impl which takes us to _PyImport_LoadDynamicModuleWithSpec. But I will need some time to read code to see if this guess is correct.

    zooba commented 3 years ago

    You'll find it leads up to a LoadLibraryExW call in Python/dynload_win.c that is failing with ERROR_DLL_INIT_FAILED (1114) in GetLastError(). From that point on, it's looking solely at _jpype.pyd and its dependencies, and it could be any of them failing to load.

    You can rule out anything that's already loaded at that stage, since it won't be calling the initialization routine. And anything in your module initialization would cause an error later than this, so it's clean too.

    However, I think some static variables may be initialised at this point? So if you've got any statics, those might be working too hard (and statics in C++ can easily do too much work).

    8d2ceb9f-caae-40fc-87bd-93204db558b1 commented 3 years ago

    Well that gives me a place to search from. Unfortunately statics are not likely the source of the issue. I scrubbed all C++ structures from the project into a big structure which has only a single global pointer which I initialize in the init method. It is possible that one crept back in at some point, but it is a possibility. The use of actual C++ in this project is nearly zero. Other than exceptions, strings and the vector on the buried in some class, the rest has been purged already.

    8d2ceb9f-caae-40fc-87bd-93204db558b1 commented 3 years ago

    I attempted another deep dive on this one.

    1) Removed manual library add platform_specific['libraries'] = ['Advapi32'] No change.

    2) Compared symbol wise imports Two imports changed PyIndex_Check PyObject_CheckBuffer plus one additional "fflush" dependency in api-ms-win-crt-stdio-l1-1-0.dll

    3) Compared symbol wise exports No changes detected.

    4) Disassembled both binaries Other than a addressing that is about a 0x1000 change in the relocation tables. Guessing that this is just a change in the number of pages. There were functional changes likely related to Python macros differences.

    5) Removed all references to printf and flush to remove api-ms-win-crt-stdio-l1-1-0.dll dependency. No change.

    6) Scanned the changelog and Code changes for the past year in Python/dynload_win.c.
    Nothing appears likely to trigger this change.

    7) Deleted the _jpype module entirely to verify that the error message is related to the copy that I am working on and not some other copy on system. Error message changes to "ModuleNotFoundError: No module named '_jpype'"

    8) Scanned the source code for static variables. None detected thus far. But it is a many hours long process. Structurally these should not exist based on the module design.

    Statics found thus far. 2 jobject and jmethodID pointer types One C++ class called PLATFORM_ADAPTER which is the required routine determining whether to load jvm.dll or jvm.so using LoadLibrary or dlopen or shload. The global pointer to the context C++ class; a bunch of statics related to logging when enabled. Python C style structure definitions.

    Everything was a primitive or pointer with the exception of one class. All pointers I found were initialized to constant values.

    I rewrote the PLATFORM_ADAPTER to use a pointer rather than be initialized at load time. No joy.

    (after scanning through 17k lines of code take break and ponder life)

    It is certainly possible I missed something here but it looks pretty unlikely this is the source of the error.

    9) Verified that the behavior does not occur on Python 3.8.5 with the current source code.
    Pass. This only happens on Python 3.9.0

    10) Scanned the linker line for differences. One major difference. In Python 3.8 the linker line contains quotes around all the arguments while in Python 3.9 the linker line is missing the quotes on all /LIBPATH statements. However this must be a logging difference as the statement without the quotes would not work.

    11) Manually ran linker command.
    Nothing changed, same behavior. So link command is a red herring.

    12) Profit???

    Okay I have once again exhausted everything that I can think of on the JPype side. So that leaves something in the build process that does not appear in the logs, a symbol conflict that so how only triggers on one path, or Python3.9.0 has a corrupt build in the wild.

    Steve is there anything else that I can try here?

    8d2ceb9f-caae-40fc-87bd-93204db558b1 commented 3 years ago

    Just for reference here are all the dependencies that _jpype has

    MSVCP140.dll                                                                                                                                                                python39.dll                                                                                                                                                                KERNEL32.dll                                                                                                                                                                VCRUNTIME140_1.dll                                                                                                                                                          VCRUNTIME140.dll                                                                                                                                                            api-ms-win-crt-runtime-l1-1-0.dll                                                                                                                                           api-ms-win-crt-string-l1-1-0.dll                                                                                                                                            api-ms-win-crt-heap-l1-1-0.dll

    So it isn't like there are a whole lot of things that I am doing that are likely to trigger a dependency problem.

    As far as I can see the only items that are not already on python39.dll dependency list are VCRUNTIME140_1.dll and MSVCP140.dll. But if that were the case then surely other projects would also be hitting this bug. So this error message makes no sense at all.

    cbbc42d0-1c72-4dd1-93d4-8553b09ac1c0 commented 3 years ago

    You may need to inject a LoadLibraryExW detour into your python runtime before _jpype is loaded and output all the library names its requesting.

    You may need to detour all Load Library functions for maximum coverage.

    8d2ceb9f-caae-40fc-87bd-93204db558b1 commented 3 years ago

    Can you be so kind as pointing me to a LoadLibraryExW detour example for Python? I have shimmed a DLL before to capture transaction logs in the past, but not with a Python or a system library.

    cbbc42d0-1c72-4dd1-93d4-8553b09ac1c0 commented 3 years ago

    https://www.microsoft.com/en-us/research/project/detours/

    8d2ceb9f-caae-40fc-87bd-93204db558b1 commented 3 years ago

    I used Detours with trclnk64 to get the following log:

    20201218193836960 4332 50.60: trclnk64: 001   GetProcAddress(,) -> 7ffc4ccebef0
    20201218193836960 4332 50.60: trclnk64: 001   GetProcAddress(7ffc4ccd0000,LCMapStringEx)
    20201218193836960 4332 50.60: trclnk64: 001   GetProcAddress(,) -> 7ffc4cce6640
    ======== <= This is the point where it differs from a clean run.
    20201218193838163 4332 50.60: trclnk64: 001   GetProcAddress(7ffc4c710000,FlsFree)
    20201218193838163 4332 50.60: trclnk64: 001   GetProcAddress(,) -> 7ffc4c78b000
    20201218193838163 4332 50.60: trclnk64: 001   FAIL LoadLibraryExW(c:\users\nelson85\documents\devel\open\jpype\_jpype.cp39-win_amd64.pyd,0,1100)
    20201218193838164 4332 50.60: trclnk64: 001 LoadLibraryExW(,,) -> 0

    It really is nothing to go on... Everything succeeds, then we get an extra successful request for FlsFree then it immediately returns with a fail.

    In the successful run (without the cache) we get...

    20201218192451068 20440 50.60: trclnk64: 001   GetProcAddress(7ffc4ccd0000,LCMapStringEx)
    20201218192451068 20440 50.60: trclnk64: 001   GetProcAddress(,) -> 7ffc4cce6640
    20201218192451068 20440 50.60: trclnk64: 001 LoadLibraryExW(,,) -> 7ffc349f0000
    20201218192451068 20440 50.60: trclnk64: ### 349f0000: c:\users\nelson85\documents\devel\open\jpype\_jpype.cp39-win_amd64.pyd 00000000
    20201218192451068 20440 50.60: trclnk64: MSVCP140.dll [7ffc34a7af48 7ffc34a5a0c8]
    20201218192451068 20440 50.60: trclnk64:   ?_Lock@?$basic_streambuf@DU?$char_traits@D@std@@@std@@UEAAXXZ    0 7ffc36f22350
    20201218192451068 20440 50.60: trclnk64:   ?_Unlock@?$basic_streambuf@DU?$char_traits@D@std@@@std@@UEAAXXZ    0 7ffc36f22350

    So the failure appears to be internal to the call before we load MSVCP140.dll. Unfortunately, I am way outside my element here as a Linux/Java/CPython programmer. My next guess would be to instrument the set system error and if we are within LoadLibrary then execute "int *i=0; *i=0;" and hope the debugger would get me the point of failure so that I can examine the stack trace to find out where we are. But if you have another direction I would be happy to hear it. Is there some function I should add to trclnk64 to that may have gone wrong? It must be something that fetched the address of LCMapStringEx, but I have no clue what that may be.

    This really does not look like a JPype bug. Yes, two minutes in the future from where the car ran over the pedestrian, he was going to run into the bank and demand money (monkey patch and other bad things). But at the point in time the car hit, the pedestrian was just minding his own business.

    cbbc42d0-1c72-4dd1-93d4-8553b09ac1c0 commented 3 years ago

    I was just expecting only detours for LoadLibraryExW (and variants) to find out which dll failed.

    8d2ceb9f-caae-40fc-87bd-93204db558b1 commented 3 years ago

    The last libraries loaded prior to the failure were...

    20201218192451066 20440 50.60: trclnk64: api-ms-win-eventing-provider-l1-1-0.dll [7ffc4c974108 7ffc4c8b7808]
    20201218192451066 20440 50.60: trclnk64:   EventUnregister    0 7ffc4eab37a0
    20201218192451066 20440 50.60: trclnk64:   EventRegister    0 7ffc4ea6a640
    20201218192451066 20440 50.60: trclnk64:   EventActivityIdControl    0 7ffc4eac6190
    20201218192451066 20440 50.60: trclnk64:   EventWriteTransfer    0 7ffc4eab2cf0
    20201218192451066 20440 50.60: trclnk64:   EventSetInformation    0 7ffc4ea6a3e0
    20201218192451066 20440 50.60: trclnk64: api-ms-win-core-apiquery-l1-1-0.dll [7ffc4c9740e8 7ffc4c8b77e8]
    20201218192451066 20440 50.60: trclnk64:   ApiSetQueryApiSetPresence    0 7ffc4ead5030
    20201218192451066 20440 50.60: trclnk64: api-ms-win-core-apiquery-l1-1-1.dll [7ffc4c9740f8 7ffc4c8b77f8]
    20201218192451066 20440 50.60: trclnk64:   ApiSetQueryApiSetPresenceEx    0 7ffc4eb35730

    And as far as I can tell that was successful. Perhaps there is another library at this point, but my reading of the log is that LoadLibraryExW was called run though all required elements to load, got the end section before it started to load dependencies of _jpype, started fetching things that have to do with locale and then hit a failure point inside of LoadLibrary. If I look at the successful trace then I see the next action is to start linking in C++ symbols.

    But perhaps I am misreading... Looking farther back in the log, the last LoadLibrary to get called before the failure is.

    20201218192451054 20440 50.60: trclnk64: 001   LoadLibraryExW(api-ms-win-core-fibers-l1-1-1,0,800)
    20201218192451054 20440 50.60: trclnk64: 001   LoadLibraryExW(,,) -> 7ffc4c710000
    20201218192451054 20440 50.60: trclnk64: ### 4c710000: C:\WINDOWS\System32\KERNELBASE.dll 002b306b
    20201218192451054 20440 50.60: trclnk64: ntdll.dll [7ffc4c974138 7ffc4c8b7838]

    Fibers? I have no clue at this point. Is this a Python bug or a windows bug?

    eryksun commented 3 years ago

    ImportError: DLL load failed while importing _jpype: A dynamic link library (DLL) initialization routine failed.

    With loader snaps enabled for python[_d].exe (i.e. loader debug messages), you can attach a debugger to discover which DLL's init routine is failing. For example, you'll see a debug message like the following:

    xxxx:xxxx @ xxxxxxxx - LdrpInitializeNode - ERROR:
    Init routine XXXXXXXXXXXXXXXX for DLL "Path\To\DLL"
    failed during DLL_PROCESS_ATTACH

    The entry point of a DLL is typically _DllMainCRTStartup, which initializes the CRT for the DLL and calls DllMain(). In any case, you can find the relative address of the entry point in the PE header, via dumpbin.exe or the debugger !dh command. Set a breakpoint on it and step through to find where it fails in comparison with a working environment.

    zooba commented 3 years ago

    Everything succeeds, then we get an extra successful request for FlsFree then it immediately returns with a fail.

    That makes it sound like the CRT is uninitialising itself, so this may be a conflict between statically linked vcruntime140.dll and dynamically linked. More likely to occur with msvcp140.dll in the mix, since CPython does not include it.

    I'm not seeing full paths for the dependent DLLs, so I have no way to know whether it's mixing and matching versions. CPython installs include vcruntime140.dll, which should be the one loaded by python.exe. However, there are some extension modules that statically link what is required from it, and that should lead to multiple FLS calls when the embedded CRT initialises/uninitialises its state. That will happen during DLL load, but normally fails with a more obvious error than what you're getting.

    However, if you've got a different version msvcp140.dll, it might be managing to load a mismatched vcruntime140.dll. Or something that's happening during initialisation (which should include C++ stuff) is failing in a way that doesn't bubble out properly.

    Looking through "C:\Program Files (x86)\Windows Kits\10\Source\10.0.19041.0\ucrt\internal\initialization.cpp" (on my machine at least, yours may be different), there's *a lot* that could fail.

    If you're able to coerce a _jpype build that links to vcruntime140d.dll/msvcp140d.dll (the debug versions, typically specified with /MDd instead of /MD to the compiler - might require patching setuptools though), and use a CPython debug build, you may get a more informative failure. If not, I can try and find someone from the C++ libraries team at Microsoft to help out, but that's not going to happen until early January at this point of the year.

    zooba commented 3 years ago

    Looking at the winapi_thunks.cpp source file from the CRT, it may be that LCIDToLocaleName is failing to be loaded for some reason. That's the one that comes after LCMapStringEx, so if it's failing, that could abort the whole process.

    Alternatively, LCMapStringEx might be being called immediately after loading it and failing, which would explain why we don't see a failed LoadLibrary entry (which I'd expect if we're just loading lots of entry points here). I don't see any calls in the CRT itself, so perhaps during initialization of something in the C++ runtime? Might be due to your locale settings?

    More ideas than answers, I'm afraid.

    cbbc42d0-1c72-4dd1-93d4-8553b09ac1c0 commented 3 years ago

    Msvcp140.dll from what I can find is part of the VS 2015 Redstributable package.

    8d2ceb9f-caae-40fc-87bd-93204db558b1 commented 3 years ago

    Eryk,

    Unfortunately, this particular bug is resistant to attaching the debugger. Whenever the debugger is attached the bug disappears. Fortunately the suggestion to use Detours by WildCard65 appears to offer a way to diagnose the issue.

    8d2ceb9f-caae-40fc-87bd-93204db558b1 commented 3 years ago

    I looked more at the logs to see if I can find out what to instrument next. The log files unfortunately don't diff well because every line contains a timestamp so I can't a proper alignment as well as all the real addresses. So I wrote a short program to convert all the numbers into an "X".

    After doing so I find there is a single difference in the process between loading from a py file and a pyc file.

    ##### import _jpype ## Only if loaded from a Py file trclnk64: 001 GetProcAddress(7ffc4c710000,FlsGetValue) X X X trclnk64: 001 GetProcAddress(,) -> X ## All X X X trclnk64: 001 LoadLibraryExW(C:\Users\nelson85\AppData\Local\Programs\Python\Python39\python3,0,1000) X X X trclnk64: 001 LoadLibraryExW(,,) -> X X X X trclnk64: ### X C:\Users\nelson85\AppData\Local\Programs\Python\Python39\python3.DLL 000140bc #### So here is the call that fails.... X X X trclnk64: 001 LoadLibraryExW(c:\users\nelson85\documents\devel\open\jpype\_jpype.cp39-win_amd64.pyd,0,1100) X X X trclnk64: 001 LoadLibraryExW(api-ms-win-core-synch-l1-2-0,0,800) X X X trclnk64: 001 LoadLibraryExW(,,) -> X X X X trclnk64: ### X C:\WINDOWS\System32\KERNELBASE.dll 002b306b X X X trclnk64: ntdll.dll [7ffc4c974138 X

    So the problem is something that happened before the LoadLibraryExW call when python3.dll was accessed. The correct path called FlsGetValue while the incorrect one skipped it which lead to a failure.

    Thus my conclusion is that the vehicle left the edge of the cliff just prior to LoadLibraryExW(_jpype) and hit the ocean when C++ runtime library was requested.

    Does that point to anything we can use to solve this?

    cbbc42d0-1c72-4dd1-93d4-8553b09ac1c0 commented 3 years ago

    I recommend first doing a capture of these functions first, incase Windows is routing some through them:

    LoadLibrary(Ex)(W|A)

    W is the Unicode variant while A is the Ascii variant.

    zooba commented 3 years ago

    I recommend first doing a capture of these functions first

    I don't think that'll be necessary - the logging output ought to be enough. Capturing or redirecting those functions would be great if the debugger was working, but since that spoils it, it's not.

    Does that point to anything we can use to solve this?

    It would be great to know what's calling FlsGetValue there. Obviously something within the CRT, since CPython doesn't use fibers anywhere. I guess it's probably some kind of locale settings that are only necessary when opening a file in text mode? But I don't know that we do that specially here.

    Is there an option to capture full stack traces?

    8d2ceb9f-caae-40fc-87bd-93204db558b1 commented 3 years ago

    I will look through the list of samples in Detours to see if there is something that can give us better stacktrace information. If we really need a stack trace I can make Detours bust by creating an intentional segfault and see if we can attach the debugger after the source of the problem has been triggered. Unfortunately as my windows programming knowledge is all from 1994 I may be missing something obvious.

    In the meantime I am upgrading the sample with some more hooks for any other potential LoadLibrary variants and FlsGetValue. Though given the GetProcAddress calls I think that it must be something in the chain leading up so the variants are likely not envolved.

    The other thing I can do is look at instrumenting SetErrorMode so that I can give myself a marker of when we are entering and leaving the dyload code.

    
            /* Don't display a message box when Python can't load a DLL */
            old_mode = SetErrorMode(SEM_FAILCRITICALERRORS);
    
            /* bpo-36085: We use LoadLibraryEx with restricted search paths
               to avoid DLL preloading attacks and enable use of the
               AddDllDirectory function. We add SEARCH_DLL_LOAD_DIR to
               ensure DLLs adjacent to the PYD are preferred. */
            Py_BEGIN_ALLOW_THREADS
            hDLL = LoadLibraryExW(wpathname, NULL,
                                  LOAD_LIBRARY_SEARCH_DEFAULT_DIRS |
                                  LOAD_LIBRARY_SEARCH_DLL_LOAD_DIR);
            Py_END_ALLOW_THREADS
    #if !USE_UNICODE_WCHAR_CACHE
            PyMem_Free(wpathname);
    #endif /* USE_UNICODE_WCHAR_CACHE */
    
            /* restore old error mode settings */
            SetErrorMode(old_mode);

    I will post back once I get some results.

    8d2ceb9f-caae-40fc-87bd-93204db558b1 commented 3 years ago

    Gave another series of shots at the problem. So the reason that attaching the debugger fixes the issue is that it calls LoadLibraryExW for a bunch libraries when it attaches.

    I was also able to unconditionally trigger the bug even when loading without the pyc. It appears that calling "import _ssh" will cause whatever conflict is happening to trigger immediately.

    Other than than progress was minimal. Attaching the debugger causes so many changes to the control flow that I can't get the bad path to trigger and I can't get the debugger to attach once you start down the path. I tried creating a segfault but again I can't get the debugger to attach because Detours is interfering with the attachment process.

    I also tried disabling some of the flags that JPype was adding to the python build process to see if any are related. "/DEBUG" on the linker does not do anything to the problem. The other options were '/Zi', '/EHsc', '/std:c++14'. Nothing here is out of the ordinary.

    I do not see anything in the link line that shows evidence of static linkage or likely to cause a conflict. The exact same line is being used with Python 3.8.

    link.exe /nologo /INCREMENTAL:NO /LTCG /DLL /MANIFEST:EMBED,ID=2 /MANIFESTUAC:NO /LIBPATH:C:\Users\nelson85\AppData\Local\Programs\Python\Python39\libs /LIBPATH:C:\Users\nelson85\AppData\Local\Programs\Python\Python39\PCbuild\amd64 /LIBPATH:C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29333\ATLMFC\lib\x64 /LIBPATH:C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29333\lib\x64 /LIBPATH:C:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\lib\um\x64 /LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.18362.0\ucrt\x64 /LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.18362.0\um\x64 /LIBPATH:C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29333\ATLMFC\lib\x64 /LIBPATH:C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29333\lib\x64 /LIBPATH:C:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\lib\um\x64 /LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.18362.0\ucrt\x64 /LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.18362.0\um\x64 /EXPORT:PyInit__jpype $(OBJECTS) /OUT:build\lib.win-amd64-3.9\_jpype.cp39-win_amd64.pyd /IMPLIB:build\temp.win-amd64-3.9\Release\native\common\_jpype.cp39-win_amd64.lib

    So I remain at a loss how to proceed forward. Nothing has changed in JPype or its symbols since Python 3.8 so this is clearly taking place in Python or at the OS level. Is it possible that Python shipped with bad dlls for the 3.9.0 build? If so then looking at the source code may be the wrong approach. After all depending on the order of libraries I trigger or don't trigger the bug.

    I should note that the difference between loading successfully py and pyc seems to have been triggered in PC/getpathc.c .... could it be the change to remove loading api-ms-win-core-path?

     static void
     join(wchar_t *buffer, const wchar_t *stuff)
     {
    -    if (_PathCchCombineEx_Initialized == 0) {
    -        HMODULE pathapi = LoadLibraryExW(L"api-ms-win-core-path-l1-1-0.dll", NULL,
    -                                         LOAD_LIBRARY_SEARCH_SYSTEM32);
    -        if (pathapi) {
    -            _PathCchCombineEx = (PPathCchCombineEx)GetProcAddress(pathapi, "PathCchCombineEx");
    -        }
    -        else {
    -            _PathCchCombineEx = NULL;
    -        }
    -        _PathCchCombineEx_Initialized = 1;
    -    }
    -
    -    if (_PathCchCombineEx) {
    -        if (FAILED(_PathCchCombineEx(buffer, MAXPATHLEN+1, buffer, stuff, 0))) {
    -            Py_FatalError("buffer overflow in getpathp.c's join()");
    -        }
    -    } else {
    -        if (!PathCombineW(buffer, buffer, stuff)) {
    -            Py_FatalError("buffer overflow in getpathp.c's join()");
    -        }
    +    if (FAILED(PathCchCombineEx(buffer, MAXPATHLEN+1, buffer, stuff, 0))) {
    +        Py_FatalError("buffer overflow in getpathp.c's join()");
         }
     }

    Also given JPype is really a generic C++ CPython module (with no special options other than /Zi /EHsc why hasn't anyone else triggered this?

    eryksun commented 3 years ago

    I built JPype normally in release mode and was able to analyze the problem using the platform debugger (cdb or WinDbg). I think the issue is pretty straight forward. The static initializer for classMagic in "native\python\pyjp_class.cpp" is calling PyDict_New(). That's not kosher because the GIL isn't held during the LoadLibraryExW call. It seems to be okay without a cached PYC because the new_dict() call has an available dict on the freelist in this case. In the PYC case, new_dict() happens to instead call PyObject_GC_New, which fails with an access violation. I modified "pyjp_class.cpp" to set the static value of classMagic to NULL and moved the initialization to PyJPClass_new. This resolved the problem. So I think this issue should be closed as a third-party bug.

    8d2ceb9f-caae-40fc-87bd-93204db558b1 commented 3 years ago

    Drat I missed that one when I was scanning for statics because I was focusing C++ resource rather than looking for Python resources. I also wouldn't have expected this to happen on only one platform, but that was my bad.

    Is it possible to improve the error reporting for this? This was a horrible error to trace down at least for a non-windows programmer. At least a mention in the doc as to the symptoms so someone else can google it.

    8d2ceb9f-caae-40fc-87bd-93204db558b1 commented 3 years ago

    I found it. The change that hit JPype is https://bugs.python.org/issue33895

    Thanks, Eryk Sun for figuring this out. I would never have gotten myself.

    zooba commented 3 years ago

    I doubt there's anything more we can do about Windows swallowing an access violation and turning it into a generic error. There's a very low chance you'd have found any notes about it in the docs, so this bug will probably stand as the best available reference for future.

    We also can't really add more checks for every method that assumes the GIL is held, because that will hurt runtime performance for everyone. And I doubt even debug checks would help here - DLL load is pretty well protected, and there's really nothing but a debugger that can diagnose issues anyway.

    I've pushed back in the past on adding a "how to debug apps on Windows" section to our docs, because it is such a huge topic that we can't possibly cover well enough to be useful in all the cases that come up.

    "Don't use functions that require the GIL without holding the GIL" is already documented generally enough. So I'm glad we got this figured out, but I'm not sure there's anything we can fix long-term to prevent it (other than entirely replacing the C API ;) ).

    8d2ceb9f-caae-40fc-87bd-93204db558b1 commented 3 years ago

    Okay, well at least now googling Python + "A dynamic link library (DLL) initialization routine failed." give something which could point a user may be able to identify the issue. It wasn't obvious to me that imports did not hold the GIL, but it is clear in retrospect that it should have been.

    Thanks very much and sorry for the wild goose chase.