llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

[LLDB] Most tests fail on Windows when built for debug #51272

Open amccarth-google opened 3 years ago

amccarth-google commented 3 years ago
Bugzilla Link: 51930
Version: unspecified
OS: Windows NT
CC: @JDevlieghere

Extended Description

A variety of configuration and design decisions have led to a situation where, if you build for debug (i.e., -DCMAKE_BUILD_TYPE=Debug) and try to run the tests on Windows, most of them fail.

Steps to Reproduce

cmake -GNinja -DCMAKE_BUILD_TYPE=Debug -DLLVM_ENABLE_PROJECTS="clang;lld;lldb" -DLLVM_TARGETS_TO_BUILD=X86 ..\..\llvm-project\llvm

ninja check-lldb

Result

Over 900 of the tests will fail.

Many tests will give a Python stack trace with an access violation or this error:

ImportError: cannot import name '_lldb' from partially initialized module 'lldb' (most likely due to a circular import) (D:\src\llvm\build\ninja_dbg\Lib\site-packages\lldb\__init__.py)

(These are actually the best clue as to what's going on.)

For many more tests, the generated output is wrong, so the apparent symptom is a FileCheck mismatch. (If you try to reproduce these outside of lit, the output will likely be correct and the test will pass. See Workaround for details.)

A few tests hang.

Cause

Two different versions of Python end up in the same process, each built against a different version of the C run-time library DLLs.

Here's how this happens:

  1. Ninja starts a process running Lit in the regular Python interpreter.

  2. Lit starts a process to run dotest in a Python interpreter.

  3. The dotest.py script imports the lldb module.

  4. The lldb module's SWIG-generated __init__ in turn tries to import _lldb.

  5. In release builds _lldb is a Windows DLL called _lldb.pyd produced from the SWIG bindings. In debug builds, the DLL is called _lldb_d.pyd.

    After SWIG 3.0.9, the template used to generate the lldb module's __init__ had to change to accommodate newer versions of SWIG. As a result, __init__ no longer distinguishes between _lldb and _lldb_d.

    Our CMake builds originally adapted by creating a filesystem link from _lldb.pyd to the actual _lldb.pyd or _lldb_d.pyd as appropriate. This didn't work reliably (possibly because of differences in the implementations of symlink from GnuWin32 and git).

    Nowadays the correct DLL is copied to _lldb.pyd. Note, however, that the copy can silently fail. See Notes.

  6. Using the now-loaded lldb module to get the SBAPI, dotest.py creates an instance of LLDB (the actual debugger), which runs in the same process as dotest.py.

  7. That LLDB instance has its own statically-linked Python interpreter embedded. Thus the process now has two instances of Python: one running dotest.py and one inside the LLDB instance.

If those two instances don't match, e.g., if one is "release" and the other is "debug", or one is 3.7 and the other is 3.8, misery ensues.
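To see which kind of interpreter is running dotest.py, a small diagnostic like the following may help. It is only a sketch: it assumes CPython, relies on the fact that sys.gettotalrefcount exists only in debug builds, and the helper names (is_debug_interpreter, expected_binding_name, describe_interpreter) are hypothetical. It reports on the interpreter that runs the script, not on the one embedded in LLDB, so it covers only one half of the mismatch.

```python
import importlib.machinery
import sys


def is_debug_interpreter() -> bool:
    # sys.gettotalrefcount is only defined in debug builds of CPython.
    return hasattr(sys, "gettotalrefcount")


def expected_binding_name(debug: bool) -> str:
    # Hypothetical helper following the _lldb.pyd / _lldb_d.pyd naming
    # described in step 5.
    return "_lldb_d.pyd" if debug else "_lldb.pyd"


def describe_interpreter() -> dict:
    # Collect the facts that matter when two interpreters share a process.
    return {
        "executable": sys.executable,
        "version": "%d.%d" % sys.version_info[:2],
        "debug": is_debug_interpreter(),
        # On Windows debug builds this list contains suffixes ending in "_d.pyd".
        "extension_suffixes": list(importlib.machinery.EXTENSION_SUFFIXES),
    }


if __name__ == "__main__":
    info = describe_interpreter()
    for key, value in info.items():
        print(f"{key}: {value}")
    print("expected binding:", expected_binding_name(info["debug"]))
```

Running this with both python.exe and python_d.exe should show different values for "debug" and for the extension suffixes, which is the distinction the copied _lldb.pyd hides.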

Workaround

You can exercise the tests with a "release" build, but you will miss some bugs because release builds disable assertions in core LLVM libraries.

For individual dotest.py tests, you can bypass Ninja and Lit and explicitly launch dotest.py in the debug version of Python (i.e., python_d.exe instead of python.exe). For example:

"C:/Program Files/Python38/python_d.exe" \
  D:/src/llvm/llvm-project/lldb/test/API/dotest.py \
  [options elided] \
  -p TestDynamicValue.py

Solution

None found. I recommend we modify our CMake scripts to warn when CMAKE_BUILD_TYPE is Debug, the target platform is Windows, and LLVM_ENABLE_PROJECTS includes lldb.
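The recommendation above is about the CMake scripts themselves. Purely to illustrate the condition, here is a sketch of the same check as a standalone Python script that inspects the contents of a build directory's CMakeCache.txt (whose entries have the form KEY:TYPE=VALUE). The function names parse_cmake_cache and should_warn are hypothetical, and the host platform is passed in as a flag rather than detected.

```python
def parse_cmake_cache(text: str) -> dict:
    """Parse KEY:TYPE=VALUE entries from the text of a CMakeCache.txt file."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the two comment styles used in the cache file.
        if not line or line.startswith(("#", "//")):
            continue
        key_and_type, sep, value = line.partition("=")
        if not sep:
            continue
        key = key_and_type.split(":", 1)[0]
        entries[key] = value
    return entries


def should_warn(cache: dict, host_is_windows: bool) -> bool:
    """True when the problematic combination from this report is configured."""
    is_debug = cache.get("CMAKE_BUILD_TYPE", "").lower() == "debug"
    # Assumption: projects are listed explicitly; a value such as "all"
    # is not handled by this sketch.
    projects = cache.get("LLVM_ENABLE_PROJECTS", "").split(";")
    return host_is_windows and is_debug and "lldb" in projects


if __name__ == "__main__":
    sample = (
        "// Build type\n"
        "CMAKE_BUILD_TYPE:STRING=Debug\n"
        "LLVM_ENABLE_PROJECTS:STRING=clang;lld;lldb\n"
    )
    if should_warn(parse_cmake_cache(sample), host_is_windows=True):
        print("warning: LLDB tests are expected to fail in Debug builds on Windows")
```

A real fix would more likely live in the CMake configuration step, so that the warning appears before the build starts rather than before the tests run.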

Notes

Failure to Copy

The copy of either _lldb.pyd or _lldb_d.pyd to _lldb.pyd can fail. In particular, I've seen this happen when a zombie process from a previous test run holds the older file locked. For reasons I haven't discovered, failure of the copy doesn't fail the build. You're left with a previous build of _lldb.pyd, which can make for difficult-to-debug problems.
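One way to make this kind of failure visible is to verify the copy after it is made. The following is a minimal sketch of that idea in Python, not the build's actual copy step; copy_and_verify is a hypothetical helper, and the file names in the usage section are placeholders written to a temporary directory.

```python
import filecmp
import shutil
import tempfile
from pathlib import Path


def copy_and_verify(src: Path, dst: Path) -> None:
    """Copy src over dst and raise if dst is not byte-identical afterwards."""
    # shutil.copyfile raises OSError (for example PermissionError) when the
    # destination cannot be written, such as when another process has it locked.
    shutil.copyfile(src, dst)
    # shallow=False forces a comparison of file contents, not just os.stat data.
    if not filecmp.cmp(src, dst, shallow=False):
        raise RuntimeError(f"{dst} does not match {src} after copy")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "_lldb_d.pyd"
        dst = Path(tmp) / "_lldb.pyd"
        src.write_bytes(b"fresh build")
        dst.write_bytes(b"stale build")
        copy_and_verify(src, dst)
        print("copy verified:", dst.read_bytes() == src.read_bytes())
```

The point of the design is that a failed or partial copy becomes an error the build can report, instead of leaving a stale _lldb.pyd in place.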

Python Detection Churn

In the past year or two, we've had a lot of churn in how CMake finds Python for LLVM generally and for LLDB specifically. In hindsight, I think a lot of the problems I experienced with those changes were because the test process ended up with two different versions of Python, each linked against a different version of the CRT, even when they were both release builds.

vasily-v-ryabov commented 1 year ago

I've faced almost the same issue, but at an earlier stage. In the Debug solution configuration I used swig==4.1.1 with an installed Python 3.11.0 (not built from source) and generated a VS 2022 solution with LLDB Python scripting enabled. The build failed because python311_d.lib was not found (of course, it's not installed).

Another interesting thing is that the MLIR Python bindings use pybind11==2.10.3 successfully with python311.lib even though it is a Debug LLVM build. I haven't dug deeply enough into either the LLDB or the MLIR Python bindings yet. Maybe migrating from SWIG to pybind11 makes sense, maybe not. It could reduce the number of dependencies for the whole LLVM project, but the risks have not been analyzed. The MLIR Python bindings are still not official and not published to PyPI. So all of this requires plenty of work.