
Test flake in lldb concurrent breakpoint tests #111583

Open ilovepi opened 2 weeks ago

ilovepi commented 2 weeks ago

We're seeing some LLDB tests flake in our CI. Given that these are concurrency tests, I assume there is a data race or some missing synchronization.

Flaky tests:

- lldb-api :: functionalities/thread/concurrent_events/TestConcurrentSignalWatchBreak.py
- lldb-api :: functionalities/thread/concurrent_events/TestConcurrentSignalNWatchNBreak.py

Bots:

- https://ci.chromium.org/ui/p/fuchsia/builders/toolchain.ci/lldb-linux-arm64/b8734630228996969777/infra
- https://ci.chromium.org/ui/p/fuchsia/builders/toolchain.ci/lldb-linux-arm64/b8734618131611235377/overview

Error output:

******************** TEST 'lldb-api :: functionalities/thread/concurrent_events/TestConcurrentSignalWatchBreak.py' FAILED ********************
Script:
--
/b/s/w/ir/x/w/install-cpython-aarch64-linux-gnu/bin/python3 /b/s/w/ir/x/w/llvm-llvm-project/lldb/test/API/dotest.py -u CXXFLAGS -u CFLAGS --env ARCHIVER=/b/s/w/ir/x/w/cipd/clang/bin/llvm-ar --env OBJCOPY=/b/s/w/ir/x/w/cipd/clang/bin/llvm-objcopy --env LLVM_LIBS_DIR=/b/s/w/ir/x/w/llvm_build/./lib --env LLVM_INCLUDE_DIR=/b/s/w/ir/x/w/llvm_build/include --env LLVM_TOOLS_DIR=/b/s/w/ir/x/w/llvm_build/./bin --arch aarch64 --build-dir /b/s/w/ir/x/w/llvm_build/lldb-test-build.noindex --lldb-module-cache-dir /b/s/w/ir/x/w/llvm_build/lldb-test-build.noindex/module-cache-lldb/lldb-api --clang-module-cache-dir /b/s/w/ir/x/w/llvm_build/lldb-test-build.noindex/module-cache-clang/lldb-api --executable /b/s/w/ir/x/w/llvm_build/./bin/lldb --compiler /b/s/w/ir/x/w/cipd/clang/bin/clang --dsymutil /b/s/w/ir/x/w/llvm_build/./bin/dsymutil --llvm-tools-dir /b/s/w/ir/x/w/llvm_build/./bin --lldb-obj-root /b/s/w/ir/x/w/llvm_build/tools/lldb --lldb-libs-dir /b/s/w/ir/x/w/llvm_build/./lib --skip-category=pexpect /b/s/w/ir/x/w/llvm-llvm-project/lldb/test/API/functionalities/thread/concurrent_events -p TestConcurrentSignalWatchBreak.py
--
Exit Code: 1

Command Output (stdout):
--
lldb version 20.0.0git (https://llvm.googlesource.com/a/llvm-project revision 8ab77184dde2583950fc6e4886ff526e7e598f7e)
  clang revision 8ab77184dde2583950fc6e4886ff526e7e598f7e
  llvm revision 8ab77184dde2583950fc6e4886ff526e7e598f7e
Skipping the following test categories: ['pexpect', 'dsym', 'gmodules', 'debugserver', 'objc']

Watchpoint 1 hit:
old value: 0
new value: 1

--
Command Output (stderr):
--
FAIL: LLDB (/b/s/w/ir/x/w/cipd/clang/bin/clang-aarch64) :: test (TestConcurrentSignalWatchBreak.ConcurrentSignalWatchBreak.test)
======================================================================
FAIL: test (TestConcurrentSignalWatchBreak.ConcurrentSignalWatchBreak.test)
   Test a signal/watchpoint/breakpoint in multiple threads.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/b/s/w/ir/x/w/llvm-llvm-project/lldb/packages/Python/lldbsuite/test/decorators.py", line 148, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/b/s/w/ir/x/w/llvm-llvm-project/lldb/test/API/functionalities/thread/concurrent_events/TestConcurrentSignalWatchBreak.py", line 15, in test
    self.do_thread_actions(
  File "/b/s/w/ir/x/w/llvm-llvm-project/lldb/packages/Python/lldbsuite/test/concurrent_base.py", line 333, in do_thread_actions
    self.assertEqual(
AssertionError: 1 != 2 : Expected 1 stops due to signal delivery, but got 2
Config=aarch64-/b/s/w/ir/x/w/cipd/clang/bin/clang
----------------------------------------------------------------------
Ran 1 test in 3.160s

FAILED (failures=1)

--

********************

https://github.com/llvm/llvm-project/issues/39394 seems to be a similar report. @JDevlieghere, is this a known problem?

llvmbot commented 2 weeks ago

@llvm/issue-subscribers-lldb

Author: Paul Kirth (ilovepi)

jimingham commented 2 weeks ago

The breakpoint counting in these tests has been flaky, but for a known reason: we weren't distinguishing between "this thread executed the breakpoint trap" and "the process stopped while this thread happened to have its PC on the trap instruction, but hadn't executed it yet", which could lead to miscounting breakpoint hits.
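To make that distinction concrete, here is a toy illustration (not lldb's actual implementation; both inputs are invented for the example): a PC sitting on a breakpoint address is necessary but not sufficient evidence of a hit.

```python
def thread_hit_breakpoint(pc, breakpoint_addrs, executed_trap):
    # Toy illustration, not lldb code. `executed_trap` stands in for
    # whatever evidence the debugger has that this thread actually ran
    # the trap instruction, as opposed to merely being stopped on it
    # because some other thread caused the process-wide stop.
    return pc in breakpoint_addrs and executed_trap
```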

But I'm not sure how you'd get miscounted signals. What the test actually counts is the number of stops in the debugger where some thread had a stop reason of "signal". The test itself sends only one SIGUSR per signal thread, and it creates only one signal thread. So either that signal is being re-sent (which seems unlikely, but signals are weird) and we're legitimately reporting two signal stops, or we're incorrectly preserving the signal stop reason across two stops.
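A minimal sketch of that counting, assuming the lldb Python API in synchronous mode (this is not the actual concurrent_base.py code, and `hit_exit_breakpoint` is a hypothetical helper standing in for the test's real loop condition):

```python
import lldb

def is_signal_stop(process):
    """True if any thread in the current stop reports a signal stop reason."""
    return any(
        t.GetStopReason() == lldb.eStopReasonSignal for t in process.threads
    )

# `process` is assumed to be an lldb.SBProcess from an existing session.
signal_stops = 0
while process.GetState() == lldb.eStateStopped and not hit_exit_breakpoint(process):
    if is_signal_stop(process):
        signal_stops += 1
    process.Continue()  # in synchronous mode this blocks until the next stop
```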

We clear a thread's stop reason the next time that thread is given a chance to run. We don't know or care whether it actually ran: we clear the stop reason when we tell the thread it can run, and then resume the process. However, if we don't allow a thread to run when we resume the process, we preserve its stop info, since that really is the last state of that thread...
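That rule, as illustrative Python rather than lldb's actual C++ (the class and method names are invented for the example):

```python
class ThreadModel:
    """Illustrative stand-in for lldb's per-thread state, not real lldb code."""

    def __init__(self):
        self.stop_info = None  # why this thread stopped last time

    def will_resume(self, allowed_to_run):
        # Called for each thread just before the process is resumed.
        if allowed_to_run:
            # The thread is being given a chance to run, so its old stop
            # reason no longer describes its state; clear it even if the
            # thread never actually gets scheduled before the next stop.
            self.stop_info = None
        # Otherwise the thread stays suspended across this resume, and its
        # last stop info really is still its current state, so keep it.
```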

But in this test, the only time we suspend threads is while stepping over breakpoints. We do that by suspending all the other threads and letting only the breakpoint thread run one instruction; then we put the trap back in place and run all the threads again, without returning control to the user. So I can't see how that stop, with the preserved signal stop info, could leak to the user.
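The sequence described above, as hedged pseudocode (the real logic lives in lldb's thread plans, e.g. ThreadPlanStepOverBreakpoint; all helper names here are invented):

```python
def step_over_breakpoint(process, bp_thread, bp_site):
    # Illustrative pseudocode only; these method names are hypothetical.
    for t in process.threads:
        if t is not bp_thread:
            t.suspend()        # every other thread stays stopped
    bp_site.remove_trap()      # restore the original instruction
    bp_thread.single_step()    # let only this thread execute one instruction
    bp_site.insert_trap()      # re-arm the breakpoint
    for t in process.threads:
        t.resume()             # run everyone again; the user never sees
                               # this intermediate stop
```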

If we could see the gdb-remote packet log and the lldb step log for a run that fails this way, we should be able to see at least where the error is.
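For reference, both logs can be enabled from an interactive lldb session before reproducing the failure:

```
(lldb) log enable -f /tmp/gdb-remote-packets.log gdb-remote packets
(lldb) log enable -f /tmp/lldb-step.log lldb step
```

(If I recall correctly, dotest.py also has a --channel option that can enable the same log channels when running the API tests, which may be easier on a bot.)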