llvm / llvm-project


LLDB TestConcurrentVFork.py is flaky on Linux #85084

Open DavidSpickett opened 8 months ago

DavidSpickett commented 8 months ago

lldb/test/API/functionalities/fork/concurrent_vfork/TestConcurrentVFork.py sometimes fails with a timeout. I've seen this on AArch64 and Arm Linux.

For example https://lab.llvm.org/buildbot/#/builders/17/builds/50450.

TIMEOUT: lldb-api :: functionalities/fork/concurrent_vfork/TestConcurrentVFork.py (2684 of 2684)
******************** TEST 'lldb-api :: functionalities/fork/concurrent_vfork/TestConcurrentVFork.py' FAILED ********************
Script:
--
/usr/bin/python3.8 /home/tcwg-buildbot/worker/lldb-arm-ubuntu/llvm-project/lldb/test/API/dotest.py -u CXXFLAGS -u CFLAGS --env ARCHIVER=/usr/local/bin/llvm-ar --env OBJCOPY=/usr/bin/llvm-objcopy --env LLVM_LIBS_DIR=/home/tcwg-buildbot/worker/lldb-arm-ubuntu/build/./lib --env LLVM_INCLUDE_DIR=/home/tcwg-buildbot/worker/lldb-arm-ubuntu/build/include --env LLVM_TOOLS_DIR=/home/tcwg-buildbot/worker/lldb-arm-ubuntu/build/./bin --arch armv8l --build-dir /home/tcwg-buildbot/worker/lldb-arm-ubuntu/build/lldb-test-build.noindex --lldb-module-cache-dir /home/tcwg-buildbot/worker/lldb-arm-ubuntu/build/lldb-test-build.noindex/module-cache-lldb/lldb-api --clang-module-cache-dir /home/tcwg-buildbot/worker/lldb-arm-ubuntu/build/lldb-test-build.noindex/module-cache-clang/lldb-api --executable /home/tcwg-buildbot/worker/lldb-arm-ubuntu/build/./bin/lldb --compiler /home/tcwg-buildbot/worker/lldb-arm-ubuntu/build/./bin/clang --dsymutil /home/tcwg-buildbot/worker/lldb-arm-ubuntu/build/./bin/dsymutil --llvm-tools-dir /home/tcwg-buildbot/worker/lldb-arm-ubuntu/build/./bin --lldb-libs-dir /home/tcwg-buildbot/worker/lldb-arm-ubuntu/build/./lib /home/tcwg-buildbot/worker/lldb-arm-ubuntu/llvm-project/lldb/test/API/functionalities/fork/concurrent_vfork -p TestConcurrentVFork.py
--
Exit Code: -9
Timeout: Reached timeout of 600 seconds
Command Output (stdout):
--
lldb version 19.0.0git (https://github.com/llvm/llvm-project.git revision e48d5a838f69e0a8e0ae95a8aed1a8809f45465a)
  clang revision e48d5a838f69e0a8e0ae95a8aed1a8809f45465a
  llvm revision e48d5a838f69e0a8e0ae95a8aed1a8809f45465a
--

Inspecting the container afterwards shows that we are using way more PIDs than you'd expect, and we have around 600 processes like:

 tcwg-bu+ 4177936       1  0 Mar11 pts/0    00:00:00 /home/tcwg-buildbot/worker/lldb-arm-ubuntu/build/lldb-test-build.noindex/functionalities/fork/concurrent_vfork/TestConcurrentVFork.test_follow_child_vfork_call_exec/a.out

This container persists, so this could be the result of the test not cleaning up its processes, which then pile up until the system complains, or until someone tries to debug the wrong process and gets no response. On AArch64 I have seen this lead to system resource errors as the leftover processes accumulate.
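For anyone who wants to check their own builder for the same pile-up, here is a minimal sketch that counts leftover test binaries by scanning `/proc` on Linux. The function name `find_leftover_processes`, the `proc_root` parameter, and the marker string are assumptions made for illustration; nothing like this exists in the LLDB test suite as far as this thread shows.

```python
import os


def find_leftover_processes(marker, proc_root="/proc"):
    """Return the sorted PIDs whose command line contains `marker`.

    Hypothetical helper, not part of lldb or lit. Linux only, since it
    relies on the /proc/<pid>/cmdline files.
    """
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                raw = f.read()
        except OSError:
            continue  # the process exited meanwhile, or is not readable
        # Arguments in cmdline are separated by NUL bytes.
        cmdline = raw.replace(b"\0", b" ").decode(errors="replace")
        if marker in cmdline:
            pids.append(int(entry))
    return sorted(pids)
```

Calling `find_leftover_processes("concurrent_vfork/TestConcurrentVFork")` after a test run should list the stray `a.out` processes, assuming their paths look like the one in the `ps` line above.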

I'm going to skip all the tests on these platforms while I look into it.

DavidSpickett commented 8 months ago

Tests added by https://github.com/llvm/llvm-project/commit/8bdddcf0bb5a40e6ce6cbf7fc6b7ce576e2b032d.

DavidSpickett commented 8 months ago

@jeffreytan81 it's definitely leaving processes behind. Do you see the same thing on your machine?

I wonder if we're following the fork and stopping only the child on exit, or following only one of the children and leaving the others and the parent behind.

zeroomega commented 8 months ago

We are seeing similar issues on our Linux x64 and Mac x64 builders. We have been chasing a weird lldb test timeout for a couple of days. The LLDB test step usually finishes in 5 minutes when LIT uses 60 workers, but recently (since around Mar 8) this step has sometimes taken over 1 hour, until the builder was killed due to timeout. We cannot find the specific test that was stuck, as the log shows different unfinished tests each time (TestConcurrentVFork is actually among the tests that PASS).

We came across this GitHub issue and decided to try disabling TestConcurrentVFork on our builders. After that, the timeout issue was gone. We suspect this test is not cleaned up properly after the run and holds on to some resources that other tests need, causing deadlocks, but we cannot verify that.
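As an illustration of what cleaning up after a timed-out run could look like in a harness, here is a minimal POSIX-only sketch that starts a command in its own session and, on timeout, sends SIGKILL to the whole process group. The name `run_in_own_group` is hypothetical, and this is not how lit or dotest.py are known to behave. It also only reaches processes that stay in that group, so it may well miss inferiors that a debugger launches into a group of their own.

```python
import os
import signal
import subprocess


def run_in_own_group(cmd, timeout):
    """Run `cmd` in a new session and return its exit status.

    Hypothetical helper for illustration. On timeout, the whole process
    group is killed, so children left behind by `cmd` that are still in
    the group do not outlive it.
    """
    # start_new_session=True makes the child call setsid(), so it becomes
    # the leader of a new process group whose ID equals its PID.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        os.killpg(proc.pid, signal.SIGKILL)
        # Reap the leader; the status is negative when killed by a signal.
        return proc.wait()
```

A command that finishes in time returns its normal exit status, and one that is killed on timeout returns `-9`, the same exit code that appears in the log at the top of this issue.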

DavidSpickett commented 7 months ago

@jeffreytan81 ping!

jeffreytan81 commented 7 months ago

@DavidSpickett, sorry, GitHub notifications seem to have failed me here. I never got notified about the tagging, and my mailbox did not get any emails...

To answer your question: no, we haven't observed the lingering processes issue, but it is possible that no one has noticed it yet. Since @labath is fixing this issue, I will leave it as is.

DavidSpickett commented 6 months ago

I've just put the skips back: https://github.com/llvm/llvm-project/commit/0c8151ac809c283187e9b19d0cbe72a09c8d74e0

The test is a lot more stable thanks to Pavel's change, but it's still failing often enough to degrade the buildbot results, for example https://lab.llvm.org/buildbot/#/builders/96/builds/56699.

DavidSpickett commented 2 months ago

Saw another failure on a GitHub CI run today: https://buildkite.com/llvm-project/github-pull-requests/builds/103712#01922590-fbb5-4b6e-824f-28d2891523a1

Timed Out Tests (1):
  lldb-api :: functionalities/fork/concurrent_vfork/TestConcurrentVFork.py

I'm going to disable this on Linux as a whole; the noise in CI isn't worth whatever coverage this is giving us.

DavidSpickett commented 2 months ago

The test is now disabled on all Linux platforms. I will not have the time to figure out a solution here, so FYI @jeffreytan81 in case this feature is important to you.