llvm / llvm-project


[lldb][tests] Sockets leaks in API tests with a remote target #118032

Open slydiman opened 2 hours ago

slydiman commented 2 hours ago

We see unexpected failures in a single, seemingly random test on the lldb-remote-linux-ubuntu and lldb-remote-linux-win buildbots 1-4 times per day.

Error 1: A test is reported as unresolved with the exception "failed to create a socket to the launched debug monitor after 20 tries". We usually get this error on the Linux host (lldb-remote-linux-ubuntu), e.g. in TestGdbRemoteMemoryAllocation.py, TestNonStop.py, and TestGdbRemoteSingleStep.py. We have also seen it, very rarely, on the Windows host (lldb-remote-linux-win): TestGdbRemoteHostInfo.py.

Error 2: A test hits the 600-second timeout. We usually (99% of the time) get this error on the Windows host (lldb-remote-linux-win) in TestModuleLoadedNotifys.py, and less often in other tests, e.g. TestLldbGdbServer.py. We have also seen it, very rarely, on the Linux host (lldb-remote-linux-ubuntu): TestCancelAttach.py.

I believe both errors have the same cause: leaking sockets.

Error 1 is raised in connect_to_debug_monitor() in gdbremote_testcase.py. It picks a random port, 12000 + random.randint(0, 3999), and launches a new instance of lldb-server gdbserver *:port on the target. It then tries to connect to that lldb-server up to 10 times with a 0.5 second delay between attempts, and terminates the lldb-server if the connection fails. It repeats this with another port up to 20 times, sleeping for a random 1-5 seconds between ports to avoid collisions.
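For readers who have not looked at gdbremote_testcase.py, the retry scheme described above can be sketched roughly as follows. This is only an illustration of the description in this issue, not the actual code: `connect_with_retries`, `pick_port`, `CONNECTS_PER_PORT`, `CONNECT_DELAY_SECONDS`, and the `launch_server`/`stop_server` callbacks are hypothetical names, and using `randint(1, 5)` for the back-off is an assumption.

```python
import random
import socket
import time

MAX_ATTEMPTS = 20            # ports to try (value quoted in the issue)
CONNECTS_PER_PORT = 10       # hypothetical name; the issue only gives the count
CONNECT_DELAY_SECONDS = 0.5  # hypothetical name; the issue only gives the delay


def pick_port(rng=random):
    """Pick a port in 12000..15999, as described in the issue."""
    return 12000 + rng.randint(0, 3999)


def connect_with_retries(host, launch_server, stop_server,
                         connect=socket.create_connection,
                         sleep=time.sleep, rng=random):
    """Return (sock, server, port), or raise after MAX_ATTEMPTS ports."""
    for _ in range(MAX_ATTEMPTS):
        port = pick_port(rng)
        server = launch_server(port)  # hypothetical callback: start lldb-server
        for _ in range(CONNECTS_PER_PORT):
            try:
                return connect((host, port)), server, port
            except OSError:
                sleep(CONNECT_DELAY_SECONDS)
        stop_server(server)  # hypothetical callback: terminate lldb-server
        sleep(rng.randint(1, 5))  # random back-off before trying another port
    raise RuntimeError(
        "failed to create a socket to the launched debug monitor "
        "after %d tries" % MAX_ATTEMPTS)
```

In this sketch a complete failure means 20 launched servers and 200 connection attempts, and every successful or failed attempt leaves a closed connection behind on one side, which is consistent with the growing TIME_WAIT counts reported in this issue.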

We checked netstat at the beginning of a test run and found 164 connections in the TIME_WAIT state between the host and the target: 24 to target IP:1234 (platform), 100 to target IP:43107 (gdbserver), and 40 to target IP with a random port. There were also 2 connections in the ESTABLISHED state.

We checked netstat again 15 minutes into the test run and found 641 connections in the TIME_WAIT state between the host and the target: 310 to target IP:1234 (platform) and 331 to target IP:43107 (gdbserver). There were also 9 connections in the ESTABLISHED state.
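The per-port counts above can be reproduced with a small script along these lines. It is a rough sketch, not the tool we actually used: `count_connections` and `TRACKED_STATES` are hypothetical names, and it assumes the `netstat -an` layout where the remote address is the column immediately before the state, which holds for the usual Linux and Windows output.

```python
from collections import Counter

TRACKED_STATES = ("TIME_WAIT", "ESTABLISHED")


def count_connections(netstat_text, target_ip):
    """Count connections to target_ip, keyed by (state, remote port)."""
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        for i, field in enumerate(fields):
            if field in TRACKED_STATES and i >= 1:
                # The remote address is assumed to precede the state column.
                ip, _, port = fields[i - 1].rpartition(":")
                if ip == target_ip and port.isdigit():
                    counts[(field, int(port))] += 1
                break
    return counts
```

One way to feed it is `subprocess.run(["netstat", "-an"], capture_output=True, text=True).stdout`, sampled periodically during a test run so that growth per remote port is visible over time.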

Both buildbots run tests in 8 threads.

Both buildbots use Python 3.12. Note that the results with Python 3.13 are worse, probably because of its incremental GC, and the average build/test time with Python 3.13 is longer.

Increasing MAX_ATTEMPTS = 20 in connect_to_debug_monitor() may be enough to fix Error 1, but I have no idea how to fix, or even debug, Error 2. It is very hard to reproduce.

llvmbot commented 2 hours ago

@llvm/issue-subscribers-lldb

Author: Dmitry Vasilyev (slydiman)

We see unexpected failures in a single, seemingly random test on the [lldb-remote-linux-ubuntu](https://lab.llvm.org/buildbot/#/builders/195) and [lldb-remote-linux-win](https://lab.llvm.org/staging/#/builders/197) buildbots 1-4 times per day.

Error 1: A test is reported as unresolved with the exception `failed to create a socket to the launched debug monitor after 20 tries`. We usually get this error on the Linux host (lldb-remote-linux-ubuntu), e.g. in [TestGdbRemoteMemoryAllocation.py](https://lab.llvm.org/buildbot/#/builders/195/builds/1660), [TestNonStop.py](https://lab.llvm.org/buildbot/#/builders/195/builds/1625), and [TestGdbRemoteSingleStep.py](https://lab.llvm.org/buildbot/#/builders/195/builds/1614). We have also seen it, very rarely, on the Windows host (lldb-remote-linux-win): [TestGdbRemoteHostInfo.py](https://lab.llvm.org/staging/#/builders/197/builds/732).

Error 2: A test hits the 600-second timeout. We usually (99% of the time) get this error on the Windows host (lldb-remote-linux-win) in [TestModuleLoadedNotifys.py](https://lab.llvm.org/staging/#/builders/197/builds/890), and less often in other tests, e.g. [TestLldbGdbServer.py](https://lab.llvm.org/staging/#/builders/197/builds/744). We have also seen it, very rarely, on the Linux host (lldb-remote-linux-ubuntu): [TestCancelAttach.py](https://lab.llvm.org/buildbot/#/builders/195/builds/1402).

I believe both errors have the same cause: leaking sockets.

Error 1 is raised in connect_to_debug_monitor() in gdbremote_testcase.py. It picks a random port, `12000 + random.randint(0, 3999)`, and launches a new instance of `lldb-server gdbserver *:port` on the target. It then tries to connect to that lldb-server up to 10 times with a 0.5 second delay between attempts, and terminates the lldb-server if the connection fails. It repeats this with another port up to 20 times, sleeping for a random 1-5 seconds between ports `to avoid collisions`.

We checked netstat at the beginning of a test run and found 164 connections in the TIME_WAIT state between the host and the target: 24 to `target IP`:1234 (platform), 100 to `target IP`:43107 (gdbserver), and 40 to `target IP` with a random port. There were also 2 connections in the ESTABLISHED state.

We checked netstat again 15 minutes into the test run and found 641 connections in the TIME_WAIT state between the host and the target: 310 to `target IP`:1234 (platform) and 331 to `target IP`:43107 (gdbserver). There were also 9 connections in the ESTABLISHED state.

Both buildbots run tests in 8 threads.

Both buildbots use Python 3.12. Note that the results with Python 3.13 are worse, probably because of its incremental GC, and the average build/test time with Python 3.13 is longer.

Increasing `MAX_ATTEMPTS = 20` in connect_to_debug_monitor() may be enough to fix Error 1, but I have no idea how to fix, or even debug, Error 2. It is very hard to reproduce.

slydiman commented 2 hours ago

@labath, do you have any thoughts/suggestions?