We have been getting unexpected errors in a single, random test on [lldb-remote-linux-ubuntu](https://lab.llvm.org/buildbot/#/builders/195) and [lldb-remote-linux-win](https://lab.llvm.org/staging/#/builders/197) 1-4 times per day.
Error 1: a test is reported as Unresolved with the exception `failed to create a socket to the launched debug monitor after 20 tries`.
We usually get this error on the Linux host (lldb-remote-linux-ubuntu), e.g. [TestGdbRemoteMemoryAllocation.py](https://lab.llvm.org/buildbot/#/builders/195/builds/1660), [TestNonStop.py](https://lab.llvm.org/buildbot/#/builders/195/builds/1625), [TestGdbRemoteSingleStep.py](https://lab.llvm.org/buildbot/#/builders/195/builds/1614). We have also seen it, very rarely, on the Windows host (lldb-remote-linux-win): [TestGdbRemoteHostInfo.py](https://lab.llvm.org/staging/#/builders/197/builds/732).
Error 2: a test hits the 600-second timeout.
Almost always (99%) we get this error on the Windows host (lldb-remote-linux-win) in the test [TestModuleLoadedNotifys.py](https://lab.llvm.org/staging/#/builders/197/builds/890), and less often in other tests, e.g. [TestLldbGdbServer.py](https://lab.llvm.org/staging/#/builders/197/builds/744). We have also seen it, very rarely, on the Linux host (lldb-remote-linux-ubuntu): [TestCancelAttach.py](https://lab.llvm.org/buildbot/#/builders/195/builds/1402).
I believe both issues have the same cause: leaking sockets.
Error 1 is raised in `connect_to_debug_monitor()` in gdbremote_testcase.py.
It picks a random port, `12000 + random.randint(0, 3999)`, and launches a new instance of `lldb-server gdbserver *:port` on the target.
It then tries to connect to that lldb-server up to 10 times with a 0.5 second delay between attempts, and terminates the lldb-server if the connection fails.
It then repeats the whole sequence with another port, up to 20 times in total, with a random delay of 1-5 seconds `to avoid collisions`.
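The retry scheme described above can be sketched roughly as follows. This is not the code from gdbremote_testcase.py: the `launch`, `connect` and `terminate` callables, the name `MAX_CONNECT_ATTEMPTS` and the exact placement of the sleeps are assumptions made for illustration. Only the port formula, the 10 x 0.5 s inner loop, the 20 outer attempts and the 1-5 s delay come from the description.

```python
import random
import time

MAX_ATTEMPTS = 20          # outer loop: how many ports to try
MAX_CONNECT_ATTEMPTS = 10  # inner loop: connection attempts per port (assumed name)
CONNECT_DELAY = 0.5        # seconds between connection attempts


def pick_port(rng=random):
    # Port formula quoted in the report: a port in the range 12000..15999.
    return 12000 + rng.randint(0, 3999)


def connect_with_retries(launch, connect, terminate, sleep=time.sleep, rng=random):
    """Return (port, connection), or raise after MAX_ATTEMPTS ports.

    launch(port) -> server handle, connect(port) -> connection or None,
    terminate(server) -> None. All three are hypothetical callables.
    """
    for _ in range(MAX_ATTEMPTS):
        port = pick_port(rng)
        server = launch(port)
        for _ in range(MAX_CONNECT_ATTEMPTS):
            conn = connect(port)
            if conn is not None:
                return port, conn
            sleep(CONNECT_DELAY)
        # No connection on this port: stop the server, back off, try another port.
        terminate(server)
        sleep(rng.randint(1, 5))
    raise RuntimeError(
        "failed to create a socket to the launched debug monitor after %d tries"
        % MAX_ATTEMPTS
    )
```

In this sketch the sleeps alone add up to at most 20 x (10 x 0.5 + 5) = 200 seconds before the exception is raised, not counting the time spent in the launch and connect calls themselves.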
We checked netstat at the beginning of a test run and got 164 connections in the state TIME_WAIT between the host and the target:
24 connections to `target IP`:1234 (platform)
100 connections to `target IP`:43107 (gdbserver)
40 connections to `target IP` with a random port
and 2 connections in the state ESTABLISHED.
We checked netstat again 15 minutes into the test run and got 641 connections in the state TIME_WAIT between the host and the target:
310 connections to `target IP`:1234 (platform)
331 connections to `target IP`:43107 (gdbserver)
and 9 connections in the state ESTABLISHED.
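Counts like the ones above can be collected with a small script instead of by hand. The following is only a sketch: it assumes netstat output in which the foreign address is the second-to-last column and the state is the last one (as printed by `netstat -tn` on Linux or `netstat -n` on Windows, without a PID column), and the helper name `count_connections` is made up for illustration.

```python
from collections import Counter


def count_connections(netstat_output, target_ip):
    """Count TCP connections to target_ip, grouped by (state, remote port)."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers, blank lines and non-TCP entries.
        if len(fields) < 4 or not fields[0].lower().startswith("tcp"):
            continue
        state = fields[-1]
        remote = fields[-2]  # assumed layout: foreign address, then state
        ip, sep, port = remote.rpartition(":")
        if sep and ip == target_ip:
            counts[(state, port)] += 1
    return counts


def total_in_state(counts, state):
    # Sum over all remote ports for one TCP state, e.g. "TIME_WAIT".
    return sum(n for (s, _port), n in counts.items() if s == state)
```

Running this periodically on the host, with the captured netstat text and the target's address as arguments, would show whether the TIME_WAIT total keeps growing during a test run and which remote ports account for it.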
Both buildbots run tests in 8 threads.
Both buildbots use Python 3.12. Note that the results with Python 3.13 are worse, probably due to an incremental GC. The average build/test time with Python 3.13 is longer.
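If some sockets in the test code are currently closed only when their objects are garbage-collected (an assumption, not something confirmed here), the moment of closing depends on the collector. A minimal sketch of closing a socket deterministically with a `with` block follows; the function name `open_and_close` is hypothetical.

```python
import socket


def open_and_close(host, port, timeout=2.0):
    """Connect to host:port and close the socket when the with-block exits."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        peer = sock.getpeername()
    # The socket is closed here regardless of when the GC runs;
    # fileno() returns -1 for a closed socket object.
    return peer, sock.fileno()
```

This does not remove TIME_WAIT entries, which are a normal part of closing a TCP connection. It only makes the point at which the descriptor is released independent of the Python version's GC behaviour.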
Increasing `MAX_ATTEMPTS = 20` in connect_to_debug_monitor() may be enough to fix error 1.
But I have no idea how to fix or even debug error 2. It is very hard to reproduce.