llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

lldb-server's GDB port map (--min/max-gdbserver-port) has a race condition when killing the debuggee on a slow remote system #97537

Open DavidSpickett opened 2 months ago

DavidSpickett commented 2 months ago

To reproduce this, the remote needs to be something relatively slow; in my case, QEMU.

Start an lldb-server on the remote in platform mode with some restricted ports:

$ ./lldb-server platform --server --listen 0.0.0.0:54321 --min-gdbserver-port 49140 --max-gdbserver-port 49150

Then connect lldb and run a program:

$ ./bin/lldb
(lldb) platform select remote-linux
  Platform: remote-linux
 Connected: no
(lldb) platform connect connect://127.0.0.1:54321
  Platform: remote-linux
    Triple: aarch64-unknown-linux-gnu
OS Version: 6.10.0 (6.10.0-rc4-g14d7c92f8df9)
  Hostname: e125016
 Connected: yes
WorkingDir: /home/davspi01
    Kernel: #1 SMP PREEMPT Tue Jun 18 15:14:36 BST 2024
(lldb) target create /tmp/test.o
Current executable set to '/tmp/test.o' (aarch64).
(lldb) b main
Breakpoint 1: where = test.o`main at test.c:1:21, address = 0x0000000000400574
(lldb) run
Process 263 launched: '/tmp/test.o' (aarch64)
Process 263 stopped
* thread #1, name = 'test.o', stop reason = breakpoint 1.1
    frame #0: 0x0000000000400574 test.o`main at test.c:1:21
-> 1    int main() { return 0; }

At this point lldb-server has started a new process to handle this client's connection and given it a port map containing one open port, which that process has used to start a gdbserver.

From here, if you let the program finish and then run it again, everything is fine. The gdbserver is torn down and the port is freed, then reused for the new gdbserver.

However, if you run again before the program finishes, there is a race between the platform killing the gdbserver process and the platform handling the launch-gdb-server request packet:

$ ./bin/lldb
(lldb) platform select remote-linux
  Platform: remote-linux
 Connected: no
(lldb) platform connect connect://127.0.0.1:54321
  Platform: remote-linux
    Triple: aarch64-unknown-linux-gnu
OS Version: 6.10.0 (6.10.0-rc4-g14d7c92f8df9)
  Hostname: e125016
 Connected: yes
WorkingDir: /home/davspi01
    Kernel: #1 SMP PREEMPT Tue Jun 18 15:14:36 BST 2024
(lldb) target create /tmp/test.o
Current executable set to '/tmp/test.o' (aarch64).
(lldb) b main
Breakpoint 1: where = test.o`main at test.c:1:21, address = 0x0000000000400574
(lldb) run
Process 263 launched: '/tmp/test.o' (aarch64)
Process 263 stopped
* thread #1, name = 'test.o', stop reason = breakpoint 1.1
    frame #0: 0x0000000000400574 test.o`main at test.c:1:21
-> 1    int main() { return 0; }
(lldb) run
There is a running process, kill it and restart?: [Y/n] y
Process 263 exited with status = 9 (0x00000009) killed
error: unable to launch a GDB server on 'e125016'

We can see that the packets are sent in the right order: the kill packet ($k) is sent and acknowledged before the qLaunchGDBServer request, which nonetheless comes back with an error reply:

lldb             <   5> send packet: $k#6b
lldb             <  19> read packet: $X09;process:115#cc
Process 277 exited with status = 9 (0x00000009) killed
<...>
lldb             <  34> send packet: $qLaunchGDBServer;host:e125016;#12
lldb             <   7> read packet: $E09#ae
error: unable to launch a GDB server on 'e125016'

On the server side it uses kill() to kill the process and it should then free the port here: https://github.com/llvm/llvm-project/blob/76c84e702bd9af7db2bb9373ba6de0508f1e57a9/lldb/tools/lldb-server/lldb-platform.cpp#L300

(introduced by https://github.com/llvm/llvm-project/pull/88845, which I think made the feature overall better, but in doing so exposed this issue)

If the remote is slow enough that the kill has not actually completed by the time we get back there, the port is not freed: the platform then looks for a free port, finds none, and fails to launch the gdb server. Some ad-hoc logging shows this on the server side:

265: port map begins --------------------   <<< 265 is the main process
first: 49140 second: 0
first: 49141 second: 0
first: 49142 second: 0
first: 49143 second: 0
first: 49144 second: 0
first: 49145 second: 0
first: 49146 second: 0
first: 49147 second: 0
first: 49148 second: 0
first: 49149 second: 0
265: port map ends--------------------
266: port map begins -------------------- <<< 266 is created to handle the connection to lldb
first: 49140 second: 0
266: port map ends--------------------   <<< the first gdbserver is launched on the only unallocated port, 49140
266: port map begins --------------------
first: 49140 second: 271
266: port map ends-------------------- <<< looks for a free port, but the one port is still allocated to pid 271
266: No free port found in map!  <<< error returned to lldb
266: FreePortForProcess 271 49140 <<< it finally notices that kill() has happened
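
To illustrate the underlying mechanism, here is a minimal POSIX sketch (my illustration, not LLDB code): kill() only queues the signal and returns, so until the platform actually observes the exit, the process, and therefore its port, still looks alive.

```
// Minimal POSIX sketch (not LLDB code): kill() returning success does not
// mean the child has exited. Until the exit is observed (as waitpid() does
// here), a port owned by the child would still look allocated.
#include <csignal>
#include <cstdio>
#include <sys/wait.h>
#include <unistd.h>

int main() {
  pid_t pid = fork();
  if (pid == 0) { // child: stands in for a gdbserver holding a port
    pause();
    _exit(0);
  }

  kill(pid, SIGKILL); // returns as soon as the signal is queued

  // Non-blocking check, like the platform handling the next packet;
  // on a slow system this can still report "not exited yet".
  if (waitpid(pid, nullptr, WNOHANG) == 0)
    printf("child not reaped yet; its port would still look in use\n");

  waitpid(pid, nullptr, 0); // only now is the process really gone
  printf("child reaped; the port can be freed\n");
  return 0;
}
```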

This race does not occur when debugging locally on real hardware.

I discovered this while re-running some of the SVE tests. They run the debuggee once to discover the supported vector lengths and then again to run the actual test case. Adding a sleep(5) in the test cases between those two runs also "fixes" the issue.

The workaround is to not use a port map at all, but giving a VM full network access is often difficult. So I'd like to find a way to make this work, or at least to work around it in the tests most likely to be run this way.

slydiman commented 2 months ago

I have investigated this issue a little bit.

The variable port is initialized with 0 and passed to LaunchGDBServer():
https://github.com/llvm/llvm-project/blob/0870afaaaccde5b4bae37abfc982207ffafb8332/lldb/tools/lldb-server/lldb-platform.cpp#L355-L359

It seems the condition if (!port) is never met, because std::optional<uint16_t> port has a value. That is a bug and must be fixed!
https://github.com/llvm/llvm-project/blob/0870afaaaccde5b4bae37abfc982207ffafb8332/lldb/source/Plugins/Process/gdb-remote/GDBRemoteCommunicationServerPlatform.cpp#L160-L169

So port = 0 is always passed to StartDebugserverProcess(), called from GDBRemoteCommunicationServerPlatform::LaunchGDBServer(). But StartDebugserverProcess() does not use the passed port value anyway; it updates port with a new value (child_port):
https://github.com/llvm/llvm-project/blob/76e37b1a08906620537440ebcd5162697079cba5/lldb/source/Plugins/Process/gdb-remote/GDBRemoteCommunication.cpp#L1159-L1171

Note that m_port_map.AssociatePortWithProcess(*port, pid) will silently fail here, because the new port value is missing from portmap_for_child:
https://github.com/llvm/llvm-project/blob/7b135f7c0881ef0718c5c83e4d8556c5fdb32d86/lldb/source/Plugins/Process/gdb-remote/GDBRemoteCommunicationServerPlatform.cpp#L209-L221

As a result, gdbserver_portmap is useless, because it does not reflect the port actually used. StartDebugserverProcess() will try to listen on 127.0.0.1:0 only if url is nullptr, but url is always defined if the protocol is tcp; otherwise pipes are used (binding to port zero).

Here is the log of the lldb-server processes (parent and child) on the remote Linux machine:

/home/ubuntu/lldb-server p --log-channels lldb all --listen *:1234 --server --min-gdbserver-port 1236 --max-gdbserver-port 1240
/home/ubuntu/lldb-server gdbserver tcp://[10.1.1.170]:0 --native-regs --pipe 6

Note that the port 0 in the url tcp://[10.1.1.170]:0 is itself a bug now, but any port in this url will be ignored anyway.
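
In isolation, the if (!port) pitfall mentioned above looks like this (a standalone sketch, not the LLDB code itself): initializing the optional with 0 makes it engaged, so !port checks for presence, not for zero.

```
#include <cstdint>
#include <cstdio>
#include <optional>

int main() {
  std::optional<uint16_t> port = 0; // engaged optional holding the value 0

  if (!port) // true only for std::nullopt, so never taken here
    printf("no port requested\n");

  if (port && *port == 0) // this is the actual state
    printf("port is present, but its value is 0\n");

  return 0;
}
```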

I don't see where the --min-gdbserver-port, --max-gdbserver-port and --gdbserver-port values are really used. Do we still need them? --port-offset is not used either.

Probably it is better to revert #88845 since the port mapping does not work as expected anyway. But #88845 caused test failures on cross builds.

DavidSpickett commented 1 month ago

I don't see where --min-gdbserver-port, --max-gdbserver-port and --gdbserver-port values are really used. Do we still need them? --port-offset is not used too.

Start from lldb/tools/lldb-server/lldb-platform.cpp instead: those options should be doing something if the user provided them. They are intended for situations where not all ports are available between the host and the VM.

They've always been a bit dodgy, but they are useful for working on simulators.
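
For example (an illustrative command, not from the thread): with QEMU's user-mode networking, only explicitly forwarded ports are reachable from the host, so the platform port and the gdbserver range each need a hostfwd entry, and --min/max-gdbserver-port keeps lldb-server inside that range:

```
$ qemu-system-aarch64 <machine options...> \
    -netdev user,id=net0,hostfwd=tcp::54321-:54321,hostfwd=tcp::49140-:49140,hostfwd=tcp::49141-:49141 \
    -device virtio-net-pci,netdev=net0
```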

slydiman commented 1 month ago

Currently the port in the url is always 0, because std::optional<uint16_t> port is initialized here and LaunchGDBServer() does not request the port map at all: https://github.com/llvm/llvm-project/blob/0870afaaaccde5b4bae37abfc982207ffafb8332/lldb/tools/lldb-server/lldb-platform.cpp#L355

Note that it is usually better to configure simulators (QEMU) to use a network bridge with their own IP and no port limits, rather than mapping ports to the host's ports manually.
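
(Something along these lines, as an illustration; the exact bridge setup depends on the host:)

```
$ qemu-system-aarch64 <machine options...> \
    -netdev bridge,id=net0,br=br0 \
    -device virtio-net-pci,netdev=net0
```

The guest then has its own IP on br0 and no per-port forwarding is needed.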

Note that I'm working on a patch to fix this issue and all the related port-mapping issues. The idea is to use threads for platform mode and a common port map for all connections, instead of making a new portmap_for_child with one port.
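
A rough sketch of that idea (hypothetical names and types, not LLDB's actual classes): one lock-protected map shared by all connection handlers, so a port freed on one connection is immediately visible to the others.

```
// Hypothetical sketch of a shared port map; illustrative only.
#include <cstdint>
#include <map>
#include <mutex>
#include <optional>

class SharedPortMap {
public:
  SharedPortMap(uint16_t min_port, uint16_t max_port) {
    for (uint32_t p = min_port; p <= max_port; ++p)
      m_map[static_cast<uint16_t>(p)] = 0; // 0 = unallocated
  }

  // Claim a free port for the gdbserver with the given pid.
  std::optional<uint16_t> Allocate(uint64_t pid) {
    std::lock_guard<std::mutex> lock(m_mutex);
    for (auto &[port, owner] : m_map) {
      if (owner == 0) {
        owner = pid;
        return port;
      }
    }
    return std::nullopt; // genuinely no free port
  }

  // Called once the gdbserver process is known to have exited.
  void FreeForProcess(uint64_t pid) {
    std::lock_guard<std::mutex> lock(m_mutex);
    for (auto &[port, owner] : m_map)
      if (owner == pid)
        owner = 0;
  }

private:
  std::mutex m_mutex;
  std::map<uint16_t, uint64_t> m_map; // port -> owning pid (0 = free)
};
```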