Try to compile pistacheio and your code with TSan (e.g. `-fsanitize=thread`) and reproduce the issue there.
It's hard to replicate, but from what I can see it may not be a threading issue. The `SSL_accept` call is hanging. That other fellow seems to be experiencing a very similar problem.
@dennisjenkins75, do you know where we are setting the timeout value, if any, for `SSL_accept`? We create an SSL context here. I've looked in the `ssl_create_context` function, but we don't appear to be setting any options via `SSL_CTX_set_*`.
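One blunt way to bound `SSL_accept` that comes to mind is a kernel-level I/O timeout on the accepted socket before the handshake. The following is only a sketch of the idea, not Pistache's actual code; `fd`, `ssl`, and the helper name are placeholders for whatever the listener already holds:

```cpp
#include <openssl/ssl.h>
#include <sys/socket.h>
#include <sys/time.h>

// Hypothetical sketch (not Pistache's code): bound the TLS handshake by
// giving the accepted socket kernel-level I/O timeouts before SSL_accept().
// 'fd' is the descriptor returned by accept(); 'ssl' is the SSL* bound to it.
static int accept_tls_with_timeout(int fd, SSL* ssl, int seconds)
{
    struct timeval tv { seconds, 0 };

    // With these set, a stalled peer makes the underlying read()/write()
    // fail with EAGAIN/EWOULDBLOCK instead of blocking indefinitely.
    setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
    setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv));

    return SSL_accept(ssl); // <= 0 on timeout or handshake failure
}
```

With those timeouts in place, a silent peer would turn the hang into an `SSL_ERROR_SYSCALL` failure that the listener could clean up.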
I've managed to reproduce the problem again recently (or rather, something else did). The same Pistache listener thread seems to be hung in `BIO_read` during a call to `Pistache::Tcp::Listener::handleNewConnection()`.
The file descriptor `4` that is being fed into its call to `__GI___libc_read` was a useful breadcrumb trail as I dove into `/proc`, starting with the hints in `/proc/<pid>/fdinfo/4`:

```
pos:    0
flags:  02000002
mnt_id: 9
ino:    469996
```
The `fdinfo` interface is described in `proc(5)`. The `pos` field is supposed to be the offset into the stream, so it would appear as though a connection was opened and nothing was received from the client. The `flags` value appears to be a combination of `O_CLOEXEC | O_RDWR`.
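As a quick sanity check of that octal value, the decomposition can be verified against the `<fcntl.h>` constants (a throwaway snippet, nothing Pistache-specific):

```cpp
#include <fcntl.h>
#include <cstdio>

int main()
{
    // fdinfo reports flags in octal: 02000002 should be exactly
    // O_RDWR (02) | O_CLOEXEC (02000000) on Linux x86-64.
    const int flags = 02000002;
    std::printf("access mode O_RDWR: %s\n",
                (flags & O_ACCMODE) == O_RDWR ? "yes" : "no");
    std::printf("O_CLOEXEC:          %s\n",
                (flags & O_CLOEXEC) ? "yes" : "no");
    std::printf("other bits:         %o\n",
                flags & ~(O_ACCMODE | O_CLOEXEC));
}
```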
When I try to look for `mnt_id` 9 in `/proc/<pid>/mountinfo`, there does not appear to be any mount containing this file, unless I'm reading it wrong (socket descriptors live on the kernel-internal sockfs pseudo-filesystem, which would explain why no matching mount is listed):
```
23 1 202:1 / / rw,relatime shared:1 - ext4 /dev/root rw,discard,errors=remount-ro
24 23 0:5 / /dev rw,relatime shared:2 - devtmpfs devtmpfs rw,size=2000720k,nr_inodes=500180,mode=755,inode64
25 23 0:21 / /proc rw,nosuid,nodev,noexec,relatime shared:5 - proc proc rw
26 23 0:22 / /sys rw,nosuid,nodev,noexec,relatime shared:6 - sysfs sysfs rw
27 26 0:6 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime shared:7 - securityfs securityfs rw
28 24 0:23 / /dev/shm rw,nosuid,nodev shared:3 - tmpfs tmpfs rw,inode64
29 24 0:24 / /dev/pts rw,nosuid,noexec,relatime shared:4 - devpts devpts rw,gid=5,mode=620,ptmxmode=000
30 23 0:25 / /run rw,nosuid,nodev shared:11 - tmpfs tmpfs rw,size=803020k,nr_inodes=819200,mode=755,inode64
31 30 0:26 / /run/lock rw,nosuid,nodev,noexec,relatime shared:12 - tmpfs tmpfs rw,size=5120k,inode64
32 26 0:27 / /sys/fs/cgroup rw,nosuid,nodev,noexec,relatime shared:8 - cgroup2 cgroup2 rw,nsdelegate,memory_recursiveprot
33 26 0:28 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime shared:9 - pstore pstore rw
34 26 0:29 / /sys/fs/bpf rw,nosuid,nodev,noexec,relatime shared:10 - bpf bpf rw,mode=700
35 25 0:30 / /proc/sys/fs/binfmt_misc rw,relatime shared:13 - autofs systemd-1 rw,fd=29,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=14761
36 24 0:31 / /dev/hugepages rw,relatime shared:14 - hugetlbfs hugetlbfs rw,pagesize=2M
37 24 0:19 / /dev/mqueue rw,nosuid,nodev,noexec,relatime shared:15 - mqueue mqueue rw
38 26 0:7 / /sys/kernel/debug rw,nosuid,nodev,noexec,relatime shared:16 - debugfs debugfs rw
39 26 0:12 / /sys/kernel/tracing rw,nosuid,nodev,noexec,relatime shared:17 - tracefs tracefs rw
40 26 0:32 / /sys/fs/fuse/connections rw,nosuid,nodev,noexec,relatime shared:18 - fusectl fusectl rw
41 26 0:20 / /sys/kernel/config rw,nosuid,nodev,noexec,relatime shared:19 - configfs configfs rw
63 30 0:33 / /run/credentials/systemd-sysusers.service ro,nosuid,nodev,noexec,relatime shared:20 - ramfs none rw,mode=700
87 23 7:0 / /snap/amazon-ssm-agent/5656 ro,nodev,relatime shared:43 - squashfs /dev/loop0 ro,errors=continue
90 23 7:1 / /snap/amazon-ssm-agent/6312 ro,nodev,relatime shared:45 - squashfs /dev/loop1 ro,errors=continue
96 23 7:3 / /snap/core18/2667 ro,nodev,relatime shared:49 - squashfs /dev/loop3 ro,errors=continue
99 23 7:4 / /snap/core20/1738 ro,nodev,relatime shared:51 - squashfs /dev/loop4 ro,errors=continue
102 23 7:5 / /snap/core20/1778 ro,nodev,relatime shared:53 - squashfs /dev/loop5 ro,errors=continue
105 23 7:6 / /snap/lxd/24239 ro,nodev,relatime shared:55 - squashfs /dev/loop6 ro,errors=continue
108 23 7:8 / /snap/snapd/17950 ro,nodev,relatime shared:57 - squashfs /dev/loop8 ro,errors=continue
114 23 7:9 / /snap/snapd/17883 ro,nodev,relatime shared:61 - squashfs /dev/loop9 ro,errors=continue
117 35 0:34 / /proc/sys/fs/binfmt_misc rw,nosuid,nodev,noexec,relatime shared:63 - binfmt_misc binfmt_misc rw
705 30 0:25 /snapd/ns /run/snapd/ns rw,nosuid,nodev - tmpfs tmpfs rw,size=803020k,nr_inodes=819200,mode=755,inode64
730 705 0:4 mnt:[4026532280] /run/snapd/ns/lxd.mnt rw - nsfs nsfs rw
341 23 7:10 / /snap/lxd/24323 ro,nodev,relatime shared:242 - squashfs /dev/loop10 ro,errors=continue
111 30 0:40 / /run/user/1001 rw,nosuid,nodev,relatime shared:59 - tmpfs tmpfs rw,size=401508k,nr_inodes=100377,mode=700,uid=1001,gid=1001,inode64
112 23 7:7 / /snap/core18/2679 ro,nodev,relatime shared:60 - squashfs /dev/loop7 ro,errors=continue
```
If I take a look at the `ino` (I think this is the socket's inode number) in the TCP socket table at `/proc/<pid>/task/<pid>/net/tcp`, I find ino 469996:

```
sl local_address rem_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode
(...)
118: <XXXXXXXX>:1B5B 464BD93C:A04E 01 00000000:00000000 00:00000000 00000000 115 0 469996 1 ffff99339b62cec0 82 0 0 10 -1
(...)
```
If I re-arrange this information so that it is more readable, I get:

| Field | Value |
|---|---|
| sl | 118 |
| local_address | XXXXXXXX:1B5B |
| rem_address | 464BD93C:A04E |
| st | 01 |
| tx_queue | 00000000 |
| rx_queue | 00000000 |
| tr | 00 |
| tm->when | 00000000 |
| retrnsmt | 00000000 |
| uid | 115 |
| timeout | 0 |
| inode | 469996 |

The trailing `1 ffff99339b62cec0 82 0 0 10 -1` appear to be kernel-internal extras (the socket's reference count, its kernel address, and assorted retransmission/congestion-window details).
The `st` field is the internal status of the socket according to `proc(5)` and these helpful notes; it appears to correspond to `TCP_ESTABLISHED` (1). The transmit and receive queues appear to be empty, and there does not appear to be a timeout pending.
The remote address `464BD93C` is `60.217.75.70`, which appears to be an IP address in China known for uninvited SSH handshakes, port scanning, and other abuse complained about thousands of times.
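For anyone following along, the address columns in `/proc/net/tcp` are hex in host byte order (little-endian on x86), so decoding just means reversing the bytes; a quick sketch:

```cpp
#include <cstdint>
#include <cstdio>

int main()
{
    // rem_address in /proc/net/tcp is the IPv4 address as little-endian
    // hex and the port as plain hex: 464BD93C:A04E.
    const std::uint32_t addr = 0x464BD93C;
    const unsigned      port = 0xA04E;

    std::printf("%u.%u.%u.%u:%u\n",
                addr & 0xFF, (addr >> 8) & 0xFF,
                (addr >> 16) & 0xFF, (addr >> 24) & 0xFF,
                port); // prints 60.217.75.70:41038
}
```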
It is somewhat worrying that a remote client can leave a Pistache listener thread hanging merely by connecting to it. This could be a DoS vulnerability in either OpenSSL or Pistache, unless I've done something wrong in my code. I realize I can simply ban the IP address, but that's not really a "fix".
@dennisjenkins75 or @Tachi107, any sagely wisdom? I have preserved a core dump of the service.
- Scan a publicly open pistacheio SSL server with https://www.ssllabs.com/ssltest/.
- Find an equivalent open-source SSL scanner that can be run locally. We probably should fuzz-test Pistache.

As for debugging the random sample that you have, I have no advice; I don't have time right now to dig deep into it. I would advise that we attempt to find a repro of the issue using some OSS SSL vulnerability/fuzz tester, file bugs, and then work to fix them.
Thanks @dennisjenkins75. I tried an SSL scanner and the only issue it identified was that my certificate was self-signed (shouldn't be an issue). Everything else seemed fine.
I did, however, finally find out how to reproduce the problem quite easily. All I have to do is open a connection (e.g. via telnet) to a Pistache server, and then initiate a shutdown of the service. It hangs as long as the connection remains open. I will dig in further tomorrow.
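For the record, the reproduction needs nothing more than an idle TCP connection; a minimal stand-in for the telnet step (host and port here are placeholders):

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main()
{
    // Open a plain TCP connection to the HTTPS listener and then go idle:
    // the server's SSL_accept() blocks waiting for a ClientHello that never
    // comes, and a shutdown issued while we sit here will hang.
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    sockaddr_in sa{};
    sa.sin_family = AF_INET;
    sa.sin_port   = htons(7003);                    // placeholder port
    inet_pton(AF_INET, "127.0.0.1", &sa.sin_addr);  // placeholder host

    if (connect(fd, reinterpret_cast<sockaddr*>(&sa), sizeof(sa)) == 0)
        pause(); // hold the connection open until interrupted

    close(fd);
}
```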
@dennisjenkins75, I created a PR that appears to fix the issue at my end. I didn't, however, have time to figure out how to add another unit test. @Tachi107, your eyes welcome too.
Fixed by 101ee7cc96c7b10b12156388c41fe3bed6f878fa
This may be related to #708 or #761, but I've noticed an issue recently with one of my servers that appears to hang at the `acceptThread.join()` call here. This appears to happen occasionally on destruction via `Pistache::Http::Endpoint::~Endpoint()`. The `acceptThread.join()` is just a wrapper for `std::thread::join()`, of course.

I have two separate core dumps of the same process: one from before I asked it to terminate gracefully via a SIGTERM signal, and the other from after issuing the signal. In both I see a Pistache listener thread that appears to be hanging in OpenSSL's `BIO_read()`. I'm pretty sure this is the problem because that thread still remains in the post-SIGTERM core dump, after all other threads besides main have exited.

The `acceptThread.join()` is probably deadlocked waiting for that listener thread to finish what it's doing. `../src/server/listener.cc:481` is the call to `SSL_accept(ssl_data)`. So the handshake appears to hang every now and then, rather than timing out.

This then raises the question of what is making `SSL_accept(ssl_data)` hang indefinitely. This may have always been a problem, but perhaps something in the network topology my server sits in recently made it evident. Perhaps the workaround is to specify a timeout?

This person's issue seems very similar to mine in that, as soon as the hang occurs, the server will not accept any new connections.
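If a timeout does turn out to be the right workaround, one common pattern is to make the accepted socket non-blocking and drive the handshake with `poll` so a silent peer eventually fails instead of hanging in `BIO_read`; a rough sketch (hypothetical helper, not Pistache's actual fix):

```cpp
#include <openssl/ssl.h>
#include <fcntl.h>
#include <poll.h>

// Hypothetical sketch (not Pistache's actual fix): make the accepted socket
// non-blocking and give each handshake step a bounded wait, so a silent
// peer produces a failure instead of an indefinite hang inside BIO_read().
static bool accept_tls_with_deadline(int fd, SSL* ssl, int step_timeout_ms)
{
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);

    for (;;)
    {
        const int rc = SSL_accept(ssl);
        if (rc == 1)
            return true; // handshake complete

        const int err = SSL_get_error(ssl, rc);
        if (err != SSL_ERROR_WANT_READ && err != SSL_ERROR_WANT_WRITE)
            return false; // hard handshake failure

        // Wait for the socket to become ready, but only for so long.
        pollfd pfd{fd, static_cast<short>(
                       err == SSL_ERROR_WANT_READ ? POLLIN : POLLOUT), 0};
        if (poll(&pfd, 1, step_timeout_ms) <= 0)
            return false; // timed out (or poll failed): give up on this peer
    }
}
```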