Windows: slirp segfault

rakslice commented 3 years ago

In B2, a frequent slirp segfault is

#0  0x0053738e in soread (so=so@entry=0x831ef38) at ../slirp/socket.c:104
#1  0x00534f9c in slirp_select_poll (readfds=readfds@entry=0x881fc0c, writefds=writefds@entry=0x881fd10, xfds=xfds@entry=0x881fe14) at ../slirp/slirp.c:412
#2  0x00410284 in slirp_receive_func (arg=0x0) at ether_windows.cpp:1392

here initializing at the top of soread:

104             int mss = so->so_tcpcb->t_maxseg;

with so->so_tcpcb 0 or 0xfeeefeee (MS C runtime placeholder value indicating the heap at the location has already been freed, so in this instance presumably the struct socket pointed to by so has already been freed.)

I don't have an exact procedure to cause this segfault 100% of the time, but doing a process of launching B2, running Netscape 2 in Mac OS 7.6.1, with its stock Open Transport 1.1.1, and visiting and browsing around a couple of sites, which includes

a local python SimpleHTTPServer that I use to serve my classic mac files
system7today.com

and then shutting down, I usually reproduce the problem in two or three boots.

[Update: A more reliable way to hit the bug is something that definitely closes a TCP connection locally on the emulated Mac, such as closing a telnet connection in NCSA Telnet]

rakslice commented 3 years ago

At this point in slirp_select_poll, looking at the "Check TCP sockets" for loop we are inside, at the top of the loop the first thing it does is cache the so 1 ahead after the current one into so_next.

    /*
     * Check sockets
     */
    if (link_up) {
        /*
         * Check TCP sockets
         */
        for (so = tcb.so_next; so != &tcb; so = so_next) {
            so_next = so->so_next;

Since I just wrote code precisely like this in the loop that rearranges the drive queue, ;) I'm going to presume that this is a cheap trick to get around the fact that the code within the loop in some case needs to modify the list while it is iterating through it, i.e. it needs to take the current node out of the list and wants to not lose its place when it does that.

However if, in the course of iterating through the list, there is a possibility of the next node after the current one incidentally being modified in such a way that it becomes freed or otherwise so_tcpcb-invalid, and it is taken out of the list to avoid a problem, then so_next is actually a liability because it will cause the iteration to still visit it.

rakslice commented 3 years ago

For the case where so->so_tcpcb is 0, it is being set to 0 during tcp_close.

There are more than one ways into tcp_close:

From within slirp_select_poll cleaning up sockets:

#0  tcp_close (tp=0x81ea840) at ../slirp/tcp_subr.c:282
#1  0x00533e04 in tcp_slowtimo () at ../slirp/tcp_timer.c:94
#2  0x00535152 in slirp_select_poll (readfds=readfds@entry=0x883fc0c,
    writefds=writefds@entry=0x883fd10, xfds=xfds@entry=0x883fe14)
    at ../slirp/slirp.c:374
#3  0x00410284 in slirp_receive_func (arg=0x0) at ether_windows.cpp:1392
#4  0x76726cff in msvcrt!_beginthreadex () from C:\WINDOWS\System32\msvcrt.dll
#5  0x76726dc1 in msvcrt!_endthreadex () from C:\WINDOWS\System32\msvcrt.dll
#6  0x764ffa29 in KERNEL32!BaseThreadInitThunk ()
   from C:\WINDOWS\System32\kernel32.dll
#7  0x772975f4 in ntdll!RtlGetAppContainerNamedObjectPath ()
   from C:\WINDOWS\SYSTEM32\ntdll.dll
#8  0x772975c4 in ntdll!RtlGetAppContainerNamedObjectPath ()
   from C:\WINDOWS\SYSTEM32\ntdll.dll
#9  0x00000000 in ?? ()

From input. This comes from a different thread (!)

(gdb) bt
#0  tcp_close (tp=tp@entry=0x81ccf00) at ../slirp/tcp_subr.c:294
#1  0x0052e57d in tcp_input (m=m@entry=0x81ca0a8, iphlen=<optimized out>,
    iphlen@entry=20, inso=inso@entry=0x0) at ../slirp/tcp_input.c:978
#2  0x0053610b in ip_input (m=0x81ca0a8) at ../slirp/ip_input.c:222
#3  0x005353a6 in slirp_input (pkt=<optimized out>, pkt_len=72)
    at ../slirp/slirp.c:637
#4  0x004108d2 in ether_thread_write_packets (arg=0x0)
    at ether_windows.cpp:1021
#5  0x76726cff in msvcrt!_beginthreadex () from C:\WINDOWS\System32\msvcrt.dll
#6  0x76726dc1 in msvcrt!_endthreadex () from C:\WINDOWS\System32\msvcrt.dll
#7  0x764ffa29 in KERNEL32!BaseThreadInitThunk ()
   from C:\WINDOWS\System32\kernel32.dll
#8  0x772975f4 in ntdll!RtlGetAppContainerNamedObjectPath ()
   from C:\WINDOWS\SYSTEM32\ntdll.dll
#9  0x772975c4 in ntdll!RtlGetAppContainerNamedObjectPath ()
   from C:\WINDOWS\SYSTEM32\ntdll.dll
#10 0x00000000 in ?? ()

rakslice commented 3 years ago

When I run into the segfault in the slirp_receive_func thread with so->so_tcpcb equal 0, what the ether_thread_write_packets thread is doing at the time is that, as hinted at by tcp_close being used from multiple threads, it is past the so->so_tcpcb = 0 line and not yet to sofree(so):

104             int mss = so->so_tcpcb->t_maxseg;
(gdb) bt
#0  0x0053738e in soread (so=so@entry=0xcdc5e40) at ../slirp/socket.c:104
#1  0x00534f9c in slirp_select_poll (readfds=readfds@entry=0x883fc0c,
    writefds=writefds@entry=0x883fd10, xfds=xfds@entry=0x883fe14)
    at ../slirp/slirp.c:412
#2  0x00410284 in slirp_receive_func (arg=0x0)
    at C:/msys32/home/Andrew Tonner/src/macemu/BasiliskII/src/Windows/ether_windows.cpp:1420
#3  0x76726cff in msvcrt!_beginthreadex () from C:\WINDOWS\System32\msvcrt.dll
#4  0x76726dc1 in msvcrt!_endthreadex () from C:\WINDOWS\System32\msvcrt.dll
#5  0x764ffa29 in KERNEL32!BaseThreadInitThunk ()
   from C:\WINDOWS\System32\kernel32.dll
#6  0x772975f4 in ntdll!RtlGetAppContainerNamedObjectPath ()
   from C:\WINDOWS\SYSTEM32\ntdll.dll
#7  0x772975c4 in ntdll!RtlGetAppContainerNamedObjectPath ()
   from C:\WINDOWS\SYSTEM32\ntdll.dll
#8  0x00000000 in ?? ()
(gdb) thread 9
[Switching to thread 9 (Thread 28060.0x6574)]
#0  0x77278b7a in ntdll!RtlGetCurrentServiceSessionId ()
   from C:\WINDOWS\SYSTEM32\ntdll.dll
(gdb) bt
#0  0x77278b7a in ntdll!RtlGetCurrentServiceSessionId ()
   from C:\WINDOWS\SYSTEM32\ntdll.dll
#1  0x257ecf32 in ?? ()
#2  0x0cdc5f80 in ?? ()
#3  0x772b621d in ntdll!RtlGetNtGlobalFlags ()
   from C:\WINDOWS\SYSTEM32\ntdll.dll
#4  0x0cdc5f80 in ?? ()
#5  0x77278786 in ntdll!RtlFreeHeap () from C:\WINDOWS\SYSTEM32\ntdll.dll
#6  0x7730f687 in ntdll!RtlRegisterSecureMemoryCacheCallback ()
   from C:\WINDOWS\SYSTEM32\ntdll.dll
#7  0x772c67e0 in ntdll!RtlCaptureStackContext ()
   from C:\WINDOWS\SYSTEM32\ntdll.dll
#8  0x772b621d in ntdll!RtlGetNtGlobalFlags ()
   from C:\WINDOWS\SYSTEM32\ntdll.dll
#9  0x0cdc5f80 in ?? ()
#10 0x77278786 in ntdll!RtlFreeHeap () from C:\WINDOWS\SYSTEM32\ntdll.dll
#11 0x76707379 in msvcrt!free () from C:\WINDOWS\System32\msvcrt.dll
#12 0x02090000 in ?? ()
#13 0x005318b0 in tcp_close (tp=tp@entry=0xcdc5ec0) at ../slirp/tcp_subr.c:303
#14 0x0052e54d in tcp_input (m=m@entry=0xcdbcd00, iphlen=<optimized out>,
    iphlen@entry=20, inso=inso@entry=0x0) at ../slirp/tcp_input.c:977
#15 0x00535f8b in ip_input (m=0xcdbcd00) at ../slirp/ip_input.c:222
#16 0x00535226 in slirp_input (pkt=<optimized out>, pkt_len=72)
    at ../slirp/slirp.c:628
#17 0x004108d2 in ether_thread_write_packets (arg=0x0)
    at C:/msys32/home/Andrew Tonner/src/macemu/BasiliskII/src/Windows/ether_windows.cpp:1037
#18 0x76726cff in msvcrt!_beginthreadex () from C:\WINDOWS\System32\msvcrt.dll
#19 0x76726dc1 in msvcrt!_endthreadex () from C:\WINDOWS\System32\msvcrt.dll
#20 0x764ffa29 in KERNEL32!BaseThreadInitThunk ()
   from C:\WINDOWS\System32\kernel32.dll
#21 0x772975f4 in ntdll!RtlGetAppContainerNamedObjectPath ()
   from C:\WINDOWS\SYSTEM32\ntdll.dll
#22 0x772975c4 in ntdll!RtlGetAppContainerNamedObjectPath ()
   from C:\WINDOWS\SYSTEM32\ntdll.dll
#23 0x00000000 in ?? ()
(gdb) f 13
#13 0x005318b0 in tcp_close (tp=tp@entry=0xcdc5ec0) at ../slirp/tcp_subr.c:303
303             sbfree(&so->so_snd);
(gdb) list
298             /* clobber input socket cache if we're closing the cached connection */
299             if (so == tcp_last_so)
300                     tcp_last_so = &tcb;
301             closesocket(so->s);
302             sbfree(&so->so_rcv);
303             sbfree(&so->so_snd);
304             sofree(so);
305             tcpstat.tcps_closed++;
306             return ((struct tcpcb *)0);
307     }

rakslice commented 3 years ago

I imagine that there is some way to a fix this specific problem where we

reorder tcp_close so that the socket being closed is taken out of active processing (e.g. removed from the list) first, not left in, before it is invalidated in such a way that causes undefined behaviour in other functions trying to use it
provide some kind of safeguard against tcp_close of a socket happening while the select poll tcp for loop is visiting that socket
deal with the for loop's so_next potentially becoming invalid because of a close from the other thread

rakslice commented 3 years ago

But if slirp already doesn't have a kind of critical section it uses, this is a big hint that it's not designed to have its functions called concurrently on multiple threads.

rakslice commented 3 years ago

Testing in Linux, weirdly all the tcp_close calls come from tcp_slowtimo...

That's a little strange :D

rakslice commented 3 years ago

Okay, so in ether_unix, the emulated machine is connected to slirp with pipes for some reason. The emulated machine's call to send a packet just puts it into the pipe.

slirp_input(), which is called for the packets inbound to slirp from the emulated machine, in ether_unix is actually done inside the slirp_receive_func thread. It alternates between selects for slirp's sockets and a select on the inbound packets pipe, which has its own set of possible consequences, but at least guarantees that only one slirp function call can be in progress at once.

That means that this bug doesn't apply to Unix.

But the question of why in practice no sockets close due to tcp_input is still unanswered.

rakslice commented 3 years ago

Linux: Ah, this is just down to normal browser behaviour, I guess; a quick test of closing a session from the local side in NCSA Telnet does call slirp_input as expected.

rakslice commented 3 years ago

Running B2 in Windows, a more likely way to hit the bug than doing things in the browser is closing a telnet session in NCSA Telnet locally (i.e. from the client side)

rakslice / macemu

Windows: slirp segfault #39