tidwall / neco

Concurrency library for C (coroutines)
MIT License
1.1k stars 79 forks source link

Segfault in the HTTP server example code #1

Closed j4cobgarby closed 4 months ago

j4cobgarby commented 4 months ago

To reproduce

1) Download neco.{c,h} and the example HTTP server code from the README, which is:

#include <stdio.h>
#include <unistd.h>
#include "neco.h"

void request(int argc, void *argv[]) {
    int fd = *(int*)argv[0];
    char req[256];
    int n = neco_read(fd, req, sizeof(req));
    if (n > 0) {
        char res[] = "HTTP/1.0 200 OK\r\n"
                     "Content-Type: text/html\r\n"
                     "Content-Length: 21\r\n"
                     "\r\n"
                     "<h1>Hello Neco!</h1>\n";
        neco_write(fd, res, sizeof(res));
    }
    close(fd);
}

int neco_main(int argc, char *argv[]) {
    int servfd = neco_serve("tcp", "127.0.0.1:8080");
    if (servfd < 0) {
        printf("neco_serve: %s\n", neco_strerror(servfd));
        return 0;
    }
    printf("Serving at http://127.0.0.1:8080\n");
    while (1) {
        int fd = neco_accept(servfd, 0, 0);
        if (servfd < 0) {
            printf("neco_accept: %s\n", neco_strerror(fd));
            continue;
        }
        neco_start(request, 1, &fd);
    }
    return 0;
}

2) gcc neco.c main.c && ./a.out 3) It runs fine, and serves the HTML once, but segfaults immediately after.

$ ./a.out
Serving at http://127.0.0.1:8080
Segmentation fault (core dumped)

Debugging

I had a brief look at the trace in GDB:

Serving at http://127.0.0.1:8080
[New Thread 0x7ffff56006c0 (LWP 30212)]
[New Thread 0x7ffff4c006c0 (LWP 30214)]

Thread 2 "a.out" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff56006c0 (LWP 30212)]
0x00007ffff7c93c34 in pthread_detach@GLIBC_2.2.5 () from /lib64/libc.so.6
(gdb) where
#0  0x00007ffff7c93c34 in pthread_detach@GLIBC_2.2.5 () from /lib64/libc.so.6
#1  0x0000000000405876 in worker_entry (arg=0x41f9f0) at neco.c:2744
#2  0x00007ffff7c92bb2 in start_thread () from /lib64/libc.so.6
#3  0x00007ffff7d1400c in clone3 () from /lib64/libc.so.6
(gdb) frame 1
#1  0x0000000000405876 in worker_entry (arg=0x41f9f0) at neco.c:2744
2744                    pthread_detach(thread->th);
(gdb) print thread
$1 = (struct worker_thread *) 0x41f9f0
(gdb) print thread->th
$2 = 0

Presumably this thread->th == 0 is the cause of the problem, but I'm not familiar enough with the codebase to have a look into this any further.

Probably worth noting that other examples that I tried so far worked fine. (Nevermind, looks like I'm getting the same issue with the echo server in examples/; it segfaults before I even run the client, in fact.)

j4cobgarby commented 4 months ago

Looking a bit further into the source code, I've been able to fix this issue with the following patch:

diff --git a/neco.c b/neco.c
index 322265c..51d8110 100644
--- a/neco.c
+++ b/neco.c
@@ -2739,10 +2739,10 @@ static void *worker_entry(void *arg) {
         ts.tv_sec += 1;
         pthread_cond_timedwait(&thread->cond, &thread->mu, &ts);
         if (thread->len == 0) {
-            thread->th = 0;
             if (!thread->end) {
                 pthread_detach(thread->th);
             }
+            thread->th = 0;
             thread->end = false;
             break;
         }

(i.e., moving the thread->th = 0; until after the call to pthread_detach)

Let me know if this is a "correct" fix, or if the previous order was intended. If so, I'd be happy to submit a PR :)

tidwall commented 4 months ago

Your code looks like it will fix an error related to the pthread_deatch being passed a zero value. But I'm unable to reproduce an actual SEGV crash on my side.
What operating system and architecture are you running?

j4cobgarby commented 4 months ago

I'm running Opensuse Tumbleweed on Linux 6.8.1. In terms of architecture, it's some dell laptop with an Intel i7. It is a strange issue, because clearly the code works fine for you, I just don't understand how -- maybe certain implementations of pthread_detach gracefully handle invalid threads?

j4cobgarby commented 4 months ago

I wonder if the difference is due to how our systems are giving IDs to threads. If I run the following code:

#include <pthread.h>
#include <stdio.h>

void *test(void *_) {
return NULL;
}

int main() {
    pthread_t th;
    pthread_create(&th, NULL, &test, NULL);

    printf("Created thread: %lu\n", th);

    pthread_detach(th);
}

I see "Created thread: 140478894835392", but depending on what OS you're on, the first thread could potentially be given the id 0? This would explain why you're not getting a segfault, however I believe in that case it's still incorrect to explicitly set the thread to 0 before detaching it.

tidwall commented 4 months ago

I agree that it must be related to a different in how systems handles passing a zero to pthread_detach. I would expect it to return an ESRCH error, and gracefully continue operating. But still, you identified a valid issue with thread being set to zero before detaching. I think that will probably fix the issue.

I would accept a PR if you are willing.

j4cobgarby commented 4 months ago

That's a good point, the man page agrees with you that it should return a ESRCH error... However even with something as simple as

#include <pthread.h>

int main() {
    pthread_detach(0);
}

I get a segfault - maybe this is a bug specifically in whatever version of libc I have.

I'll submit a PR in a minute :)

tidwall commented 4 months ago

Great thanks! All is merged.

j4cobgarby commented 4 months ago

thanks :)