So one coroutine is waiting on a descriptor and another coroutine closes said descriptor, correct?
Given the unclear descriptor ownership, that doesn't sound like something that could possibly work. Descriptors are for sending/receiving data, not for synchronization among coroutines.
What are you trying to achieve?
@sustrik In this case it's an HTTP server:

With the first item, the TCP server has to be "closed" externally because it'll be waiting on fdin forever. With this bug, it doesn't matter how you structure the code: the underlying socket can be closed and dangling polling information is left behind until epoll fires an event.
I've thought about using a channel, but it doesn't solve the issue that fd_close closes the socket and leaves dangling cached info around that can result in zombie notifications to a newer descriptor with the same number.
I guess in conclusion this issue should be more about how fd_close calls fdclean and then close on the descriptor. Depending on the situation, fdclean can fail if there is a waiting fdin or fdout, which can leave some polling subscriptions in place. When those subscriptions are left in place but the descriptor is actually closed, a new descriptor with the same number may receive zombie events.
Two possible scenarios for closing an HTTP server coroutine:

1. The TCP peer (HTTP client) closes the connection. In that case you get an error and can close the fd cleanly.
2. The server itself wants to close the connection. To do so, it can do hclose() on the HTTP server coroutine in question. If the coroutine is blocked in fdin() it will unblock with an ECANCELED error.
For more context see also here: http://libdill.org/structured-concurrency.html
@sustrik This is a more fundamental issue with libdill: a socket is closed and a file descriptor is left behind with dangling cached state, which can spill over to a newly acquired file descriptor.

In regards to the HTTP server, the first workaround really isn't the same issue; closing the connected peer connection is fine, it's "restarting/stopping" the server connection that is an issue.
I don't think having to cancel the whole coroutine and all its children is acceptable. We spoke a lot with @raedwulf earlier and he said the intention of hclose on a coroutine was process shutdown, not general purpose interrupts; in practice it seemed like this was the case too (our database connection pool was getting destroyed when a coroutine using it was closed). I may be wrong, but my experience with using hclose on a coroutine resulted in a process state where you might as well just exit and restart.

At this point I've tried a bunch of things, but it doesn't look like it's possible to wake up an fdin on a listening socket without destroying nearly the entire process.
The private fd_close should at a minimum assert if fdclean is unsuccessful, because it'd be far better to crash out than have to chase down why zombie polling events are showing up on a new descriptor.
Sorry, I am still striving to understand the problem.
Why is closing a single coroutine a problem? It's not like you have to shut down the entire process. Also, it's fast (nanoseconds). Also, you can intercept the shutdown signal and clean up whatever needs cleaning up. Thus no resource leaks.
If the problem is a database connection shared between multiple coroutines (or similar) why not simply use reference counting to manage its lifetime?
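For illustration, reference counting over a shared libdill handle could look roughly like the sketch below. The shared_conn type and the acquire/release helpers are hypothetical, not part of libdill's API; the only real call used is hclose():

#include <assert.h>
#include <libdill.h>

/* Hypothetical wrapper around a libdill handle (e.g. a DB connection)
   that several coroutines use at the same time. */
struct shared_conn {
    int handle;    /* the underlying libdill handle */
    int refcount;  /* number of coroutines currently holding it */
};

/* Take a reference before handing the connection to a coroutine. */
static void conn_acquire(struct shared_conn *c) {
    c->refcount++;
}

/* Drop a reference. The handle is only closed once the last user lets
   go, so no coroutine ever has it closed out from under it. */
static void conn_release(struct shared_conn *c) {
    assert(c->refcount > 0);
    if(--c->refcount == 0)
        hclose(c->handle);
}

Since libdill coroutines are cooperatively scheduled on a single thread, a plain int counter is enough; no atomics are needed.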
@sustrik Sorry, I should try to explain better as per the info in #125: Coroutine "A" is waiting on fdin for a database connection pool, and then hclose is called on Coroutine "A".

The issue is it's very destructive: if a coroutine is waiting on a TCP event and the coroutine is closed, the TCP connection is forcefully destroyed with no opportunity to recover.
I couldn't find a way to isolate the TCP connection away. If my memory serves me right, even if you use a channel to pass the reads/writes to the TCP connection it still gets caught up in the mess.

Normally this isn't an issue, but if you're mid database transaction in an HTTP request and the client drops the connection, it's really painful to destroy the database connection; it could just roll itself back to a known state pretty easily.

Anyways, because of this we've refused to really use hclose while coroutines are actually running; we're just trying to find a way to interrupt tcp_accept, but it doesn't look like there is really a way to do that.
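For reference, one possible workaround is to stop blocking in tcp_accept forever and instead poll with a short deadline, re-checking a stop flag on every ETIMEDOUT. The sketch below is only an illustration and assumes a ~100 ms wake-up granularity is acceptable; stop_requested and the loop are made up, not a libdill feature:

#include <errno.h>
#include <stdio.h>
#include <libdill.h>

static int stop_requested = 0;   /* set by whichever coroutine wants the server gone */

coroutine void acceptor(int listener) {
    while(!stop_requested) {
        /* Wake up every 100 ms instead of blocking forever, so the stop
           flag gets a chance to be observed. */
        int s = tcp_accept(listener, NULL, now() + 100);
        if(s < 0) {
            if(errno == ETIMEDOUT) continue;   /* no client yet, poll again */
            printf("accept failed: %d\n", errno);
            break;
        }
        /* ... hand s off to a per-connection coroutine here ... */
        tcp_close(s, -1);
    }
}

The trade-off is latency versus wasted wake-ups: the shorter the deadline, the quicker the server notices the stop flag, but the more often it wakes up for nothing.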
Got it.
The reasoning for that design is as follows: when the user is receiving data from a socket and gets interrupted by hclose(), the system is in an ill-defined intermediary state. Say the user asked for 100 bytes, but only 56 bytes were read so far. Consider what the receive function could do:

1. Read the remaining 44 bytes before interrupting. But that would mean that the termination process could be blocked forever by a malevolent 3rd party (all it has to do is not send those additional bytes).
2. Push the 56 bytes already received back to the socket. But POSIX offers no way to do that. And frankly, it couldn't even if it wanted to.
3. Mark the socket as broken and continue.

What would your preferred option look like? Does your protocol for speaking to the DB even allow for interruption and subsequent recovery at an arbitrary point in the bytestream?
@sustrik Yeah, I get it, it's really tricky. I'm trying to reproduce a weird edge case I swore we had where even putting the TCP connection behind a channel resulted in it getting hclose sent on it, but I think maybe we just were not handling an ECANCELED error on the TCP connection side properly.

Regardless, I still think it should be possible to call hclose on a tcp handle that is currently being blocked on and not have the pollset left in a state where you can get zombie notifications?
Sure, but you have to do fdclean() to clean up the cache:
rc = fdin(s);
if(rc == -1 && errno == ECANCELED) {
    fdclean(s);
    close(s);
    return;
}
@sustrik That's what I'm saying, though: if you look at tcp_listener_hclose, it calls fd_close, which calls fdclean, but because an fdin is currently waiting on the descriptor the fdclean fails (silently) while still closing the descriptor, which leaves a bunch of epoll subscriptions cached for a descriptor that the application no longer cares about. If you then fire up a new server it'll re-use the same descriptor number and you get zombie events.

I'm not even sure this is limited to epoll; kqueue also does not send events once a file descriptor is closed.
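To make the descriptor-reuse part concrete, here's a small standalone sketch (plain POSIX sockets, no libdill) of what I mean: the kernel hands out the lowest free descriptor number, so right after a close the next socket typically gets the very same number, and any stale polling state keyed on that number now points at the new socket.

#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>

int main(void) {
    /* Open a socket, close it, then open another one. The second call
       usually returns the number we just closed. */
    int first = socket(AF_INET, SOCK_STREAM, 0);
    close(first);
    int second = socket(AF_INET, SOCK_STREAM, 0);
    printf("first fd: %d, second fd: %d\n", first, second);
    /* Any pollset entries still registered for 'first' would now be
       associated with 'second' purely by number. */
    close(second);
    return 0;
}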
Aaaaaah. That sounds like a bug. Let's make a simple test program to reproduce...
@sustrik I think this will do the trick:
#include <stdio.h>
#include "libdill.h"

static int server_handle = 0;
static int server_id = 0;

void server() {
    int id = server_id++;
    struct ipaddr addr;
    ipaddr_local(&addr, "127.0.0.1", 5566, IPADDR_IPV4);
    server_handle = tcp_listen(&addr, 128);
    if(server_handle == -1) {
        printf("failed to listen: %i (%i)\n", server_handle, errno);
        return;
    }
    while(id == server_id) {
        struct ipaddr client_addr;
        int client_s = tcp_accept(server_handle, &client_addr, -1);
        if(client_s == -1) {
            printf("failed to accept: %i (%i)\n", client_s, errno);
            return;
        }
        msleep(now() + 2000);
        tcp_close(client_s, -1);
    }
}

void testcase(void) {
    int server1 = go(server);
    msleep(now() + 1000);
    // call hclose which calls tcp_close on the tcp server
    hclose(server_handle);
    msleep(now() + 1000);
    // this server will immediately fail with EBUSY due to a zombie event from the first server
    int server2 = go(server);
    msleep(now() + 5000);
    hclose(server1);
    hclose(server2);
}
But we use libdill in Swift and my pure C is a bit rusty! I should note it still reproduces if the port is different…

This should fire up a server; the first hclose will shut it down fine, but then the second server will just immediately fail with EBUSY because of the zombie events.
Thanks! I'll have a look after the lunch.
@sustrik no worries -- I updated it to actually not close the coroutines (because that works but then you can't see the bug)
Ok, I've submitted a patch to the mainline. Can you check now?
Checking now!
@sustrik I think maybe I still didn't explain it well enough: the bug arises already when using hclose, which routes through tcp_listener_hclose.

Calling hclose fires:
static void tcp_listener_hclose(struct hvfs *hvfs) {
    struct tcp_listener *self = (struct tcp_listener*)hvfs;
    fd_close(self->fd);
    free(self);
}
Which fires:
void fd_close(int s) {
    fdclean(s);
    struct linger lng;
    lng.l_onoff=1;
    lng.l_linger=0;
    setsockopt(s, SOL_SOCKET, SO_LINGER, (void*)&lng, sizeof(lng));
    /* We are not checking the error here. close() has inconsistent behaviour
       and leaking a file descriptor is better than crashing the entire
       program. */
    close(s);
}
But in fd_close the fdclean actually returns an error, because there is a waiting fdin (tcp_accept calls fd_accept, which in turn waits on fdin) in a different coroutine. It skips over that error and closes the socket anyway.

Once the socket is closed, if you open a new server socket the OS can give the process the same file descriptor number, but there's still pollset info cached for that descriptor number against the previously closed socket.
I couldn't see any easy way to fix this. I tried to add a terminal flag to dill_pollset_clean which would force it to clean the descriptor regardless of waiters, but that breaks more than it fixes…
I've added an assert to fd_close() to check the result code from fdclean().
Then I've run the test program you've pasted above.
No failures. Does it fail for you?
@sustrik Sorry, I should have really fully bootstrapped the test case; try this one:
#include <stdio.h>
#include <libdill.h>

static int server_handle = 0;
static int server_id = 0;

coroutine void server() {
    int id = ++server_id;
    printf("[%i] server did start\n", id);
    struct ipaddr addr;
    ipaddr_local(&addr, "127.0.0.1", 5566, IPADDR_IPV4);
    server_handle = tcp_listen(&addr, 128);
    if(server_handle == -1) {
        printf("[%i] failed to listen: %i (%i)\n", id, server_handle, errno);
        return;
    }
    while(id == server_id) {
        struct ipaddr client_addr;
        printf("[%i] will accept\n", id);
        int client_s = tcp_accept(server_handle, &client_addr, -1);
        printf("[%i] did accept\n", id);
        if(client_s == -1) {
            printf("[%i] failed to accept: %i (%i)\n", id, client_s, errno);
            return;
        }
        msleep(now() + 2000);
        tcp_close(client_s, -1);
    }
    printf("[%i] server did end\n", id);
}

void testcase(void) {
    int server1 = go(server());
    msleep(now() + 1000);
    // call hclose which calls tcp_close on the tcp server
    printf("will close server1 handle\n");
    hclose(server_handle);
    printf("did close server1 handle\n");
    msleep(now() + 1000);
    // this server will immediately fail with EBUSY due to a zombie event from the first server
    int server2 = go(server());
    msleep(now() + 5000);
    hclose(server1);
    hclose(server2);
}

int main(int argc, const char * argv[]) {
    testcase();
    return 0;
}
You should see that the second server will fail to accept with EBUSY (-16); if you dig deep, that EBUSY happens only because the descriptor number has been reused and libdill's cached state thinks there is already an fdin waiter.
Reproduced with epoll. Thanks!
Ah, I see. We are back to what we discussed at the beginning: server_handle is closed in testcase() while still being used in server(). That's not going to work. File descriptors can't be used as synchronization primitives.
You can of course kill the entire server() coroutine. What's the problem with that approach?
@sustrik I'm not really trying to use them as synchronisation primitives; all I'm trying to do is close a TCP connection that is currently waiting on more data.

I still think this is a bug -- it can happen on client connections too: if you call hclose on the TCP handle (not the file descriptor) while another coroutine is blocked on a read/write of it, you can end up in this situation.
Think of it as "fds are not thread-safe" -- you cannot use the same fd from two coroutines at the same time.
The general pattern is to dedicate a coroutine to each fd, then kill the coroutine when you want to close the fd.
So in your case:
1. Start a TCP server in a coroutine which loops forever accepting new connections
2. Each new connection is run on its own coroutine

I think the solution is to kill the coroutine started in step one instead of trying to close the file descriptor it has created.
Hrm, but you explicitly have stated that the point of fdin and fdout being separated is that one coroutine can be reading while the other is writing, no?

I get what you're saying though, and I know this would be an incredibly hard thing to change/fix in libdill.
Fair enough. in & out are treated as separate half-connections and can indeed be used each in its own coroutine. Unfortunately, the abstraction breaks once you want to close the connection, which is a bi-directional operation (packets going back and forth). Thus, before doing close() you have to ensure that no other coroutine is using the fd.
Ugly. I know.
On the other hand, using the same fd from two coroutines is probably an extremely rare use case. Like when you want a fast bi-directional TCP proxy or somesuch.
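For illustration, here's a rough sketch of that proxy case, assuming both directions go through the bytestream API (brecv/bsend) and that forwarding in fixed-size blocks is acceptable; the forward() coroutine and the shutdown ordering are illustrative, not a prescribed libdill pattern. The point is that the owner cancels both copier coroutines before any tcp_close:

#include <stdint.h>
#include <libdill.h>

/* Copy fixed-size blocks from one tcp socket to the other until an error,
   or until the owner cancels this coroutine with hclose(). Note that brecv
   waits for exactly sizeof(buf) bytes, so this toy forwards 512-byte blocks. */
coroutine void forward(int from, int to) {
    char buf[512];
    while(1) {
        /* Both calls fail with ECANCELED once the owner closes this
           coroutine's handle, so the loop simply exits. */
        if(brecv(from, buf, sizeof(buf), -1) != 0) return;
        if(bsend(to, buf, sizeof(buf), -1) != 0) return;
    }
}

/* Bi-directional proxy between two already-connected tcp sockets a and b. */
void proxy(int a, int b, int64_t duration) {
    int ab = go(forward(a, b));
    int ba = go(forward(b, a));
    msleep(now() + duration);
    /* Stop both copiers first, so nobody is still blocked on a or b... */
    hclose(ab);
    hclose(ba);
    /* ...and only then close the connections themselves. */
    tcp_close(a, -1);
    tcp_close(b, -1);
}

The design choice that matters here is that the owner, not the copier coroutines, closes the sockets, and it does so only after both copiers are gone, so no fd is ever closed while another coroutine is waiting on it.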
@sustrik We use exactly this for our sync connections over a WebSocket-like connection :(. Is there really no possibility that we could add a terminal flag or something to fdclean / dill_pollset_clean to basically avoid this situation?

I can't decide which is uglier! Heh.
If you are using it via the tcp class there should be no problem IMO.
Here's a variant of your test program that works:
#include <assert.h>
#include <stdio.h>
#include "libdill.h"

coroutine void server() {
    struct ipaddr addr;
    int rc = ipaddr_local(&addr, "127.0.0.1", 5566, IPADDR_IPV4);
    assert(rc == 0);
    int server_handle = tcp_listen(&addr, 128);
    assert(server_handle >= 0);
    int c = tcp_accept(server_handle, NULL, -1);
    if(c < 0 && errno == ECANCELED) {
        tcp_close(server_handle, -1);
        return;
    }
    // etc.
}

int main(int argc, const char * argv[]) {
    int server1 = go(server());
    msleep(now() + 1000);
    hclose(server1);
    return 0;
}
@sustrik Let's look at the more common side of things too, with a TCP connection to a remote host. In this example I'm opening a connection and waiting to read data from it when I decide to hclose it.

Only one coroutine is using the client connection, yet we end up with the same descriptor issues. I think at a minimum, if this is expected behaviour that can only be fixed by calling hclose on the coroutine instead, then fd_close should assert if the fdclean inside of it fails, because if fdclean fails and the socket is still closed you're going to end up with this weird state of cross-over events.
#include <stdio.h>
#include <libdill.h>

static int server_handle = 0;
static int server_id = 0;
static int client_handle = 0;
static int client_id = 0;

coroutine void server() {
    int id = ++server_id;
    printf("[%i] server did start\n", id);
    struct ipaddr addr;
    ipaddr_local(&addr, "127.0.0.1", 5566, IPADDR_IPV4);
    server_handle = tcp_listen(&addr, 128);
    if(server_handle == -1) {
        printf("[%i] failed to listen: %i (%i)\n", id, server_handle, errno);
        return;
    }
    while(id == server_id) {
        struct ipaddr client_addr;
        printf("[%i] will accept\n", id);
        int client_s = tcp_accept(server_handle, &client_addr, -1);
        printf("[%i] did accept\n", id);
        if(client_s == -1) {
            printf("[%i] failed to accept: %i (%i)\n", id, client_s, errno);
            return;
        }
        msleep(now() + 2000);
        tcp_close(client_s, -1);
    }
    printf("[%i] server did end\n", id);
}

coroutine void client() {
    int id = ++client_id;
    printf("[%i] client did start\n", id);
    struct ipaddr addr;
    ipaddr_local(&addr, "127.0.0.1", 5566, IPADDR_IPV4);
    client_handle = tcp_connect(&addr, -1);
    if(client_handle == -1) {
        printf("[%i] failed to connect: %i (%i)\n", id, client_handle, errno);
        return;
    }
    char buf[2048];
    while(id == client_id) {
        printf("[%i] will recv\n", id);
        int rc = brecv(client_handle, buf, 2048, -1);
        printf("[%i] did recv\n", id);
        if(rc == -1) {
            printf("[%i] failed to recv: %i (%i)\n", id, rc, errno);
            return;
        }
        msleep(now() + 2000);
        tcp_close(client_handle, -1);
    }
    printf("[%i] client did end\n", id);
}

void testcase(void) {
    int server1 = go(server());
    msleep(now() + 1000);
    int client1 = go(client());
    msleep(now() + 1000);
    // call hclose which calls tcp_close on the tcp client connection
    printf("will close client1 handle\n");
    hclose(client_handle);
    printf("did close client1 handle\n");
    msleep(now() + 1000);
    int client2 = go(client());
    msleep(now() + 5000);
}

int main(int argc, const char * argv[]) {
    testcase();
    return 0;
}
I've already added the assert to fd_close().
However, if you are not using the same fd from two coroutines at a time you should not see any problems.
This, for example, looks like it's working fine:
#include <assert.h>
#include <stdio.h>
#include "libdill.h"

coroutine void server() {
    struct ipaddr addr;
    int rc = ipaddr_local(&addr, "127.0.0.1", 5566, IPADDR_IPV4);
    assert(rc == 0);
    int server_handle = tcp_listen(&addr, 128);
    assert(server_handle >= 0);
    int c = tcp_accept(server_handle, NULL, -1);
    if(c < 0 && errno == ECANCELED) {
        tcp_close(server_handle, -1);
        return;
    }
    char buf[256];
    rc = brecv(c, buf, sizeof(buf), -1);
    if(rc < 0 && errno == ECANCELED) {
        tcp_close(server_handle, -1);
        tcp_close(c, -1);
        return;
    }
    assert(rc == 0);
}

int main(int argc, const char * argv[]) {
    int server1 = go(server());
    struct ipaddr addr;
    int rc = ipaddr_local(&addr, "127.0.0.1", 5566, IPADDR_IPV4);
    assert(rc == 0);
    int c = tcp_connect(&addr, -1);
    assert(c >= 0);
    msleep(now() + 1000);
    hclose(server1);
    msleep(now() + 1000);
    return 0;
}
Yea, so it's unsafe to close handles that have waiters; instead you either have to wait for the waiters to be done or close the enclosing coroutine.
Correct.
I'll close this for now; I'm going to do a bit more playing around and see why this never burnt us in libmill but does now with libdill.
We're on Ubuntu 16.04 using mostly vanilla libdill with epoll. We've got an issue where, if a coroutine is waiting on a file descriptor (in our case it's waiting on fdin via fd_accept) and another coroutine invokes a routine that leads to closing the descriptor via fd_close, some lingering pollset waiters are left around waiting.

We're having an issue with epoll in that, because fdclean won't clean a blocked file descriptor, the event handlers never actually fire even after the socket is closed. A subsequent piece of our code creates a new TCP server, the operating system issues it an identical file descriptor number to the previously closed descriptor, and at that point the epoll handlers fire, but the notifications are sent to waiters on the new descriptor even though the events are still for the old descriptor. This mostly seems like a nasty bug in epoll more than libdill, but nonetheless it creates a pretty bad situation of getting zombie notifications for a previously closed descriptor.

Is there any reason fd_close can't forcefully fdclean a descriptor or in some way attempt to mark a generation of "new" events?
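Purely as an illustration of the "generation" idea in that last question (this is not libdill's actual implementation; the fd_slot type and helpers are made up): each cached pollset entry could carry a counter that is bumped whenever the descriptor is cleaned or closed, so events recorded against an older generation can be dropped instead of being delivered to a new socket that happens to reuse the same fd number.

#include <stdint.h>

/* Hypothetical per-descriptor cache entry. */
struct fd_slot {
    int fd;
    uint64_t generation;   /* bumped on every clean/close of this fd number */
};

/* Record the current generation when a waiter subscribes... */
static uint64_t fd_subscribe(const struct fd_slot *slot) {
    return slot->generation;
}

/* ...and drop any event whose recorded generation no longer matches,
   i.e. an event that belongs to a previously closed descriptor. */
static int fd_event_is_stale(const struct fd_slot *slot, uint64_t event_gen) {
    return event_gen != slot->generation;
}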