nanomsg / nng

nanomsg-next-generation -- light-weight brokerless messaging
https://nng.nanomsg.org
MIT License
3.85k stars 492 forks source link

nng_listen occasionally fails when reusing ipc socket on Windows #1175

Open jeikabu opened 4 years ago

jeikabu commented 4 years ago

NNG & Platform details.

nng v1.2.4 Win10

Expected Behavior

After calling nng_close() on a socket listening on "ipc://", can immediately open another socket and listen on same pipe.

Actual Behavior

Works on Linux. On Windows, nng_listen() fails with EADDRINUSE after 10-500x iterations.

Steps to Reproduce

int
main(int argc, char** argv) {
    nng_socket socket0, socket1;
    int i;
    const char* url = "ipc://test_socket";
    for (int i = 0; i < 1000; i++) {
        assert(nng_pair0_open(&socket0) == 0);
        assert(nng_listen(socket0, url, NULL, 0) == 0);
        assert(nng_pair0_open(&socket1) == 0);
        assert(nng_dial(socket1, url, NULL, 0) == 0);
        assert(nng_close(socket1) == 0);
        assert(nng_close(socket0) == 0);
    }

    return 0;
}
gdamore commented 4 years ago

I'm assuming you're using actual Windows and not WSL.

IPC on Windows is implemented using named pipes. I wonder if the OS has some limitation here -- I don't think I'm doing anything particularly asynchronously -- but I will look into it.

gdamore commented 4 years ago

I suspect that what is happening is that while I'm calling DisconnectNamedPipe -- you might have clients on the far side that are still holding that point in the "namespace" active. This is different from how it works on UNIX or with TCP. If this is actually what is happening, I'm not sure I'll be able to do much about it.

gdamore commented 4 years ago

Docs about this are here: https://docs.microsoft.com/en-us/windows/win32/api/namedpipeapi/nf-namedpipeapi-disconnectnamedpipe

Essentially you have to wait for the clients to see the pipe error, and close the pipe.

This does mean that a malicious client can hold the pipe indefinitely, and prevent future server instances from listening. This is a design defect in Named Pipes in Windows, IMO. (But again, Windows wasn't designed for multi-user use, really, and this API reflects that -- mostly processes are assumed to be cooperative and not actively hostile.)

jeikabu commented 4 years ago

Yes, this is actual Windows, not WSL. Tried looking into this and this seems to be what's going on:

Under normal circumstances, e.g. pair1_pipe_recv_cb receives an aio error and calls nni_pipe_close.(). nni_pipe_close() kicks off nni_reap(&p->p_reap, (nni_cb) pipe_destroy, p) and when it gets into ipctran_pipe_fini(), refcnt drops to 0. Then nni_listener_reap() -> nni_listener_destroy() and ipctran_ep_fini() cleans everything up.

Things go awry if nng_close() kicks off nni_reap(&l->l_reap, (nni_cb) nni_listener_reap, l) before pipe_destroy. In this case when it gets to ipctran_ep_fini, refcnt != 0 and things aren't cleaned up. Next iteration around the loop CreateNamedPipe() fails with EADDRINUSE.

In the above sample, you can prevent this by delaying nng_close() with nng_msleep(10) or whatever.

gdamore commented 4 years ago

Your last comment is relevant. I'll have to think hard about properly fixing this safely -- it probably means we need to change the way we tear down the listener -- these pipes probably need to add to the reference count on the listener, so that the listener isn't fully discarded until the last pipe closes.

This is quite unlike POSIX, where the service address (the file name) is only used at rendezvous, and the two sockets can happily stay connected without impacting the ability to delete and create the socket in the file system.

I might not get to this for 1.3, but only because I'm trying hard to get a 1.3 out the door soonish.

gdamore commented 2 years ago

Closing both sockets should clean this up properly. I'm still trying to figure out what resource is left behind.

gdamore commented 2 years ago

Actually I may have found it... the conn_aio should have been closed too.

gdamore commented 2 years ago

I've published what I think might be a fix for this (I can see circumstances where possibly this impacts TCP as well btw.)

Please have a look at my fix1175 branch and give it a shot. It's not easy for me to test Windows right this second (because I don't have any decent toolchain for Windows ARM).

alzix commented 2 years ago

tested the changes proposed in #1562 on Windows 10 x64 - the issue persists