nanomsg / nng

nanomsg-next-generation -- light-weight brokerless messaging
https://nng.nanomsg.org
MIT License
3.63k stars 472 forks source link

Windows a deadlock on `nng_close()` #1827

Closed alzix closed 1 month ago

alzix commented 2 months ago

Describe the bug When calling nng_close() there is sometimes a deadlock which causes nng_close() to hang. This happens also when using only sync APIs (no AIOs).

nng_close should finish successfully.

Actual Behavior

nng_close() hangs.

To Reproduce

I created a modified version of the reqrep example code here (I use it with IPC transport): https://gist.github.com/mikisch81/428c4ad87afcc1c8881b282cd5e80eb3

In the modified example in the client code right before calling nng_recv for the reply from the server I start a thread which just calls nng_close, after a couple of successful runs the deadlock happen:

mischwar@tlv-mpawy reqrep % ./reqrep client ipc:///tmp/reqrep_test 1712848018 - CLIENT: SENDING DATE REQUEST 1712848018 - CLIENT: WAITING FOR REPLY 1712848018 - CLIENT: CLOSING SOCKET nng_recv error - Object closed ... ... 1712848018 - CLIENT: SENDING DATE REQUEST 1712848018 - CLIENT: WAITING FOR REPLY 1712848018 - CLIENT: CLOSING SOCKET nng_recv error - Object closed 1712848018 - CLIENT: SENDING DATE REQUEST 1712848018 - CLIENT: WAITING FOR REPLY 1712848018 - CLIENT: CLOSING SOCKET. <--- The thread is suck in nng_close here Environment Details

NNG version: 1.7.3 Operating system and version: MacOS Sonoma 14.4.1 (but also happens on Windows and Linux) Compiler and language used: C/C++ clang Shared or static library - static

Additional context image

it seems that on Windows it is a TOCTOU issue in the nni_sock_shutdown

it is stuck on

// We have to wait for pipes to be removed.
while (!nni_list_empty(&sock->s_pipes)) {
=>  nni_cv_wait(&sock->s_cv);
}

while s_pipes is already empty image

leowang552 commented 2 months ago

image

SHA-1: aac4dc360faca9ccece487ba16371d20d90e6406

node->**ln_prev** is 0xFFFFFFFFFFFFFFFF

https://gist.github.com/mikisch81/428c4ad87afcc1c8881b282cd5e80eb3 server side vs 2022 debug version sdk 10.0.19041.0 64bit win 10

leowang552 commented 2 months ago
void SetThreadsNng()
{
    nng_init_set_parameter(NNG_INIT_MAX_TASK_THREADS, 1);
    nng_init_set_parameter(NNG_INIT_MAX_EXPIRE_THREADS, 1);

    nng_init_set_parameter(NNG_INIT_MAX_POLLER_THREADS, 1);
    nng_init_set_parameter(NNG_INIT_NUM_RESOLVER_THREADS, 1);
}

add above code for easy debugging

alzix commented 2 months ago

I also reproduced the crash reported by @leowang552 a couple of times

gdamore commented 2 months ago

The crash is a different issue... it's a sign that we're trying to remove from a list when already not on the list.

Let's open another ticket for it. (And I'd like to know whether that happens only on IPC on Windows, or does it happen either on other platforms or for non-IPC connections -- because the Windows named pipe IPC is totally unlike the other transports.)

gdamore commented 2 months ago

As to this bug, I have a PR out that I'd appreciate it if folks could try out.

leowang552 commented 2 months ago

I see the bug not easily reproduced if the computer performance is high. Still not reproduced after running 1 hour. This PC: 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz 64G RAM

alzix commented 2 months ago

hi @gdamore, i've tested the proposed changes from #1828 92fd162 and it seems to not solve the issue

image image

gdamore commented 2 months ago

Does this problem only happen with Windows IPC?

gdamore commented 2 months ago

@alzix it seems you actually have an item on the pipes list. This is different than the reported issue before that the pipe list was empty.

Again, I am wondering if this is exclusive to the IPC dialer.

gdamore commented 2 months ago

Ah I see the original problem was reported on macos.

alzix commented 2 months ago

In my project I use nng for IPC this I cannot provide any input on other protocols. I noticed the stuck frame was different on my last check - but the check was based on #1828 pr. Perhaps it solved once issue but uncovered another. Please notice that reqrep server was used for test, this the problem is probably in listener and not in the dialer

gdamore commented 2 months ago

Interesting. Ok, I have pushed another change to the same PR, which attempts to fix a possible issue if the pipes are closed during negotiation (which likely is happening here.)

If you can try that PR branch again, I'd be grateful @alzix .

alzix commented 2 months ago

did - left my insights in PR comments