nanomsg / nng

nanomsg-next-generation -- light-weight brokerless messaging
https://nng.nanomsg.org
MIT License
3.78k stars 485 forks source link

macOS - fix deadlock on reqrep socket close #1824

Closed alzix closed 5 months ago

alzix commented 5 months ago

when an aio has no a_cancel_fn and the task is in task_prepstate abort it on nni_aio_stop call

fixes #1813 Deadlock during nng_close() - multi platform

gdamore commented 5 months ago

This looks like a good find. Furthermore, it looks like nni_aio_abort also suffers from the same flaw.

I want to look at this in more detail later today before I move forward.

gdamore commented 5 months ago

I think I've convinced myself that this is precisely the right fix, and we just need to add the same change to the implementation of nni_aio_abort.

alzix commented 5 months ago

i'm on it

gdamore commented 5 months ago

So there are other callers.

Basically we also need the same logic in nni_aio_cancel, and I think nni_aio_fini. It looks like it was missed in all the paths where we tear down or abort an aio.

alzix commented 5 months ago

it does not solve the Windows issue though :( On Windows the issue seems to be different image

it seems that on Windows it is a TOCTOU issue in the nni_sock_shutdown

it is stuck on

// We have to wait for pipes to be removed.
while (!nni_list_empty(&sock->s_pipes)) {
=>  nni_cv_wait(&sock->s_cv);
}

while s_pipes is already empty image

the issue predates the fix - so it is better to address it in another PR

gdamore commented 5 months ago

I'm merging this... the hang waiting for pipes to be empty feels like a missed cv_wake somewhere. I'll look for it later. I'm out of time for today.

alzix commented 5 months ago

created a new ticket for this #1827