nanomsg / nng

nanomsg-next-generation -- light-weight brokerless messaging
https://nng.nanomsg.org
MIT License
3.63k stars 472 forks source link

Gdamore/win io lock #1831

Closed gdamore closed 1 month ago

gdamore commented 2 months ago

This is a trial run, and it may or may not work.

codecov[bot] commented 2 months ago

Codecov Report

Attention: Patch coverage is 81.81818% with 2 lines in your changes are missing coverage. Please review.

Project coverage is 79.50%. Comparing base (aac4dc3) to head (a5f1a8f).

Files Patch % Lines
src/sp/transport/tcp/tcp.c 0.00% 1 Missing :warning:
src/sp/transport/tls/tls.c 0.00% 1 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #1831 +/- ## ========================================== + Coverage 79.32% 79.50% +0.17% ========================================== Files 95 95 Lines 21491 21484 -7 ========================================== + Hits 17048 17081 +33 + Misses 4443 4403 -40 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

gdamore commented 1 month ago

Hi @alzix - obviously I didn't get a chance to try building this on Windows myself (except via CI/CID) before I ran out of time. I'll fix these .. .but if you've already done so, can you tell me if the approach at least resolves the problems you've been observing?

alzix commented 1 month ago

no i still got a crash image image

KnappSas commented 1 month ago

For the sake of completeness I want to mention that the reqrep code provided by @alzix also sometimes crashes in nni_stat_inc. It also looks like the code still hangs sometimes.

alzix commented 1 month ago

I also saw the crash in the nni_stat_inc especially when running with a debugger and stopping threads for some time. IMO we better concentrate on this critical issue first. Statistics can be compiled out if it cause problems. @KnappSas - can you please open a different issue for it - so it won't get forgotten

mikisch81 commented 1 month ago

HI @gdamore, I also tried this branch on a Windows machine and got the same crash: Screenshot 2024-05-13 at 3 44 57 PM

It seems that when calling nni_list_node_remove(&aio->a_prov_node) than a_prov_node has the next&prev pointers point to free or uninitialized memory, I wasn't able to understand where is this node gets initialized/freed.

It is very easy to reproduce as it happens almost immediately when running the modified reqrep demo.

itayzafrir commented 1 month ago

Hi @gdamore and all.

I've been looking into this and it seems as if the pipe is being deallocated/freed (in ipc_pipe_fini) but after that a call to ipc_send_cb is received (from win_io_handler). Inside ipc_send_cb the nni_aio pointer has the same address of the tx_aio member of ipc_pipe. Since tx_aio is a by value member and since the the ipc_pipe has been deallocated then this pointer is a dangling pointer.

image

Also worth mentioning that sometimes, not very often though there's no crash but a deadlock, where both client and server are waiting on signals.

gdamore commented 1 month ago

THanks -- I've been looking at this in spare cycles, but very few such cycles exist, which is why it's taking time. I'll try to get to it in a day or so.

gdamore commented 1 month ago

So the design is meant to ensure that all callbacks are done, and the pipe is totally unregistered before we freed it. I've missed something. I'm going to resolve this one way the other this weekend.

gdamore commented 1 month ago

CLosing I merged a different branch with different approach.