nanomsg / nng

nanomsg-next-generation -- light-weight brokerless messaging
https://nng.nanomsg.org
MIT License

nng_dial for sub0 called synchronously is returning before the socket is ready to receive messages #1644

Closed pomaroff closed 11 months ago

pomaroff commented 1 year ago

Describe the bug

Using nng_dial synchronously (without the flag NNG_FLAG_NONBLOCK) for a sub0 socket can return success before the underlying socket is actually ready to receive published messages. This means that a message published shortly after the nng_dial function has returned may not be received.

Expected behavior

I would expect that if nng_dial is called synchronously (without the flag NNG_FLAG_NONBLOCK) and it returns a value of 0, then the socket would be ready to receive published messages.

Actual Behavior

In a scenario where a published message is expected shortly after nng_dial has returned, that message may never be received.

To Reproduce

See this modified demo: pubsub.zip. The error can be reproduced using these command lines:

pubsub.exe server ipc://./pipe/test
pubsub.exe client ipc://./pipe/test

I would expect this demo to run indefinitely; however, on my Windows system it stops after a short time (under 5 minutes).

Environment Details

Additional context

Uncommenting line 100 of the demo, which introduces a short nng_msleep(1) after nng_dial, seems to make it significantly more robust (to the point where I haven't yet reproduced the behaviour with this sleep in place).

The problem also seems to occur with TCP/IP, although I have not tested this extensively.

pomaroff commented 1 year ago

FYI I have just had the problem occur with nng_msleep(1) as well, so evidently this isn't quite enough to work around the problem.

gdamore commented 1 year ago

Yeah, I'd guess msleep 10 would be better. :-)

Due to the way Windows IPC works, there are some asynchronicity things going on under the hood that make it hard to be fully synchronous. TCP (and IPC on UNIX systems) doesn't suffer this.

I'll look again to see if there are things I can do to make this better. The entire IPC mechanism for Windows is a strange beast, as it is entirely unlike UNIX domain sockets.

gdamore commented 1 year ago

dial also always runs more or less asynchronously, even on TCP and UNIX sockets -- because of the way NNG is architected. There isn't really a great solution to this, as at present there is no two-way handshake. We could add a complete two way handshake, but that would make the setup take a little longer, and it would break compatibility (which is the real reason we haven't done it yet).

pomaroff commented 1 year ago

It is perhaps an unusual situation where it is desirable to have complete synchronization for the dial operation. However, if it's possible to implement a complete two-way handshake, perhaps we could avoid breaking compatibility by only doing this when a different flag is passed to the dial function?

gdamore commented 1 year ago

The problem is that any way you do it, you have to implement changes to the wire protocol.

I want to fix this someday, but it requires a protocol update. Not going to happen this week. :-). And probably not the next either. :-)

It is possible to implement this synchronization in your application protocol btw. I just can't do it at the NNG / SP protocol level without breaking compatibility for other SP applications.
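
For readers wondering what application-level synchronization might look like, here is a minimal sketch of the subscriber-side logic. It is not taken from the thread or from NNG. It assumes a hypothetical convention in which the publisher keeps publishing a short "sync" beacon until the subscriber reports, over some separate channel, that it has seen one. The names SYNC_PREFIX, sub_state, sub_action and sub_sync_step are all invented for illustration, and the code is deliberately transport-agnostic so that it compiles without NNG.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical beacon prefix: an application-level convention,
 * not something defined by NNG or the SP protocols. */
#define SYNC_PREFIX "sync"

typedef enum { SUB_WAITING, SUB_READY } sub_state;

typedef enum {
    ACT_IGNORE,     /* nothing to do with this message */
    ACT_SEND_READY, /* first beacon seen: tell the publisher we are ready */
    ACT_DELIVER     /* application data: hand it to the caller */
} sub_action;

/* Returns nonzero if the message starts with the beacon prefix. */
static int is_beacon(const char *msg, size_t len)
{
    size_t n = strlen(SYNC_PREFIX);
    return len >= n && memcmp(msg, SYNC_PREFIX, n) == 0;
}

/* Advance the handshake by one received message and report what the
 * caller should do next. The first beacon proves that the subscribe
 * path is live end to end, which nng_dial returning 0 does not. */
sub_action sub_sync_step(sub_state *state, const char *msg, size_t len)
{
    assert(state != NULL && msg != NULL);

    if (is_beacon(msg, len)) {
        if (*state == SUB_WAITING) {
            *state = SUB_READY;
            return ACT_SEND_READY;
        }
        return ACT_IGNORE; /* late beacons after the handshake are harmless */
    }
    return ACT_DELIVER; /* anything else is treated as application data */
}
```

In a real subscriber, one would presumably call nng_dial, subscribe to both the beacon prefix and the data topics with NNG_OPT_SUB_SUBSCRIBE, set NNG_OPT_RECVTIMEO, and then feed each message returned by nng_recv to sub_sync_step. On ACT_SEND_READY the subscriber would notify the publisher through a second socket (a req0/rep0 pair is one option), after which the publisher can stop sending beacons and start publishing data it cannot afford to lose.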

gdamore commented 11 months ago

I'm going to close this for now. It's not something I'm going to do until I refactor the entire wire protocol.