Closed pomaroff closed 11 months ago
FYI, I have just had the problem occur with nng_msleep(1) as well, so evidently this isn't quite enough to work around this problem.
Yeah, I'd guess msleep 10 would be better. :-)
Due to the way Windows IPC works, some operations run asynchronously under the hood, which makes it hard to be fully synchronous. TCP (and IPC on UNIX systems) doesn't suffer from this.
I'll look again to see if there are things I can do to make this better. The entire IPC mechanism for Windows is a strange beast, as it is entirely unlike UNIX domain sockets.
dial also always runs more or less asynchronously, even on TCP and UNIX sockets, because of the way NNG is architected. There isn't really a great solution to this, as at present there is no two-way handshake. We could add a complete two-way handshake, but that would make the setup take a little longer, and it would break compatibility (which is the real reason we haven't done it yet).
It is perhaps an unusual situation where complete synchronization of the dial operation is desirable. However, if it's possible to implement a complete two-way handshake, perhaps we could avoid breaking compatibility by only doing this when a different flag is passed to the dial function?
The problem is that any way you do it, you have to implement changes to the wire protocol.
I want to fix this someday, but it requires a protocol update. Not going to happen this week. :-). And probably not the next either. :-)
It is possible to implement this synchronization in your application protocol, btw. I just can't do it at the NNG / SP protocol level without breaking compatibility for other SP applications.
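As a rough illustration of what application-level synchronization could look like, here is a minimal sketch, not anything NNG provides. It assumes the publisher keeps publishing a small "SYNC" message until each subscriber has answered "READY" on a separate channel. The marker strings, the `recv_fn`/`send_fn` hooks, `wait_until_subscribed`, and the in-memory `fake_link` stand-in are all hypothetical names made up for this example. In a real program the receive hook would presumably wrap `nng_recv` on the sub0 socket with a receive timeout set, and the send hook would wrap `nng_send` on a second socket (for example a req0 socket), since pub/sub itself is one-way.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical transport hooks. Each returns 0 on success and nonzero on
 * timeout or error. In an NNG application they would wrap nng_recv() on the
 * subscriber socket and nng_send() on a separate reply channel. */
typedef int (*recv_fn)(void *ctx, char *buf, size_t cap, size_t *len);
typedef int (*send_fn)(void *ctx, const char *buf, size_t len);

#define SYNC_MARKER  "SYNC"   /* assumed marker published repeatedly */
#define READY_MARKER "READY"  /* assumed acknowledgement from the subscriber */

/* Poll the subscription until the first SYNC marker arrives, then
 * acknowledge it. Returns true only if SYNC was seen and READY was sent
 * within max_attempts receive attempts. */
bool wait_until_subscribed(recv_fn rx, void *rx_ctx,
                           send_fn tx, void *tx_ctx, int max_attempts)
{
    char   buf[64];
    size_t len;

    for (int i = 0; i < max_attempts; i++) {
        len = 0;
        if (rx(rx_ctx, buf, sizeof buf, &len) != 0) {
            continue; /* nothing received yet: the pipe may not be up */
        }
        if (len == strlen(SYNC_MARKER) &&
            memcmp(buf, SYNC_MARKER, len) == 0) {
            /* The subscription is demonstrably live; tell the publisher. */
            return tx(tx_ctx, READY_MARKER, strlen(READY_MARKER)) == 0;
        }
        /* Any other message is ignored while synchronizing. */
    }
    return false;
}

/* In-memory stand-in for the two channels, used only for the demo below:
 * the first drops_left receive attempts time out, then SYNC is delivered. */
struct fake_link {
    int drops_left;
    int ready_sent;
};

static int fake_rx(void *ctx, char *buf, size_t cap, size_t *len)
{
    struct fake_link *link = ctx;
    size_t            n    = strlen(SYNC_MARKER);

    if (link->drops_left > 0) {
        link->drops_left--;
        return -1; /* simulate a receive timeout */
    }
    if (n > cap) {
        return -1;
    }
    memcpy(buf, SYNC_MARKER, n);
    *len = n;
    return 0;
}

static int fake_tx(void *ctx, const char *buf, size_t len)
{
    struct fake_link *link = ctx;
    (void) buf;
    (void) len;
    link->ready_sent++;
    return 0;
}

/* Returns 1 if the handshake completed and exactly one READY was sent. */
int demo_handshake(int drops, int max_attempts)
{
    struct fake_link link = { drops, 0 };
    bool ok = wait_until_subscribed(fake_rx, &link, fake_tx, &link,
                                    max_attempts);
    return ok && link.ready_sent == 1;
}
```

With this shape the publisher only starts sending real data once it has collected READY from the subscribers it expects, so the application no longer depends on nng_dial returning at exactly the moment the subscription becomes live. The cost is an extra channel and a short startup delay.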
I'm going to close this for now. It's not something I'm going to do until I refactor the entire wire protocol.
Describe the bug
Using nng_dial synchronously (without the flag NNG_FLAG_NONBLOCK) for a sub0 socket can return success before the underlying socket is actually ready to receive published messages. This means that a message published shortly after the nng_dial function has returned may not be received.
Expected behavior
I would expect that if nng_dial is called synchronously (without the flag NNG_FLAG_NONBLOCK) and it returns a value of 0, then the socket would be ready to receive published messages.
Actual Behavior
In a scenario where a published message is expected shortly after nng_dial has returned, that message may never be received.
To Reproduce
See this modified demo: pubsub.zip
The error can be reproduced using these command-lines:
I would expect this demo to run indefinitely, however, on my Windows system it is stopping after a short time (<5 minutes).
Environment Details
Additional context
Uncommenting line 100 and thus introducing a short nng_msleep(1) after nng_dial seems to make it significantly more robust (to the point where I haven't yet reproduced the behaviour with this sleep).
The problem also seems to occur with TCP/IP, although I have not tested this extensively.