pubsub: messages aren't always received

dwlnetnl commented 8 years ago

It seems that mangos SUB sockets do not behave the same as nn_sub sockets.

When I have only a single mangos SUB socket connected to a PUB, it works.
When I connect multiple mangos SUB sockets, only the first SUB gets messages.
When I connect one or more nn_sub sockets (nanocat), all of those get the messages.

This might be the same issue as reported in the last comment of #163.

dwlnetnl commented 8 years ago

Maybe this behaviour has to do with starting the subscribers before the publisher. I'm using TCP sockets.

gdamore commented 8 years ago

Order of start shouldn't matter, provided you wait long enough for TCP sessions to establish. If you're starting the side that does Listen after the side that does Dial, this can take quite a while: in the very worst case (if the connect attempts have been failing for a while), up to a minute.

The way it is supposed to work is that dialers retry a failed connect after 100 msec by default. Each time they retry, they double the time to wait until the next retry, up to a maximum of 1 minute. I think it should take 1-2 minutes to ratchet up to this longer time.
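
To make the arithmetic concrete, here is a minimal Go sketch of the doubling schedule described above. It is not the mangos implementation; `initialDelay`, `maxDelay`, `nextDelay` and `timeToReachMax` are names made up for this illustration, and only the 100 msec and 1 minute values come from the comment.

```go
package main

import (
	"fmt"
	"time"
)

// Values quoted in the comment above; the names are hypothetical.
const (
	initialDelay = 100 * time.Millisecond
	maxDelay     = time.Minute
)

// nextDelay doubles the current delay and clamps the result to max.
func nextDelay(cur, max time.Duration) time.Duration {
	next := cur * 2
	if next > max {
		return max
	}
	return next
}

// timeToReachMax reports how many waits happen, and how much time they
// add up to, before the delay first reaches max.
func timeToReachMax(initial, max time.Duration) (waits int, elapsed time.Duration) {
	if initial <= 0 {
		return 0, 0 // a non-positive delay would never grow
	}
	for d := initial; d < max; d = nextDelay(d, max) {
		elapsed += d
		waits++
	}
	return waits, elapsed
}

func main() {
	n, total := timeToReachMax(initialDelay, maxDelay)
	fmt.Printf("%d waits, %v elapsed before the delay reaches %v\n", n, total, maxDelay)
}
```

With these values the sketch reports 10 waits totalling 1m42.3s, which is consistent with the 1-2 minute estimate.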

Probably this can be made tunable -- right now they are fixed values in core.go, see reconntime and reconnmax. I'd be happy with a PR that made these tunable. I also think the full minute timeout as a default is probably overkill; somewhere between 10-30 seconds feels more appropriate. Also, the timeouts could increase at a linear rather than geometric rate, e.g. by adding 100 msec each time instead of doubling. It's not immediately clear to me what the best approach here is. (The algorithm could be made tunable by adding two factors, a multiplier and an addend. Right now the multiplier is 2 and the addend is 0.)
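
The multiplier/addend idea could look roughly like the following Go sketch. This is only an illustration of the proposal, not code from mangos: the `backoffPolicy` type, its fields, and the `geometric` and `linear` values are all hypothetical, and the 30 second cap on the linear policy is just one point in the 10-30 second range suggested above.

```go
package main

import (
	"fmt"
	"time"
)

// backoffPolicy is a hypothetical type for this sketch; it is not part of mangos.
type backoffPolicy struct {
	Multiplier int64         // factor applied to the current delay
	Addend     time.Duration // amount added after multiplying
	Max        time.Duration // upper bound on the delay (assumed > 0)
}

// next returns the delay to use after a failed attempt that waited cur.
func (p backoffPolicy) next(cur time.Duration) time.Duration {
	d := cur*time.Duration(p.Multiplier) + p.Addend
	if d > p.Max {
		return p.Max
	}
	return d
}

var (
	// Current behaviour as described: multiplier 2, addend 0, 1 minute cap.
	geometric = backoffPolicy{Multiplier: 2, Addend: 0, Max: time.Minute}
	// Linear alternative: add 100 msec each time, with an assumed 30 second cap.
	linear = backoffPolicy{Multiplier: 1, Addend: 100 * time.Millisecond, Max: 30 * time.Second}
)

func main() {
	g, l := 100*time.Millisecond, 100*time.Millisecond
	for i := 1; i <= 5; i++ {
		fmt.Printf("retry %d: geometric=%v linear=%v\n", i, g, l)
		g, l = geometric.next(g), linear.next(l)
	}
}
```

Expressing both schedules through one `next` method means a single pair of settings covers the doubling and the linear variants.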

gdamore commented 8 years ago

All the above aside, if you still see the problem after you've seen the TCP sessions establish (e.g. after waiting a couple of minutes), then that would represent a bug in mangos. In that case I'd appreciate a test case.

dwlnetnl commented 8 years ago

The strange thing for me is that it doesn't work (mangos) when I start the PUB literally after the SUBs. But I will look with Wireshark to diagnose the problem in more detail.

gdamore commented 8 years ago

You still have to wait for the TCP sessions to establish. The minimum wait time is 100 msec if you start the SUBs (which are normally the Dialers) first. Probably you should wait more like 200-500 msec.
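
The effect can be reproduced without mangos. The sketch below uses only the standard `net` package: a dialer starts before anything is listening, its first attempt fails, and it connects only after waiting out a retry interval. `dialWithRetry` and `demo` are hypothetical helpers written for this illustration, and the fixed 100 msec interval is an assumption mirroring the minimum wait mentioned above; this is not how mangos implements its dialers.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// dialWithRetry tries to connect up to maxAttempts times, sleeping for
// interval after each failure. It returns the connection and the number
// of attempts that were made.
func dialWithRetry(addr string, interval time.Duration, maxAttempts int) (net.Conn, int, error) {
	var lastErr error
	for i := 1; i <= maxAttempts; i++ {
		conn, err := net.DialTimeout("tcp", addr, time.Second)
		if err == nil {
			return conn, i, nil
		}
		lastErr = err
		time.Sleep(interval)
	}
	return nil, maxAttempts, lastErr
}

// demo starts the dialing side first and the listening side 50 msec later,
// then reports how many dial attempts were needed.
func demo() (int, error) {
	// Reserve a free loopback port, then release it so early dials fail.
	probe, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	addr := probe.Addr().String()
	probe.Close()

	type result struct {
		attempts int
		err      error
	}
	done := make(chan result, 1)
	go func() {
		conn, n, err := dialWithRetry(addr, 100*time.Millisecond, 20)
		if conn != nil {
			conn.Close()
		}
		done <- result{n, err}
	}()

	time.Sleep(50 * time.Millisecond) // the listening side starts late
	ln, err := net.Listen("tcp", addr)
	if err != nil {
		return 0, err
	}
	defer ln.Close()

	r := <-done
	return r.attempts, r.err
}

func main() {
	attempts, err := demo()
	fmt.Println("attempts:", attempts, "err:", err)
}
```

Anything the listening side sends before that retry succeeds has no connection to travel over, which is why a publisher started after its subscribers should pause for at least one retry interval before publishing.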

dwlnetnl commented 8 years ago

I've looked at the nanomsg source and indeed the minimum wait is by default 100 ms, but the maximum wait is 0. In mangos the maximum wait is 1 second. Personally I'd like to see the same values in the two implementations to minimise surprises.

But when I change the maximum wait to time.Duration(0), the reconnect packets seem to be buffered. The messages from nanomsg are every 100ms.

If you'd like I can supply pcap files.

dwlnetnl commented 8 years ago

The pull request is a rough source change. Docs are not changed and the option to change NN_RECONNECT_IVL and NN_RECONNECT_IVL_MAX isn't there. It's no problem to add that.

gdamore commented 8 years ago

So, with the new behavior, addressing the TCP slow start problem, do you still have problems with pub/sub receiving messages? If not, let's close this bug.

dwlnetnl commented 8 years ago

No, I don't have any problems anymore. Thanks for the additional coding of the options, great!

nanomsg / mangos-v1

pubsub: messages aren't always received #164