Closed by lailoken 1 month ago
PS: The knock-on effect of this was also that PING/PONGs were not processed, sending us down incorrect rabbit holes.
It sounds like you found a gnarly problem with the Windows plat stuff... it might be that what you're doing also creates a different problem in the scenario where we are cleaning up a dead socket on *nix.
Since I don't recall any outstanding problems on *nix, I pushed a patch on main that applies your change only on Windows and leaves it as-is for *nix.
Thanks! Will test using your patch instead, glad I could help!
Greetings,
In our scenario (and only on our Windows builds, Linux seems fine) we connect to dozens of servers at the same time. In the case where (about) half of these never connect (where the server is down etc.) all processing seems to stop for a few seconds, causing cascading timeouts in other connections.
Thus no scheduled callbacks are called during this time, nor are any other sockets processed (opened, closed, recv, send).
While debugging we noticed repeated calls in code-net/service.c, in this case statement around line 770:
So the connection has never connected (already died?) and because this returns a 1, the caller function assumes the connection is closing (it is not; it never connected) and retries the slot in the windows-specific code in plat/windows/windows-service.c:
This has been the cause of hours of frustration and has rendered our Windows platform unreliable. I am not sure what the correct fix is, but changing the service.c code to the following brought our Windows platform to parity with our Linux one:
(Changed the return 1 to simply break)
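To make the effect of that one-line change concrete, here is a toy model of the behaviour described above. It is only a sketch and not the library's code: the names conn_state, service_policy, service_one, service_pass and the retry cap max_retries are all hypothetical, chosen to show how a handler that returns 1 for a never-connected socket makes a retrying caller spin on the same slot, while break lets it move on.

```c
#include <stddef.h>

/* Hypothetical connection states, for illustration only. */
enum conn_state { CONN_NEVER_CONNECTED, CONN_ESTABLISHED };

/* Which behaviour the handler uses for a never-connected socket. */
enum service_policy { POLICY_RETURN_1, POLICY_BREAK };

/*
 * Hypothetical per-socket handler.
 * Returns 1 to signal "this connection is closing", 0 otherwise.
 */
static int service_one(enum conn_state st, enum service_policy policy)
{
    switch (st) {
    case CONN_NEVER_CONNECTED:
        if (policy == POLICY_RETURN_1)
            return 1; /* caller interprets this as "closing" */
        break;        /* suggested change: just leave the switch */
    case CONN_ESTABLISHED:
        break;
    }
    return 0;
}

/*
 * Hypothetical platform loop: while the handler reports "closing",
 * the same slot is retried, up to max_retries attempts (the cap is an
 * assumption made so the model terminates).
 * Returns the total number of handler calls made in one pass.
 */
static int service_pass(const enum conn_state *slots, size_t n,
                        enum service_policy policy, int max_retries)
{
    int calls = 0;

    for (size_t i = 0; i < n; i++) {
        int tries = 0;

        for (;;) {
            calls++;
            if (!service_one(slots[i], policy))
                break; /* slot handled, move to the next one */
            if (++tries >= max_retries)
                break; /* give up retrying this slot */
        }
    }
    return calls;
}
```

In this model, a pass over three slots with one never-connected socket and a cap of 5 makes 7 handler calls under POLICY_RETURN_1 and 3 under POLICY_BREAK. The extra calls are time during which no other slot is serviced.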
In our scenario, connecting to 100 hosts of which 30 are down would take about 2 minutes to complete on Windows and under a second on Linux. With this change, Windows goes from 2 minutes to completing 70 connects and 30 fails in under a second.
It does not feel right making a change in the common code to fix a platform-specific anomaly, but that change makes the most logical sense to me, so I'll just leave this as a suggestion.
I hope this helps! Marius.