lws_service seems to block on failed connect

lailoken commented 2 months ago

Greetings,

In our scenario (and only on our Windows builds, Linux seems fine) we connect to dozens of servers at the same time. In the case where (about) half of these never connect (where the server is down etc.) all processing seems to stop for a few seconds, causing cascading timeouts in other connections.

Thus no scheduled callbacks are called for this time, nor are any other sockets processed (opened, closed, recv, send).

While debugging we noticed repeated calls in code-net/service.c in this case statement around line 770:

case LWS_HPI_RET_WSI_ALREADY_DIED:
        pt->inside_lws_service = 0;
        return 1;

So the connection has never connected (already died?) and because this returns a 1, the caller function assumes the connection is closing (it is not; it never connected) and retries the slot in the windows-specific code in plat/windows/windows-service.c:

        for (i = 0; i < pt->fds_count; ++i) {
                pfd = &pt->fds[i];

                if (!(pfd->events & LWS_POLLOUT))
                        continue;

                wsi = wsi_from_fd(context, pfd->fd);
                if (!wsi || wsi->listener)
                        continue;
                if (wsi->sock_send_blocking)
                        continue;
                pfd->revents = LWS_POLLOUT;
                n = lws_service_fd(context, pfd);
                if (n < 0)
                        return -1;

                /*
                 * Force WSAWaitForMultipleEvents() to check events
                 * and then return immediately.
                 */
                timeout_us = 0;

                /* if something closed, retry this slot */
                if (n)
                        i--;
        }

This has been the cause of hours of frustration and has rendered our windows platform unreliable. I am not sure what the correct fix is, but changing the service.c code to the following brought our windows platform to parity with our Linux one:

case LWS_HPI_RET_WSI_ALREADY_DIED:
        pt->inside_lws_service = 0;
        break;

(Changed the return 1 to simply break)

In our scenario, connecting to 100 hosts of which 30 are down, would take about 2 minutes to complete in Windows and under a second in Linux... this change brings windows down from 2 minutes to completing 70 connects and 30 fails in under a second.

It does not feel right making a change in the common code to fix a platform-specific anomaly, but that change makes the most logical sense to me, so I'll just leave this as a suggestion.

I hope this helps! Marius.

lailoken commented 2 months ago

PS: The knock-on effect of this was also that PING/PONGs were not processed, sending us down incorrect rabbit holes.

lws-team commented 2 months ago

It sounds like you found a gnarly problem with the windows plat stuff... it might be that what you're doing also creates a different problem in the scenario we are dealing with cleaning a dead socket for *nix.

Since I don't recall any outstanding problems on nix, I pushed on main a patch that makes your change only for windows and leaves it as-is for nix.

lailoken commented 2 months ago

Thanks! Will test your user your patch instead, glad I could help!

warmcat / libwebsockets

lws_service seems to block on failed connect #3208