warmcat / libwebsockets

canonical libwebsockets.org networking library
https://libwebsockets.org

Outgoing transfer speed going down after sending 8MB packets. #994

Closed kgretzky closed 7 years ago

kgretzky commented 7 years ago

Hi!

I'm having a weird issue that I encountered while stress testing the library against a websockets server written in Go. When I send large fragmented packets of around 8MB, the first packet is sent at normal speed and then the next one is sent much, much slower. This happens randomly, and if I restart the server while the client is still running, the same thing will happen almost immediately.

It helped a bit when I increased the TX packet buffer size from 4KB to 128KB, but the issue persists. The larger the buffer I send, the higher the chance of the persistent slowdown appearing. I'm running the tests on Windows 7 x64.

Here is a gif I recorded showing the output of the client and server when the issue appears. The bigger window shows the server output. As can be seen, the first 8MB packet is sent super fast, while the second one, sent 10 seconds later, experiences the issue.

[animated gif: lws_stutter]

Let me know if I can provide you with any additional information that may help. I can also share the source code of the stress tests I'm testing on if that helps.

lws-team commented 7 years ago

What part of that animation reflects the problem?

kgretzky commented 7 years ago

The part where the server starts to receive the second packet. Retrieval of the first packet happens almost instantaneously, while the second one is retrieved with significant slowdowns.

lws-team commented 7 years ago

Have you tried:

Basically lws works by sending as much into the tcp window as possible (of course, it depends on your user code too). So it's normal that if the window fills, transmission will happen at intervals (after the remote side sends ACKs). But assuming the

kgretzky commented 7 years ago

I had a suspicion that it may be happening because of some system-level throttling. The strangest thing is that even if I wait those 10 seconds between sending packets and make sure the queue is empty, the issue still persists.

Thanks for the suggestions, I will try them out. I think writing a client in Go that sends the same amount of traffic will give me some ideas.

kgretzky commented 7 years ago

Andy,

I've just found out exactly how the bug is triggered, and it is very easy for me to reproduce. The transfer delay is tightly related to the timeout_ms value passed in the lws_service() call. Let me explain how I managed to trigger it, based on example values:

Let us say that the buffer size for the protocol we use is tx_size = 128*1024 bytes. If I send a first packet of size p_size <= 2*tx_size, the next packet is sent fine, without any delays during transmission. If I send a first packet of size p_size > 2*tx_size (in this example, over 256*1024 bytes), the next packet will stall for timeout_ms (as passed to lws_service()) every time tx_size bytes are sent.

When I lowered the buffer size to 64*1024 and set timeout_ms to 1000ms, sending the second packet paused for 1 second after every 64*1024 bytes.

Let me know if you have any ideas what could be the cause of such behavior.

It seems that when the bug is triggered, every other execution of the lws_service() call has nothing to do and waits for the duration of timeout_ms.

lws-team commented 7 years ago

Do you involve threads anywhere?

lws-team commented 7 years ago

Or perhaps a better way to put it, how and where do you ask for the writeable callback?

kgretzky commented 7 years ago

No, everything is run from a single thread.

After briefly checking the library code, my intuition tells me that lws_service_adjust_timeout() may be the culprit, returning non-zero when there are still TX buffers to be handled, which is then passed to WSAWaitForMultipleEvents().

The whole code is one simple while loop. The writeable callback takes what is in the pending buffer and runs lws_write() on tx_size chunks of that buffer.

kgretzky commented 7 years ago

When I set timeout_ms for lws_service() to 0, everything works perfectly fine, apart from the CPU running at 100%. I'm pretty sure it must be related to some events not triggering when they should when WSAWaitForMultipleEvents() is called in _lws_plat_service_tsi().

mjmessina commented 7 years ago

As Andy stated, windows simulates the polling. The lib also needed to work around the 64-event limit in WSAWaitForMultipleEvents().

I also think there may be an issue. I have worked around it and can stream gigs of data from multiple threads on windows with no issues. I rely completely on the signals and do not need any tight-loop polling.

Some recommendations:

- Don't send more than the buffer size specified in rx_buffer_size (which defaults to 4k).
- Always call lws_partial_buffered() to see if you can send; if you can't, back off and call lws_callback_on_writable(). You manage the real buffer yourself, send it in chunks smaller than rx_buffer_size, and call lws_callback_on_writable() to continue until your complete buffer is sent.
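
One pass of that writeable callback could be sketched roughly as follows (a non-authoritative sketch: `pss` is a hypothetical per-session struct `{ unsigned char *buf; size_t len, sent; }` whose `buf` has `LWS_PRE` bytes of headroom before the payload, and error handling is minimal):

```c
case LWS_CALLBACK_SERVER_WRITEABLE: {
    /* if lws still holds a truncated send from last time,
     * don't write anything new; just ask to be called again */
    if (lws_partial_buffered(wsi)) {
        lws_callback_on_writable(wsi);
        break;
    }

    size_t chunk = 4096;              /* keep below rx_buffer_size */
    if (chunk > pss->len - pss->sent)
        chunk = pss->len - pss->sent;

    if (lws_write(wsi, pss->buf + LWS_PRE + pss->sent, chunk,
                  LWS_WRITE_BINARY) < 0)
        return -1;                    /* fatal: close the connection */
    pss->sent += chunk;

    if (pss->sent < pss->len)
        lws_callback_on_writable(wsi);  /* come back for the rest */
    break;
}
```

This is the shape the fix takes later in the thread, where the single write becomes a loop that bails out as soon as lws_partial_buffered() returns nonzero.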

kgretzky commented 7 years ago

Thanks!

I will definitely check out lws_partial_buffered(), as I was even thinking of doing such checks but didn't know it was possible.

mjmessina commented 7 years ago

This always bugged me. I think I found the issue.

In lws_write(), there is a check not to send more than the buffer length, and it then chunks the data. If lws_issue_raw() sends a good chunk out, lws_write() just returns and does not call lws_callback_on_writable().

The impact of this is that the rest of the chunks only get sent out on the timeouts, because the event is never set.

kgretzky commented 7 years ago

Great find, Mike!

I was actually looking at this function today, but couldn't fully understand how it works. Do you think that, after sending a good chunk out, it should check whether there is data still pending to be sent and, if so, call lws_callback_on_writable() before returning, to signal the event?

mjmessina commented 7 years ago

Yes, I believe so.

In output.c I put the code below in and it works fine; I just don't know the full impact of it. It goes right by the send_raw: label, in the switch case for LWS_WRITE_HTTP:

    int sendSize = lws_issue_raw(wsi, (unsigned char *)buf - pre, len + pre);
    if (sendSize < orig_len)
        lws_callback_on_writable(wsi);

    return sendSize;

kgretzky commented 7 years ago

Thanks a lot for this snippet. I will test with that fix as well. Maybe @lws-team can throw in his two cents on whether that breaks any important logic?

lws-team commented 7 years ago

Thanks @mjmessina ... no, it shouldn't break anything. I'll try it out here on Linux as a general fix later this morning... I guess if it impacts anything, it's probably needed generally, and the code for partials was periodically covering for its absence.

lws-team commented 7 years ago

lws_issue_raw() itself already manages the POLLOUT wait (== lws_callback_on_writable()) if there was a truncated send... there only seem to be two ways out of it: either it returns -1, meaning something bad happened, or the length it was called with. If it wasn't able to send everything, it will have called lws_callback_on_writable() itself already.

In the patch it's called with length len + pre, but you are checking the return against orig_len; so what the patch actually does is force POLLOUT active every time you start a ws packet, which will have pre nonzero. So I think if this helps, it helps accidentally, by shotgunning POLLOUT around, and doesn't directly reflect the root cause.

@kgretzky does the patch from mjmessina get rid of your symptom?

mjmessina commented 7 years ago

Actually I just updated.

I was on v1.7-stable. It doesn't work for me on the newer version: same symptom, where the chunks are sent out on the timeouts. I would be surprised if it worked for him.

lws-team commented 7 years ago

@mjmessina Thanks... is there anything that seems to trigger the transition to having to sit there for the poll() timeout? It seems like it doesn't start out like that.

Is it even actually related to partial sends? What I am not understanding is that if we missed an lws_callback_on_writable(), it shouldn't get handled once per service; it should stall and never get handled, AFAIK. Any debugging clues are very much appreciated.

mjmessina commented 7 years ago

I was on my own fork, based off v1.7-stable or v2.0 (pre my event fix). Will try to debug more, because I do want to get off my version and update to v2.3.

Getting closer; will debug more tomorrow. I think the lws_callback_on_writable() is getting reset during the handling in lws_service_fd_tsi(). Not sure.

mjmessina commented 7 years ago

I hope I am not jumping the gun here again... It's the logic that adjusts timeout_ms in _lws_plat_service_tsi() in lws-plat-win.c.

The timeout should go to zero if the socket can still write and there is some FD with pending data to send.

lws-team commented 7 years ago

No... sending data must be regulated by POLLOUT. It's the only thing we have connected to the remote side consuming data.

Forced service is for situations where we have local data that is never going to receive another network event to trigger consuming it. Eg

For these kind of cases, the network action we could trigger from has already passed, and we must force the service without waiting for anything from the network.

What should happen, if we just want to send more, is that lws_callback_on_writable() sets the POLLOUT event on that socket. When we enter the poll() wait on it (or whatever abomination it is on windows), poll() should return immediately by itself if it sees we could already write more on that socket. Only if the connection is stuffed, ie the tcp window filled and no ACK received yet, and no other connection has a pending network event, should we idle in poll() as we seem to be doing.

Before entering the "poll()" it'd be good to dump the .events of the pollfds of the active sockets and confirm POLLOUT is always set on the one we are trying to write to.

mjmessina commented 7 years ago

Yes, POLLOUT is set, but the event is not triggering. My guess is that WSAEWOULDBLOCK never triggered on the socket, so the write event does not get set.

Like you said, putting the timeout to zero would be bad. I would recommend setting the event in the loop prior to the lws_service_adjust_timeout() call. That way WSAEnumNetworkEvents() would kick in and do the right thing.

I even think some of the code can be cleaned up too. The loop would look something like:

    for (i = 0; i < pt->fds_count; ++i) {
        pfd = &pt->fds[i];

        if (!(pfd->events & LWS_POLLOUT))
            continue;

        /* note: check wsi for NULL before dereferencing it */
        wsi = wsi_from_fd(context, pfd->fd);
        if (!wsi || wsi->listener || wsi->sock_send_blocking)
            continue;

        WSASetEvent(pt->events[0]);
    }

mjmessina commented 7 years ago

Not sure of the complete impact, but it works for the issue I am seeing. Here is what I have now (I just added the wsi->trunc_len check and set the event):

    for (i = 0; i < pt->fds_count; ++i) {
        pfd = &pt->fds[i];

        if (!(pfd->events & LWS_POLLOUT))
            continue;

        /* note: check wsi for NULL before dereferencing it */
        wsi = wsi_from_fd(context, pfd->fd);
        if (!wsi || wsi->listener || wsi->sock_send_blocking)
            continue;

        pfd->revents = LWS_POLLOUT;
        n = lws_service_fd(context, pfd);
        if (n < 0)
            return -1;

        /* if something closed, retry this slot */
        if (n)
            i--;

        /* a truncated send is pending: signal the event so the next
         * wait wakes immediately instead of sitting out the timeout */
        if (wsi->trunc_len)
            WSASetEvent(pt->events[0]);
    }

lws-team commented 7 years ago

I don't understand how that helps.

The win32 code was contributed and seems to be doing completely broken things.

I guess the "poll()" part is supposed to be this

    ev = WSAWaitForMultipleEvents(1, pt->events, FALSE, timeout_ms, FALSE);
    if (ev == WSA_WAIT_EVENT_0) {
...

But that is nothing like poll()... from https://msdn.microsoft.com/en-us/library/windows/desktop/ms742219(v=vs.85).aspx the first arg is the number of events to wait on... fixed at 1?

What is the code you are editing above even supposed to do? It just services anything with POLLOUT set in its .events, whether or not it's ready to send anything else on the socket.

The root cause of the mess is that windows doesn't feel it needs to provide posix poll(), so it can't just share the known-working code for this from the Linux / posix implementation. And I don't have windows, or any wish to mess with its nonsense APIs.

We have had successful users as recently as v2.2 using libuv as the backend on windows, avoiding all this. Can someone give that a try?

mjmessina commented 7 years ago

Agreed, windows really doesn't have good semantics for this type of usage.

I have worked around this and really don't need the fix. I never let the API chunk; my app controls it. If the API does chunk the sends, there is a potential issue, depending on your timeout value.

It helps by forcing the flushing of anything left in the buffer due to chunking when there is no back pressure on the socket (i.e. we are not hitting a WSAEWOULDBLOCK condition during a send).

No, it does not just blindly service anything with POLLOUT. WSAEnumNetworkEvents() does the proper checking, so if there is back pressure (i.e. WSAEWOULDBLOCK is set), it will not service that connection.

For me I am good, I have worked around this.

lws-team commented 7 years ago

WSAWaitForMultipleEvents() just checks the first socket... it can't be right, unless I'm missing something major.

By contrast the correct implementation for *nix is

https://github.com/warmcat/libwebsockets/blob/master/lib/lws-plat-unix.c#L104

which waits on everything. It'd be great if someone could check libuv, with

cmake -DLWS_WITH_LIBUV=1 -DLIBUV_INCLUDE_DIRS=\path\to\libuv\include -DLIBUV_LIBRARIES=\path\to\libuv.lib

then I can retire this nasty stuff by requiring libuv on windows. As I say, windows + libuv was working fine for users a few months ago.

Edit: and FWIW I build-test windows + libuv, eg, here

https://ci.appveyor.com/project/lws-team/libwebsockets/build/1.0.1846/job/355ahalofcrqm3e8

mjmessina commented 7 years ago

There is one event for all sockets. WSAEnumNetworkEvents() is the method that figures out which socket triggered the event.

lws-team commented 7 years ago

OK thanks for explaining that part of the madness.

I still go cross-eyed looking at WSAWaitForMultipleEvents(1, ...): what is the "event" we are "waiting" for representing? When it iterates WSAEnumNetworkEvents() later, it does so without reference to anything related to pt->events. Does WSAWaitForMultipleEvents() actually do anything except block us for timeout_ms?

Anyway, I pushed the active part of your patch; @kgretzky please try 0bb3646256b94599927f42aa186fb89a552dfc7b. From what I can make out, as it is we never properly go idle in the way poll() does. It's much recommended to use libuv on Windows.

mjmessina commented 7 years ago

No, WSAWaitForMultipleEvents() actually does wait for "a socket" to be signaled.

There is really no need for "multiple events": just one event is used by all sockets. When setting up a socket, you associate an event to be signaled, and that same single event is used for all sockets. Any socket needing servicing will signal the event.

Once the event is triggered, since only one was used, you need to figure out which socket triggered it. That is what WSAEnumNetworkEvents() does: it figures out which socket triggered the event and why it was signaled (i.e. write, read, disconnect, etc).
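
That model might look roughly like this (a WinSock sketch based on the description above, not lws's actual code, and only meaningful on Windows):

```c
/* One event object serves every socket: */
WSAEVENT ev = WSACreateEvent();

/* At setup time, each socket registers the same shared event for
 * its read / write / close activity: */
WSAEventSelect(sock, ev, FD_READ | FD_WRITE | FD_CLOSE);

/* The service loop waits on that single event, hence the "1": */
if (WSAWaitForMultipleEvents(1, &ev, FALSE, timeout_ms, FALSE) ==
    WSA_WAIT_EVENT_0) {
    /* Something fired: walk the sockets to find which one and why.
     * WSAEnumNetworkEvents() also resets the socket's event record. */
    WSANETWORKEVENTS ne;
    WSAEnumNetworkEvents(sock, ev, &ne);
    if (ne.lNetworkEvents & FD_WRITE)
        ; /* this socket became writable again */
}
```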

kgretzky commented 7 years ago

Hey guys.

Sorry, I've been unavailable for the last 24 hours. Thank you for jumping in on the issue. I've just tested both of your patches and they do improve the sending, but only when I use the large 256KB protocol buffer size. If I try to use 4KB, which is the default value, it gets stuck after a short while when trying to push 8MB of data, and goes into the same timeout delay loop.

I've also seen this issue appear right away when I shut down the server and let the libwebsockets client retry the connection for several seconds. After I relaunch the server and the client finally connects, the send clog happens immediately. @lws-team #994 did fix this issue though.

I've tried @mjmessina's suggestion to check lws_partial_buffered() before trying to do lws_write(), but it was actually never set, as the docs state: "This is never set at the start of a writeable callback, but any write may set it." I do only one lws_write() per callback.

UPDATE

I modified my code to do lws_write() in a loop in the callback function and, as @mjmessina suggested, I made it bail out with lws_callback_on_writable() as soon as lws_partial_buffered() returned 1.

Now it works perfectly, even without the patches you both posted, so this must be the proper way to go on Windows to avoid issues with polling. I can now use even the small 4KB buffer and it sends the packets out blazingly fast. Tested with a timeout of 1000ms, and not once did it clog and stop to wait.

It also picks up, without the aforementioned issues, after I shut down the server randomly.

lws-team commented 7 years ago

Okay... It feels a bit like the ratio of waits to having filled the pipe got reduced to invisibility by looping until the pipe was full, rather than the underlying issue on the windows platform being fixed. But no visible problem is certainly good enough to be going on with. Thanks for the feedback, and thanks for the help @mjmessina.