Sockets stop reading after a big packet is transferred

hellerve commented 2 years ago

Bug description

When large enough packets are transferred, the socket (at least on macOS) will become unresponsive. The behavior seen is that the write semaphore on the socket is signalled, but the read semaphore never is (this can be observed inside dataHandler in SocketPluginImpl.c—AIO_W is set, signalling the write semaphore, but AIO_R is not and doesn’t seem to ever get signalled again).

We confirmed that signalling the read semaphore as well when AIO_W is set fixes the issue (see also https://github.com/akgrant43/opensmalltalk-vm/commit/dc888ece23098ddb44d88033c06498aa91b9ff99 disregarding the calls to puts).

To Reproduce

Steps to reproduce the behavior:

You can reuse the code provided in #10980, changing the data length to something “big”. Anything upwards of 1,000,000 will trigger the issue on the first packet, anything upwards of 200,000 should lead to lock up after a while.
Running one server and one client should suffice.
If the packet is large enough, the first packet will lock up the socket. Otherwise the socket should lock up after a while.

Expected behavior The data transfer of 10,000 packets should complete.

Screenshots No screenshots seemed conducive for these experiments at the time, I’d be happy to provide anything though.

Version information:

OS: confirmed on arm64 macOS, experiments on other platforms will follow
Version: 12.2.1
Pharo Version: 9.0.13

Expected development cost The fix outlined above seems to do the trick and not cause any other problems, but of course this might just be because we didn’t test the right things :) I’d be happy to help resolve this or provide additional context if needed.

Additional context Let me know if you need anything else!

Cheers

welcome[bot] commented 2 years ago

Thanks for opening your first issue! Please check the CONTRIBUTING documents for some tips about which information should be provided. You can find information of how to do a Pull Request here: https://github.com/pharo-project/pharo/wiki/Contribute-a-fix-to-Pharo

hellerve commented 2 years ago

I just tried to reproduce this issue with the latest version of Pharo 10 (build 1575) on the same platform (macOS 12.2.1). It seems that while occasional hangups still happen (every few thousand packets or so), it always recovers if I give it a few seconds to a minute.

To ensure I wasn’t too impatient before, I ran the tests again on Pharo 9. There the socket does not seem to recover, I can leave it open until I run into a ConnectionTimedOut error (10 minutes by default I think?).

Ducasse commented 2 years ago

Thanks for the report. We are chasing a bug with a strange slow down in socket and we would love to be able to reproduce it. @tesonep

daniels220 commented 1 year ago

I just tried the reproduction in Pharo 11[^1] on Linux. Message size 1MB ±16kB, 4 clients, 10k reps each. Three of the four clients finished their run normally, but with the fourth one, it sent the 3405th packet and the server encountered this bug, so the server is just sitting there with an open but broken socket and the client got a "Data receive timed out" waiting for the response.

So, in that environment it's definitely a lot better than the reported issue, but it still seems to be barely possible to have a problem.

EDIT: I re-ran with smaller-but-still-large messages (256kB ±16kB) and more clients (8)—half the total data exchanged. This was overall more problematic—three of the clients experienced a timeout due to server-side failure of the socket.

All eight clients also had the dataAvailable warning from the watchdog appear in their own Transcript, between 3 and 16 times[^2] over the course of the last thousand or so messages (which is as far back as the Transcript preserved using its default settings). Looking at the repro and watchdog code, this part may be a false alarm—a race condition where the socket has data available, the semaphore has been signaled, but the data hasn't been read yet when the watchdog takes a look.

I also think the fact that the client code has a timeout set on the SocketStream would cause it to unstick itself after the timeout elapses (without actually reporting a timeout)—looking at Socket>>waitForDataFor:ifClosed:ifTimedOut:, it waits on the readSemaphore with a timeout, then checks dataAvailable even if the wait ends because the timeout expires rather than the semaphore being signaled. So if data became available sometime in between, it will happily proceed, and assuming the semaphore is properly signaled when more data arrives, everything will recover. However I'm sure this isn't what actually happened in my case, because the timestamps in the Transcripts don't have a big gap when the warning appears.

I don't know whether the server experienced any cases where the watchdog warned but it was able to recover as the Transcript contents have been overwritten by the long tail of warnings after the run completed (from the three sockets that did break).

One thing that jumps out at me about the repro code: why set the send and receive buffers to 4kB? On my test system, the defaults are 16kB send and somewhere around 85kB receive[^3]. I'll try leaving them alone and see if that makes a difference...seems like the smaller those buffers are, possibly the more often the semaphore will be involved?

EDIT 2: I tried running a whole bunch of clients (32) all in the same image—so just two images, client and server, but 32 sockets open between them. Message size 128kB ±8kB. One of the clients encountered the issue after 98 cycles, then I did something that broke the lorem ipsum generation and had to close the images—but it seems clear that, in the current state of things, message size is less important than total traffic across all sockets and (probably) number of simultaneously-active sockets.

[^1]: Smalltalk vm version output:
'CoInterpreter VMMaker-tonel.1 uuid: 55ac98c3-a5b7-0d00-9fe1-a0050fd32d39 May 9 2023
StackToRegisterMappingCogit VMMaker-tonel.1 uuid:
55ac98c3-a5b7-0d00-9fe1-a0050fd32d39 May 9 2023
v10.0.5 - Commit: 2757766 - Date: 2023-05-05 13:46:09 +0200 [^2]: 1: 9 warnings.
2: Timeout on 2212, 13 warnings.
3: Timeout on 6094, 3 warnings.
4: Timeout on 1309, 8 warnings.
5: 10 warnings.
6: 10 warnings.
7: 16 warnings.
8: 8 warnings. [^3]: 87380, which is 2 * 2 * 5 * 7 * 257...wonder if there's a logic to this?

daniels220 commented 1 year ago

I made some improvements to the socket load-testing code from #10980—pulled the data sizes to class variables, added similar configuration for send/receive buffer sizes, and attempted to improve detection of the hang case by examining the readSemaphore directly, not just the dataAvailable state. GToolkit-Utility-Sockets.st.txt

pharo-project / pharo

Sockets stop reading after a big packet is transferred #11083