Open georgeyanev opened 1 week ago
Since `EPOLLOUT` is received but `waitWrite` does not wake up, I worked around this by adding a timer that wakes up `waitWrite` at certain intervals and checks the connection status. See https://github.com/georgeyanev/go-raw-osfile-connect
For 1000 connections there are ~200 timer-induced wakeups.
I will be using this until (hopefully) this issue is resolved.
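A minimal sketch of what such a periodic re-check could look like. It is not the code from the linked repository: using `SetWriteDeadline` as the timer, and the names `connState`, `waitConnected`, and `demo`, are assumptions made for illustration. Each time the deadline fires, the wait is interrupted, the loop counts a wakeup, and the connection status is checked again. The demo uses an already connected socket pair only to keep the example short and deterministic.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
	"time"
)

// connState reports whether the socket has finished connecting (Linux).
func connState(fd int) (bool, error) {
	v, err := syscall.GetsockoptInt(fd, syscall.SOL_SOCKET, syscall.SO_ERROR)
	if err != nil {
		return false, os.NewSyscallError("getsockopt", err)
	}
	if v != 0 {
		return false, os.NewSyscallError("connect", syscall.Errno(v))
	}
	if _, err := syscall.Getpeername(fd); err != nil {
		if err == syscall.ENOTCONN {
			return false, nil // connect still in progress
		}
		return false, os.NewSyscallError("getpeername", err)
	}
	return true, nil
}

// waitConnected waits for write readiness on f, but re-checks the connection
// status at least once per interval (hypothetical helper). It returns the
// number of timer-induced wakeups.
func waitConnected(f *os.File, interval time.Duration) (int, error) {
	rc, err := f.SyscallConn()
	if err != nil {
		return 0, err
	}
	wakeups := 0
	for {
		if err := f.SetWriteDeadline(time.Now().Add(interval)); err != nil {
			return wakeups, err
		}
		var connErr error
		werr := rc.Write(func(s uintptr) bool {
			var done bool
			done, connErr = connState(int(s))
			return done || connErr != nil
		})
		switch {
		case werr == nil:
			f.SetWriteDeadline(time.Time{}) // clear the deadline again
			return wakeups, connErr
		case errors.Is(werr, os.ErrDeadlineExceeded):
			wakeups++ // the timer fired: loop and check the status again
		default:
			return wakeups, werr
		}
	}
}

// demo runs waitConnected on one end of a non-blocking socket pair.
func demo() (int, error) {
	fds, err := syscall.Socketpair(syscall.AF_UNIX,
		syscall.SOCK_STREAM|syscall.SOCK_NONBLOCK, 0)
	if err != nil {
		return 0, err
	}
	defer syscall.Close(fds[1])
	f := os.NewFile(uintptr(fds[0]), "socketpair")
	defer f.Close()
	return waitConnected(f, 50*time.Millisecond)
}

func main() {
	wakeups, err := demo()
	fmt.Println("timer wakeups:", wakeups, "err:", err)
}
```

In a real client, `f` would be the `os.File` wrapping the socket on which a non-blocking `connect` has just been started.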
CC @rsc, @ianlancetaylor, @griesemer.
Go version
go version go1.23.3 linux/amd64
Output of `go env` in your module/workspace:
What did you do?
I'm working with `os.File` for raw non-blocking socket communication. Occasionally I experience connect hangs from my client sockets. I made a small example with a client using `os.File` to wrap a non-blocking TCP socket.
The example can be fetched here: https://github.com/georgeyanev/go-raw-osfile-connect
This client connects to the remote side (the server) in a loop and writes a message upon successful connection. Then it closes the connection.
For connecting I use modified code from `netFD.connect` in Go's net package.
The original `connect` code calls `fd.pd.waitWrite` directly, and I cannot do that because I have no access to the poll descriptor. In the provided example, in order to call `fd.pd.waitWrite`, I use `rawConn.Write`, passing it a dummy function. The difference from the original code is that here, before calling `fd.pd.waitWrite`, `rawConn.Write` calls `fd.writeLock()` and `fd.pd.prepareWrite()`. I wonder whether calling these two functions could cause the problem. If so, there is no reliable way to call `fd.pd.waitWrite` upon connect.
Actually, I can run a few hundred or even a few thousand successful connects before a hang occurs. That's why there is a 100_000-iteration loop. When using standard TCP client code (`net.Dial`, `net.Conn`, etc.) there is no such issue.
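For readers without the repository at hand, here is a minimal sketch of the general technique described above: a non-blocking socket wrapped in `os.File`, with `RawConn.Write` used to wait for the connect to complete. It is not the code from the linked example. The names `dialFile` and `roundTrip` are hypothetical, it is Linux and IPv4 only, and, as a deliberate variation, the callback checks `SO_ERROR` and `getpeername` instead of being a pure dummy function.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"os"
	"syscall"
)

// dialFile starts a non-blocking TCP connect to addr, wraps the socket in an
// *os.File, and waits for the connect to finish (hypothetical helper).
func dialFile(addr *net.TCPAddr) (*os.File, error) {
	fd, err := syscall.Socket(syscall.AF_INET,
		syscall.SOCK_STREAM|syscall.SOCK_NONBLOCK|syscall.SOCK_CLOEXEC, 0)
	if err != nil {
		return nil, os.NewSyscallError("socket", err)
	}
	sa := &syscall.SockaddrInet4{Port: addr.Port}
	copy(sa.Addr[:], addr.IP.To4())
	// A non-blocking connect usually returns EINPROGRESS immediately.
	err = syscall.Connect(fd, sa)
	if err != nil && err != syscall.EINPROGRESS && err != syscall.EINTR {
		syscall.Close(fd)
		return nil, os.NewSyscallError("connect", err)
	}
	// os.NewFile hands an already non-blocking descriptor to the poller.
	f := os.NewFile(uintptr(fd), "tcp:"+addr.String())
	rc, err := f.SyscallConn()
	if err != nil {
		f.Close()
		return nil, err
	}
	var connErr error
	waitErr := rc.Write(func(s uintptr) bool {
		// Assumption: inspect the real socket state instead of a dummy.
		v, err := syscall.GetsockoptInt(int(s), syscall.SOL_SOCKET, syscall.SO_ERROR)
		if err != nil {
			connErr = os.NewSyscallError("getsockopt", err)
			return true
		}
		if v != 0 {
			connErr = os.NewSyscallError("connect", syscall.Errno(v))
			return true
		}
		if _, err := syscall.Getpeername(int(s)); err != nil {
			if err == syscall.ENOTCONN {
				return false // in progress: wait for write readiness
			}
			connErr = os.NewSyscallError("getpeername", err)
		}
		return true
	})
	if waitErr == nil {
		waitErr = connErr
	}
	if waitErr != nil {
		f.Close()
		return nil, waitErr
	}
	return f, nil
}

// roundTrip sends msg to a local listener through dialFile and returns what
// the server side received.
func roundTrip(msg string) (string, error) {
	ln, err := net.Listen("tcp4", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	defer ln.Close()
	got := make(chan string, 1)
	go func() {
		c, err := ln.Accept()
		if err != nil {
			got <- ""
			return
		}
		defer c.Close()
		b, _ := io.ReadAll(c)
		got <- string(b)
	}()
	f, err := dialFile(ln.Addr().(*net.TCPAddr))
	if err != nil {
		return "", err
	}
	if _, err := f.Write([]byte(msg)); err != nil {
		f.Close()
		return "", err
	}
	if err := f.Close(); err != nil {
		return "", err
	}
	return <-got, nil
}

func main() {
	got, err := roundTrip("hello")
	fmt.Println("server received:", got, "err:", err)
}
```

Returning `false` from the callback is what makes `RawConn.Write` wait for write readiness and then call the callback again. A callback that returns `false` unconditionally on its first call depends entirely on that wakeup arriving.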
Is this behaviour expected, or is it an issue that should be fixed?
This issue is tested on:
using Go versions 1.22.6 and 1.23.3
What did you see happen?
I saw connect hanging after a few hundred or a few thousand requests.
In the following `strace -fTtt` output of the client, I see that the `epoll` event for writing (`EPOLLOUT`) is received by a PID different from the PID that called `connect`, and then a new `epoll_pwait` call is made from the PID that called `connect`, this time waiting forever:
And the program hangs from now on.
What did you expect to see?
I expect all 100_000 connect and write cycles to pass successfully.
I expect to be able to use the non-blocking connect with `os.File` reliably. Please suggest whether there is some other proper way of doing this.