Closed yihuang closed 5 years ago
Thanks for bringing this up.
Maybe we should leave RemoteEndPoint in init state in step B.3, then step B.2 will continue to initialize it further.
That could work provided that B.2 doesn't die first due to a connection timeout expiring, as the later problem seems call for.
Another problem is the wait for crossed MVar in step B.2 is not with a timeout, if B's connect request failed to get Crossed reply for network issues, B.2 would wait indefinitely.
Looks like a problem indeed. Possibly solvable with a transport-wide timeout parameter.
Maybe we should leave RemoteEndPoint in init state in step B.3, then step B.2 will continue to initialize it further.
I have a vague recollection that the remote endpoints could pile up and leak if we don't remove them here, but I'd need to check the git history to be sure.
I think B.3 should wait for resolve MVar with timeout after firing crossed MVar, if B.2 get A's connect request successfully, then B.3 got notified, and re-use the RemoteEndPoint happily. If B.2 fails to get A's connect request, B.3 can still timeout.
Here's a better fix, I believe.
Have B.2 grab the remote endpoint state preventing any thread from modifying it. Decide if it is a crossed request. If it is, restore the old remote endpoint state. If it is not a crossed request, put the remote endpoint in Valid state.
This will fix the first issue because B.3 only removes the remote endpoint if it is in Init state. And it will fix the second issue, because B.2 does no longer need to wait on the crossed MVar.
I agree, it seems there is no race condition if we have this lock.
A sequence diagram to facilitate the problem understanding:
The problem is in step
B.2
, after B waits for the crossed Event, B re-uses theRemoteEndPoint
value(code), but in stepB.3
B will close theRemoteEndPoint
object and remove it(code). I think it's problematic when stepB.2
andB.3
are executing concurrently.
Actually, B.3 do wait for the "resolved" MVar (at findRemoteEndPoint) before checking for the possible removal of the RemoteEndPoint at createConnectionTo. But the "resolved" will be put and visible to B.3 only after B.2 gets the "crossed" MVar from B.3, sends ConnectionRequestAccepted and set RemoteEndPointValid at resolveInit. And at the time of checking for the removal at createConnectionTo, the endpoint is no longer in the Init state anymore (modifyMVar for the remoteState makes sure that transition from the Init state is completed). So it looks like we are safe here.
Another problem is the wait for crossed MVar in step B.2 is not with a timeout...
As for this problem, we can tryPut the "crossed" MVar at resolveInit. If there will be no ConnectionRequestCrossed reply for B.1 request, it will timeout and call the resolveInit anyway at setupRemoteEndPoint.
Another problem is the wait for crossed MVar in step B.2 is not with a timeout...
As for this problem, we can tryPut the "crossed" MVar at resolveInit. If there will be no ConnectionRequestCrossed reply for B.1 request, it will timeout and call the resolveInit anyway at setupRemoteEndPoint.
See https://github.com/haskell-distributed/network-transport-tcp/pull/81.
I agree with Andriy. Fixed via #81.
Let me describe the whole process first:
The problem is in step
B.2
, after B waits for the crossed Event, B re-uses theRemoteEndPoint
value(code), but in stepB.3
B will close theRemoteEndPoint
object and remove it(code). I think it's problematic when stepB.2
andB.3
are executing concurrently. Maybe we should leaveRemoteEndPoint
in init state in stepB.3
, then stepB.2
will continue to initialize it further.Another problem is the wait for crossed MVar in step
B.2
is not with a timeout, if B's connect request failed to get Crossed reply for network issues,B.2
would wait indefinitely.