Closed: david-eckardt-sociomantic closed this issue 6 years ago.
For reference: RequestOnConnSet.
The obvious solution would be to make RequestOnConnSet.finished remove the RequestOnConn object from the set. Then Connection.setReceivedPayload would not be able to find it. (It would then silently ignore the received message; whether this is a good thing to do here may be subject to discussion.)
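As a rough illustration of that option, here is a Python sketch (the actual swarm code is D; `RequestOnConnSet`, `finished`, `get` and `set_received_payload` here are stand-ins for the real API, not its actual signatures):

```python
class RequestOnConnSet:
    """Sketch: finished() removes the ROC, so later lookups find nothing."""

    def __init__(self):
        self._by_node = {}

    def add(self, node, roc):
        self._by_node[node] = roc

    def finished(self, node):
        # Remove the RequestOnConn so setReceivedPayload can no longer
        # look it up.
        self._by_node.pop(node, None)

    def get(self, node):
        return self._by_node.get(node)


def set_received_payload(roc_set, node, payload):
    """Sketch of the lookup step in Connection.setReceivedPayload."""
    roc = roc_set.get(node)
    if roc is None:
        # Finished ROC: the late message is silently ignored, which is
        # exactly the behaviour under discussion here.
        return False
    roc.resume(payload)  # would resume the handler fiber
    return True
```

A terminated handler can then never be resumed, at the cost of dropping the late message on the floor.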
I think it'd be nicer to mark ROCs in the set as finished, then throw a protocol error if get looks one of them up.
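That alternative could look something like this (again a hypothetical Python sketch of the D code, with made-up names):

```python
class ProtocolError(Exception):
    pass


class RequestOnConn:
    def __init__(self, node):
        self.node = node
        self.finished = False


class RequestOnConnSet:
    """Sketch: ROCs stay in the set, but finished ones poison lookups."""

    def __init__(self):
        self._by_node = {}

    def add(self, roc):
        self._by_node[roc.node] = roc

    def finished(self, roc):
        # Keep the ROC registered; just mark it as finished.
        roc.finished = True

    def get(self, node):
        roc = self._by_node.get(node)
        if roc is not None and roc.finished:
            # A message arrived for a request-on-conn that has already
            # ended: raise instead of resuming a dead fiber.
            raise ProtocolError("message for finished request-on-conn")
        return roc
```

This keeps all of a request's ROCs visible until the whole request ends, and turns the unexpected message into an explicit error rather than a silent drop.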
We were unable to reproduce this behaviour with swarm v5.0.3.
Any idea why that might be?
I think this is only because of the race condition. If all nodes happen to finish before the first node sends an additional message, this will not happen, since the entire request will be removed from the set of active requests.
We're able to reproduce it reliably with neo-beta-3 because the protocol mismatch/handshake makes this situation happen every time.
I think it'd be nicer to mark ROCs in the set as finished, then throw a protocol error if get looks one of them up.
I agree, but we should be careful about fiber leaks here. What if we have a node where the request-on-conn exits due to a node error every time and the request can't ever be continued, and we have a long-living request on the other nodes? This request-on-conn would live forever, and if we spawn many such requests, we'll end up with a problem.
I admit it's a bit of a theoretical assumption, though.
We still need to figure out how this happens on a node restart.
My assumption is that we have a similar situation, but just with the node_disconnected message. If something happens to be in the connection buffer, there might be a race condition (I'm not sure where, though) where the receiver loop still dispatches the message, but the RequestOnConn has already finished (maybe the request-on-conn dies first, then we shut down the receiver loop?), and the entire request is still alive because other nodes are connected.
The obvious solution would be to make RequestOnConnSet.finished remove the RequestOnConn object from the set. Then Connection.setReceivedPayload would not be able to find it. (It would then silently ignore the received message; whether this is a good thing to do here may be subject to discussion.)
I think it'd be nicer to mark ROCs in the set as finished, then throw a protocol error if get looks one of them up.
Why would that be better?
I guess I was thinking that even when an ROC is finished, it's still part of the request and hence makes sense to keep in the set. That's more of a conceptual point, though, so shouldn't override actual code concerns, if it's awkward to implement that way.
I'm asking because there might be a reason why finished ROCs are currently left in the ROC-Set until the whole request is finished. It is unfortunately not documented (my bad). Could it be because they shouldn't be returned into the pool too early? :thinking:
One place where we need to be careful is reusing the request id. If you get the ROC from the pool for a new request, will swarm use its (recycled) id?
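To illustrate the id-reuse worry (a hypothetical Python sketch, not the actual pool code): if recycling does not clear the request id, the next user of the pooled object starts out with the stale one, so the sketch clears it on recycle and assigns the new id explicitly on acquisition.

```python
class RequestOnConn:
    def __init__(self):
        self.request_id = None


class RocPool:
    """Free-list pool of RequestOnConn objects (illustrative only)."""

    def __init__(self):
        self._free = []

    def get(self, request_id):
        roc = self._free.pop() if self._free else RequestOnConn()
        # Assign the new request's id explicitly; silently reusing
        # whatever id the object held before recycling would be the
        # bug discussed above.
        roc.request_id = request_id
        return roc

    def recycle(self, roc):
        roc.request_id = None  # defensively drop the stale id
        self._free.append(roc)
```

For example, `pool.get(42)` followed by `recycle` and `pool.get(43)` hands back the same object, but carrying the new id rather than the recycled one.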
On a protocol mismatch the following situation can happen for an all-nodes request:

1. RequestOnConn handler fibers are started, one for each node, and connect to the nodes.
2. One node reports a protocol mismatch, so its handler fiber terminates, calling RequestOnConnSet.finished, which leaves the RequestOnConn object registered with the RequestOnConnSet. The request is still active because the other fibers haven't started handling the request yet.
3. The node whose RequestOnConn handler fiber has just terminated sends another message for the request.
4. Connection.setReceivedPayload looks up the request by ID, from that it gets the -- terminated but still registered -- RequestOnConn handler by node address, and calls setReceivedPayload.
5. setReceivedPayload attempts to resume the terminated RequestOnConn fiber. :boom:

The obvious solution would be to make RequestOnConnSet.finished remove the RequestOnConn object from the set. Then Connection.setReceivedPayload would not be able to find it. (It would then silently ignore the received message; whether this is a good thing to do here may be subject to discussion.)

This is the original stack trace:
Although we reproduced the same stack trace on a protocol mismatch and determined the aforementioned chain of events, it originally happened when a client using swarm v4.4.0 was connected to two nodes with a compatible protocol, and one of the nodes was restarted. We still need to figure out how this happens on a node restart. We were unable to reproduce this behaviour with swarm v5.0.3.