sociomantic-tsunami / swarm

Asynchronous client/node framework library
Boost Software License 1.0

Terminated RequestOnConn fiber resumed on unexpected message #309

Closed: david-eckardt-sociomantic closed this issue 6 years ago

david-eckardt-sociomantic commented 6 years ago

On a protocol mismatch, the following situation can happen for an all-nodes request:

1. The request on conn for one node finishes (here because of the protocol error) and its fiber terminates, but RequestOnConnSet.finished leaves the RequestOnConn object in the set.
2. The request as a whole stays active, because its request on conns for the other nodes are still running.
3. The node sends a further, unexpected message on that connection. Connection.setReceivedPayload looks the RequestOnConn up in the set, finds it, and tries to resume its terminated fiber, hitting the assertion "attempt to resume a non-held fiber".

The obvious solution would be to make RequestOnConnSet.finished remove the RequestOnConn object from the set. Then Connection.setReceivedPayload would not be able to find it. (It would then silently ignore the received message; whether this is a good thing to do here may be subject to discussion.)
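For illustration, here is a minimal sketch of how the remove-on-finish idea could look. The names, the key type (a string node address), and the overall shape are hypothetical, not the actual RequestOnConnSet API:

```d
// Hypothetical sketch: a set of request on conns keyed by node address,
// where `finished` removes the entry so that a late message for that
// connection can no longer be dispatched to a terminated fiber.
class RequestOnConn
{
    // The request handler fiber would live here.
}

class RequestOnConnSetSketch
{
    private RequestOnConn[string] rocs; // hypothetical key: node address

    void add ( string node_address, RequestOnConn roc )
    {
        this.rocs[node_address] = roc;
    }

    // Called when the ROC's fiber has terminated. Removing the entry means
    // a later lookup for this connection finds nothing.
    void finished ( string node_address )
    {
        this.rocs.remove(node_address);
    }

    // Used when a message arrives; a null result means the payload is
    // silently ignored.
    RequestOnConn get ( string node_address )
    {
        if ( auto roc = node_address in this.rocs )
            return *roc;
        return null;
    }
}
```

With this shape, the code handling a received payload could check for a null result and drop the message, which is the silent-ignore behaviour mentioned above.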

This is the original stack trace:

core.exception.AssertError@./submodules/swarm/src/swarm/neo/util/MessageFiber.d(559): attempt to resume a non-held fiber                                                                                                                                                                                                      
src/core/exception.d:438 _d_assert_msg
./submodules/swarm/src/swarm/neo/util/MessageFiber.d:559 _D5swarm3neo4util12MessageFiber17OceanMessageFiber6resumeMFS5swarm3neo4util12MessageFiber17OceanMessageFiber5TokenC6ObjectS5ocean4core10SmartUnion75__T10SmartUnionTS5swarm3neo4util12MessageFiber17OceanMessageFiber8Message_Z10SmartUnionZ9__requireMFZv
./submodules/swarm/src/swarm/neo/util/MessageFiber.d:558 ocean.core.SmartUnion.SmartUnion!(swarm.neo.util.MessageFiber.OceanMessageFiber.Message_).SmartUnion swarm.neo.util.MessageFiber.OceanMessageFiber.resume(swarm.neo.util.MessageFiber.OceanMessageFiber.Token, Object, ocean.core.SmartUnion.SmartUnion!(swarm.neo.util.MessageFiber.OceanMessageFiber.Message_).SmartUnion)
./submodules/swarm/src/swarm/neo/connection/RequestOnConnBase.d:1734 void swarm.neo.connection.RequestOnConnBase.RequestOnConnBase.resumeFiber_(ocean.core.SmartUnion.SmartUnion!(swarm.neo.util.MessageFiber.OceanMessageFiber.Message_).SmartUnion)
./submodules/swarm/src/swarm/neo/connection/RequestOnConnBase.d:1779 void swarm.neo.connection.RequestOnConnBase.RequestOnConnBase.setReceivedPayload(const(void)[])
./submodules/swarm/src/swarm/neo/client/RequestOnConn.d:603 void swarm.neo.client.RequestOnConn.RequestOnConn.setReceivedPayload(const(void)[])
./submodules/swarm/src/swarm/neo/client/Connection.d:639 void swarm.neo.client.Connection.Connection.setReceivedPayload(ulong, const(void)[])
./submodules/swarm/src/swarm/neo/connection/ConnectionBase.d:651 void swarm.neo.connection.ConnectionBase.ConnectionBase.ReceiveLoop.receivedMessage(swarm.neo.protocol.Message.MessageType, const(void)[])
./submodules/swarm/src/swarm/neo/protocol/socket/MessageReceiver.d:214 void swarm.neo.protocol.socket.MessageReceiver.MessageReceiverBase.receive(lazy ocean.sys.Epoll.epoll_event_t.Event, scope void delegate(swarm.neo.protocol.Message.MessageType, const(void)[]), bool)
./submodules/swarm/src/swarm/neo/connection/ConnectionBase.d:603 void swarm.neo.connection.ConnectionBase.ConnectionBase.ReceiveLoop.fiberMethod()
src/core/thread.d:4277 void core.thread.Fiber.run()
src/core/thread.d:3523 fiber_entryPoint

Although we reproduced the same stack trace on a protocol mismatch and determined the aforementioned chain of events, it originally happened when a client using swarm v4.4.0 was connected to two nodes with a compatible protocol, and one of the nodes was restarted. We still need to figure out how this happens on a node restart. We were unable to reproduce this behaviour with swarm v5.0.3.

gavin-norman-sociomantic commented 6 years ago

For reference: RequestOnConnSet.

gavin-norman-sociomantic commented 6 years ago

> The obvious solution would be to make RequestOnConnSet.finished remove the RequestOnConn object from the set. Then Connection.setReceivedPayload would not be able to find it. (It would then silently ignore the received message; whether this is a good thing to do here may be subject to discussion.)

I think it'd be nicer to mark ROCs in the set as finished, then throw a protocol error if get looks one of them up.
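A minimal sketch of this variant, under the same hypothetical names and key type as the sketch in the issue description (not the real swarm types): the entry stays in the set, and looking up a finished ROC raises an error instead of handing back an object whose fiber has terminated.

```d
// Hypothetical sketch: entries are flagged as finished rather than
// removed; `get` treats a lookup of a finished ROC as a protocol error.
class ProtocolErrorSketch : Exception
{
    this ( string msg ) { super(msg); }
}

class MarkingRequestOnConnSet
{
    private static struct Entry
    {
        RequestOnConn roc;   // as in the sketch in the issue description
        bool finished;
    }

    private Entry[string] entries; // hypothetical key: node address

    void add ( string node_address, RequestOnConn roc )
    {
        this.entries[node_address] = Entry(roc, false);
    }

    // Keep the entry, but remember that its fiber has terminated.
    void finished ( string node_address )
    {
        if ( auto entry = node_address in this.entries )
            entry.finished = true;
    }

    RequestOnConn get ( string node_address )
    {
        auto entry = node_address in this.entries;
        if ( entry is null )
            return null;
        if ( entry.finished )
            throw new ProtocolErrorSketch(
                "message received for a finished request on conn");
        return entry.roc;
    }
}
```

Compared to the remove-on-finish sketch, the finished flag keeps the ROC reachable until the whole request ends, while still refusing to resume a terminated fiber.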

gavin-norman-sociomantic commented 6 years ago

> We were unable to reproduce this behaviour with swarm v5.0.3.

Any idea why that might be?

nemanja-boric-sociomantic commented 6 years ago

I think this is only because of the race condition. If all nodes happen to finish before the first node sends an additional message, this will not happen, since the entire request will be removed from the set of active requests.

We're able to reproduce it reliably with neo-beta-3 because the protocol mismatch during the handshake makes this situation happen every time.

nemanja-boric-sociomantic commented 6 years ago

> I think it'd be nicer to mark ROCs in the set as finished, then throw a protocol error if get looks one of them up.

I agree, but we should be careful about fiber leaks here. What if we have a node where the request on conn exits due to a node error every time and can never be continued, while we have a long-lived request on the other nodes? That request on conn would live forever, and if we spawn many such requests, we'll end up with a problem.

I admit it's a bit of a theoretical assumption, though.

nemanja-boric-sociomantic commented 6 years ago

> We still need to figure out how this happens on a node restart.

My assumption is that we have a similar situation, just with the node_disconnected message. If something happens to be in the connection buffer, there might be a race condition (I'm not sure where, though) where the receiver loop still dispatches the message but the RequestOnConn has finished (maybe the request on conn dies first, and then we shut down the receiver loop?), and the entire request is still alive because other nodes are connected.

david-eckardt-sociomantic commented 6 years ago

> The obvious solution would be to make RequestOnConnSet.finished remove the RequestOnConn object from the set. Then Connection.setReceivedPayload would not be able to find it. (It would then silently ignore the received message; whether this is a good thing to do here may be subject to discussion.)

> I think it'd be nicer to mark ROCs in the set as finished, then throw a protocol error if get looks one of them up.

Why would that be better?

gavin-norman-sociomantic commented 6 years ago

I guess I was thinking that even when an ROC is finished, it's still part of the request, so it makes sense to keep it in the set. That's more of a conceptual point, though, so it shouldn't override actual code concerns if it's awkward to implement that way.

david-eckardt-sociomantic commented 6 years ago

I'm asking because there might be a reason why finished ROCs are currently left in the ROC set until the whole request is finished. It is unfortunately not documented (my bad). Could it be because they shouldn't be returned to the pool too early? :thinking:

nemanja-boric-sociomantic commented 6 years ago

One place where we need to be careful is reusing the request id. If you get the ROC from the pool for a new request, will swarm use its (recycled) id?
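To illustrate the concern with purely hypothetical names (this is not how swarm's pool actually works, just a sketch of the hazard): if a recycled object kept the id of the request it previously belonged to, a late message addressed to that id could match the new request; assigning a fresh id on acquisition avoids that.

```d
// Hypothetical sketch of the id-reuse hazard in a simple object pool.
struct RequestId
{
    ulong value;
}

class PooledRoc
{
    RequestId id;
}

class RocPoolSketch
{
    private PooledRoc[] free_list;

    PooledRoc acquire ( RequestId new_id )
    {
        PooledRoc roc;
        if ( this.free_list.length )
        {
            roc = this.free_list[$ - 1];
            this.free_list = this.free_list[0 .. $ - 1];
        }
        else
        {
            roc = new PooledRoc;
        }

        // Overwrite whatever id was left over from the previous request;
        // keeping the recycled id would let a stale message for the old
        // request still match this object.
        roc.id = new_id;
        return roc;
    }

    void recycle ( PooledRoc roc )
    {
        this.free_list ~= roc;
    }
}
```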