Timeout for the connection (low)(in)activity in getRange

sociomantic-tsunami / dlsproto

Distributed Log Store protocol definition, client, fake node, and tests

Boost Software License 1.0


Timeout for the connection (low)(in)activity in getRange #40

Open nemanja-boric-sociomantic opened 6 years ago

nemanja-boric-sociomantic commented 6 years ago

The legacy clients implement an algorithm for making sure that all nodes are alive and sending traffic. It works by monitoring when each individual node finishes running the request and observing how long the remaining nodes take to finish. When the client sees that an individual node needs a long time to complete the request, it "times out" and stops the request.
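As a rough illustration of the legacy approach described above (not the actual legacy client code), the tracking could look like the following sketch. The class name `FinishMonitor`, its methods, and the `grace_seconds` parameter are all hypothetical, and timestamps are passed in explicitly to keep the example deterministic.

```python
class FinishMonitor:
    """Tracks when each node finishes and flags nodes that lag behind.

    Hypothetical sketch: names and the grace-period rule are assumptions,
    not taken from the legacy client.
    """

    def __init__(self, nodes, grace_seconds):
        self.pending = set(nodes)           # nodes still running the request
        self.grace_seconds = grace_seconds  # allowed lag behind the last finisher
        self.last_finish = None             # time the most recent node finished

    def node_finished(self, node, now):
        """Record that `node` completed the request at time `now`."""
        self.pending.discard(node)
        self.last_finish = now

    def timed_out_nodes(self, now):
        """Return the nodes still pending after the grace period has elapsed
        since the most recent node finished."""
        if self.last_finish is None or not self.pending:
            return set()
        if now - self.last_finish > self.grace_seconds:
            return set(self.pending)
        return set()
```

The drawback this issue is about is visible here: the client has to know about every node and compare them against each other, instead of judging each connection on its own.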

Ideally, the client should have a means of aborting the request on a connection that seems stale, without needing to track the behaviour of the individual nodes.

gavin-norman-sociomantic commented 6 years ago

Timing out an individual request-on-conn has two aspects:

Like a full request timeout, the client could simply ditch the RoC and ignore any future incoming messages for it.
It would also be nice to send a message to the node, telling it to stop. (This would handle the case of active but slow connections.)

I'm not sure how these two things would work together.

gavin-norman-sociomantic commented 6 years ago

It seems like the simplest way to implement this is as a kind of automated, per-RoC stop, using the standard Stop protocol.
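A minimal sketch of how the two aspects might fit together in an automated per-RoC stop: on timeout, the RoC is marked as abandoned so that late messages are dropped, and a stop message is sent to the node once. Everything here is an assumption for illustration: `RequestOnConn`, `RocState`, the `send` callable and the `"stop"` payload are hypothetical stand-ins and do not reflect the real swarm Stop protocol messages.

```python
from enum import Enum


class RocState(Enum):
    ACTIVE = "active"
    ABANDONED = "abandoned"


class RequestOnConn:
    """Hypothetical per-connection request handler (not the swarm API)."""

    def __init__(self, send):
        self.send = send      # assumed callable that writes a message to the node
        self.state = RocState.ACTIVE
        self.records = []     # payloads accepted while the RoC was active

    def time_out(self):
        """Abandon this RoC and ask the node to stop; only acts the first time."""
        if self.state is RocState.ACTIVE:
            self.state = RocState.ABANDONED
            self.send("stop")  # placeholder for the real Stop message

    def on_message(self, payload):
        """Return True if the payload was accepted, False if it was ignored."""
        if self.state is RocState.ABANDONED:
            return False
        self.records.append(payload)
        return True
```

Sending the stop covers a connection that is active but slow, while the local state change covers a node that never answers at all.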

gavin-norman-sociomantic commented 6 years ago

As for the value of the timeout, we discussed starting out with a non-configurable sanity-check timeout (say, 60s). We can tweak this or make it configurable later, if needed.
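For illustration only, a fixed sanity-check timeout of this kind could be tracked per RoC roughly as follows. The constant name, the `InactivityWatchdog` class and its methods are hypothetical, and the sketch only covers complete inactivity, not the "low activity" case from the issue title.

```python
# Assumed fixed value from the discussion above; not configurable in this sketch.
ROC_INACTIVITY_TIMEOUT_S = 60.0


class InactivityWatchdog:
    """Hypothetical timer that expires when a RoC has seen no traffic for too long."""

    def __init__(self, now, timeout_s=ROC_INACTIVITY_TIMEOUT_S):
        self.timeout_s = timeout_s
        self.last_activity = now  # time of the last message on this RoC

    def on_activity(self, now):
        """Call whenever a message arrives on this RoC."""
        self.last_activity = now

    def expired(self, now):
        """True once at least timeout_s seconds have passed without activity."""
        return now - self.last_activity >= self.timeout_s
```

Keeping the timeout as a constructor argument with a fixed default would make it straightforward to expose as configuration later.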

gavin-norman-sociomantic commented 6 years ago

This would help: https://github.com/sociomantic-tsunami/swarm/issues/361