tomcucinotta / distwalk

Distributed processing emulation tool
GNU General Public License v3.0

Timeout & Retransmit for FORWARD #34

Closed tomcucinotta closed 6 months ago

tomcucinotta commented 9 months ago

Currently, a FORWARD is effectively an RPC to another dw_node that waits indefinitely for a reply. Real services would not wait forever; they would set a timeout within which the reply must arrive. Once the timeout expires, the node should be able to either fail that single request towards the client (no need to close the connection in this case) or retry up to a specified number of times. This could be added as an optional argument to the FORWARD command, or as an independent command that is more general (e.g., applicable to other commands or to a sequence of them, with the timeout firing if all computations up to some point in the request do not complete in time). What exactly happens to the in-progress workflow that did not complete in time is still to be defined (e.g., aborting it might not be trivial).

Example 1:

TIMEOUT 50us, num_retries=0
FORWARD....
...
REPLY

If the reply to the FORWARD does not arrive within 50us, the request is failed towards the client (a REPLY arriving after the timeout has fired is ignored by dw_node).

Example 2:

TIMEOUT 50us, num_retries=1
FORWARD....
...
REPLY

If the reply does not arrive within 50us, the FORWARD is sent again; this time the request fails if the reply does not arrive within a further 50us.
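The semantics of Examples 1 and 2 can be sketched in a few lines of Python. This is only an illustration of the proposed behaviour, not dw_node code: `forward_with_retries`, `send_and_wait` and `make_flaky_peer` are hypothetical names, and the peer is simulated by a callable that returns `None` when the timeout would have fired.

```python
from typing import Callable, Optional, Tuple


def forward_with_retries(send_and_wait: Callable[[float], Optional[str]],
                         timeout_us: float,
                         num_retries: int = 0) -> Optional[str]:
    """Try a FORWARD once, plus up to num_retries more times.

    send_and_wait(timeout_us) is assumed to send the request and return
    the reply, or None if no reply arrived within timeout_us.
    Returns the first reply received, or None if every attempt timed out
    (i.e. the request is failed towards the client).
    """
    for _attempt in range(1 + num_retries):
        reply = send_and_wait(timeout_us)
        if reply is not None:
            return reply
    return None


def make_flaky_peer(timeouts_before_reply: int) -> Tuple[Callable[[float], Optional[str]], dict]:
    """Simulated peer (hypothetical): times out a fixed number of times, then replies."""
    state = {"calls": 0}

    def send_and_wait(timeout_us: float) -> Optional[str]:
        state["calls"] += 1
        if state["calls"] <= timeouts_before_reply:
            return None  # pretend the timeout fired
        return "REPLY"

    return send_and_wait, state


if __name__ == "__main__":
    # Example 1: num_retries=0, the first attempt times out, the request fails.
    peer, _ = make_flaky_peer(1)
    print(forward_with_retries(peer, 50, num_retries=0))
    # Example 2: num_retries=1, the second attempt gets the reply.
    peer, _ = make_flaky_peer(1)
    print(forward_with_retries(peer, 50, num_retries=1))
```

With `num_retries=1` the worst-case wait before failing the request is two timeout periods (100us here), which matches the "further 50us" in Example 2.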

Not sure whether, for the retry, we should allow specifying an alternate endpoint for the FORWARD. One way might be to realize the retry as a conditional SKIP:

1 TIMEOUT 50us, skip_on_fail=2
2 FORWARD ip1....
    ...
   REPLY
3 SKIP 2
4 TIMEOUT 50us
5 FORWARD ip2....
    ...
   REPLY

This would be executed as an RPC to ip1. If it succeeds, 3 SKIP skips the subsequent 2 commands (4 TIMEOUT and 5 FORWARD). Otherwise, if the timeout fires, the 2 commands 2 FORWARD...REPLY and 3 SKIP are skipped and execution jumps to 4 TIMEOUT, which retries for another 50us with a FORWARD to a different endpoint ip2, this time failing the request if the REPLY does not arrive in time.
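The control flow described above can be checked with a toy interpreter. The sketch below is an assumption-laden model, not the dw_node implementation: the tuple encoding of commands, the `run` function and the `replying_peers` set are all hypothetical. It also assumes that a TIMEOUT arms only the next FORWARD, and that `skip_on_fail=n` skips n commands counting from that FORWARD.

```python
from typing import List, Optional, Set, Tuple

# Commands 1..5 from the listing above, stored at list indices 0..4.
SCRIPT = [
    ("TIMEOUT", {"us": 50, "skip_on_fail": 2}),  # 1
    ("FORWARD", "ip1"),                          # 2 (FORWARD ... REPLY)
    ("SKIP", 2),                                 # 3
    ("TIMEOUT", {"us": 50}),                     # 4
    ("FORWARD", "ip2"),                          # 5 (FORWARD ... REPLY)
]


def run(commands: list, replying_peers: Set[str]) -> Tuple[str, List[str]]:
    """Execute the script; return (outcome, endpoints contacted in order).

    A FORWARD to an endpoint not in replying_peers is treated as a timeout.
    """
    pc = 0
    armed: Optional[dict] = None  # set by the most recent TIMEOUT
    contacted: List[str] = []
    while pc < len(commands):
        op, arg = commands[pc]
        if op == "TIMEOUT":
            armed = arg
            pc += 1
        elif op == "SKIP":
            pc += 1 + arg          # step over the next `arg` commands
        elif op == "FORWARD":
            contacted.append(arg)
            skip = armed.get("skip_on_fail") if armed else None
            armed = None           # a TIMEOUT covers a single FORWARD
            if arg in replying_peers:
                pc += 1            # reply arrived in time
            elif skip is None:
                return "FAILED", contacted  # no fallback: fail the request
            else:
                pc += skip         # skip the FORWARD itself and what follows
        else:
            raise ValueError(f"unknown command {op!r}")
    return "OK", contacted


if __name__ == "__main__":
    print(run(SCRIPT, {"ip1", "ip2"}))  # ip1 replies, ip2 is never contacted
    print(run(SCRIPT, {"ip2"}))         # ip1 times out, retry goes to ip2
    print(run(SCRIPT, set()))           # both time out, request fails
```

One consequence of encoding the retry this way is that the fallback endpoint is an ordinary command, so a chain of alternates can be built by repeating the TIMEOUT/FORWARD/SKIP pattern, at the cost of keeping the skip counts consistent by hand.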

This feature might be combined with the multi-forward fork/join from #8. In that case the abort policy seems easy, whereas the retransmit might need some design clarification.

tomcucinotta commented 6 months ago

Addressed by the various commits above, closing.