mozart / mozart2-vm

Mozart2 Virtual Machine

Oz Heartbeating #4

Open sjmackenzie opened 11 years ago

sjmackenzie commented 11 years ago

Overall theory of a heartbeat:

Some points I want to clarify:

Two or more approaches are available. Note that each approach uses asynchronous I/O (e.g. ZeroMQ or nanomsg).

1) single-phase protocol approach

for each remote node, set its delay to say 3 secs to allow for initial TCP connections
queue a heartbeat message to each remote node
send queue
spawn an oz thread for each node and loop infinitely
     sleep for the time delay associated with the remote node
     poll for heartbeats from that node
     if poll is empty
          increase node delay
          if node delay increased X times
               flag remote node with permFail state
          else
               flag remote node with tempFail state
          queue heartbeat message to remote node
     else
          flag remote node alive state
          decrease node delay (factoring in how many heartbeats arrived)
          queue actor/language_entity messages to remote node
          queue heartbeat message to remote node
     send queue
     poll for actor messages and process if any
     loop
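
In Oz, the per-node loop of version one could be sketched roughly as below. SendHeartbeat, PollHeartbeats and FlagState are hypothetical hooks standing in for the transport (ZeroMQ/nanomsg) and the fault-state bookkeeping, the retry limit and the doubling/halving of the delay are only illustrative choices, and the actor-message queuing/polling steps are left out:

% Single-phase monitor: one Oz thread per remote node.
% SendHeartbeat, PollHeartbeats (returns the list of heartbeats received
% from Node since the last poll) and FlagState are hypothetical hooks.
proc {MonitorNode Node SendHeartbeat PollHeartbeats FlagState}
   DelayMs = {NewCell 3000}     % start at 3 s to allow for TCP setup
   Misses = {NewCell 0}         % consecutive empty polls
   MaxMisses = 5                % "X times" from the pseudocode (assumed value)
   proc {Loop}
      {Delay @DelayMs}
      case {PollHeartbeats Node}
      of nil then                        % poll is empty
         DelayMs := @DelayMs * 2
         Misses := @Misses + 1
         if @Misses >= MaxMisses then
            {FlagState Node permFail}
         else
            {FlagState Node tempFail}
         end
      [] _|_ then                        % at least one heartbeat arrived
         {FlagState Node alive}
         Misses := 0
         DelayMs := if @DelayMs > 3000 then @DelayMs div 2 else 3000 end
      end
      {SendHeartbeat Node}
      {Loop}
   end
in
   {SendHeartbeat Node}
   thread {Loop} end
end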

2) two-phase protocol approach

for each remote node, set its delay to say 3 secs to allow for initial TCP connections
queue a heartbeat message to each remote node
send queue
spawn an oz thread for each node and loop infinitely
     sleep for the time delay associated with the remote node
     poll for heartbeat responses from that node
     if poll is empty
          increase node delay
          if node delay increased X times
               flag remote node with permFail state
          else
               flag remote node with tempFail state
          queue heartbeat message to remote node
     else
          flag remote node alive state
          decrease node delay
          queue actor messages to remote node
          queue heartbeat message to remote node
     send queue
     poll for actor messages and process if any
     loop
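
For contrast, the round trip that version two pays for can be made explicit with a tiny probe. SendPing and PollAcks are hypothetical transport hooks, the fixed wait of one heartbeat period is an assumption, and the sequence number only serves to match a response to the request that caused it:

% Two-phase probe: the node counts as alive only if it answered our own ping.
fun {ProbeOnce Node Seq SendPing PollAcks}
   {SendPing Node ping(seq:Seq)}
   {Delay 3000}                        % wait roughly one heartbeat period
   {Member ack(seq:Seq) {PollAcks Node}}
end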

I believe version two will be slower, as it has to wait for a round trip, and heartbeats could be lost on the wire. Version one operates on the data at hand, so it detects failure and sends messages faster.

Please check the logic; also am I missing anything?

sjrd commented 11 years ago

I think the first version is better. It does not make sense to me to request heartbeats. We know we'll always need some, so assume the other node wants us to send some.

Failure detection works at the level of single language entities

Why? Why not make failure detection work at the level of nodes? When a node's status changes, then change the status of all entities associated with that node. We don't need an instance of the FD for every entity, do we?
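
As a sketch of that reading: keep a table from node to the language entities registered on it, and fan a node-level status change out to every entity. The names below (RegisterEntity, NodeStateChanged, SetEntityState) are hypothetical, and node ids are assumed to be atoms or names so they can serve as dictionary keys:

declare
NodeEntities = {NewDictionary}     % node id -> list of entities registered on it

proc {RegisterEntity Node Entity}
   {Dictionary.put NodeEntities Node
    Entity|{Dictionary.condGet NodeEntities Node nil}}
end

% When the FD changes its mind about Node, push the new state to every
% entity on it; SetEntityState is a stand-in for whatever appends to an
% entity's fault stream.
proc {NodeStateChanged Node NewState SetEntityState}
   for E in {Dictionary.condGet NodeEntities Node nil} do
      {SetEntityState E NewState}
   end
end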

also am I missing anything?

I think your understanding of permFail is wrong. IIRC, permFail is never set by the FD. It can only be set explicitly through the Kill operation. The FD is there only to switch back and forth between 'ok' and 'tempFail'. It should be in Raphaël Collet's thesis. I'll try and find a pointer.

Oh and... One of the critical design goals of Mozart 2 was to be able to write the DSS entirely in Oz. Make sure you do ^^ If you have trouble figuring out how it can be done, let me know. But basically it is supported by the following three core mechanisms:

  • TCP connections, in modules OS and/or Open
  • Serialization, in module Pickle, but also directly the boot module Serializer
  • The reflective layer (doc, work in progress: https://github.com/mozart/mozart2/wiki/Reflective-layer)
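
For instance, the first mechanism boils down to something like the classic Open.socket API (whether Mozart 2's Open module already exposes all of this is a separate question; the host and port below are made up):

% Opening a TCP connection to a peer node and pushing a heartbeat over it.
declare
Sock = {New Open.socket init}
{Sock connect(host:"peer.example.org" port:9000)}
{Sock write(vs:"heartbeat\n")}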

sjmackenzie commented 11 years ago

I understand the need for fast failure detection, but isn't this very aggressive heartbeating? Possibly too aggressive? Is it not possible to use actor messages as a heartbeat too?

sjrd commented 11 years ago

It is possible to avoid sending a heartbeat if a useful message was sent not long ago, and to accept a useful message received as also being a heartbeat.

I do not know whether many messages are avoided by doing so, but IIRC it is a common optimization, indeed.
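
A sketch of that optimization: keep, per node, a flag that the normal send path sets whenever any application message goes out, and only emit an explicit heartbeat when the previous period was silent (symmetrically, anything received from the node counts as a heartbeat when polling). SentSomething and SendHeartbeat are hypothetical names:

% Called once per heartbeat period for each remote node.
proc {MaybeHeartbeat Node SentSomething SendHeartbeat}
   if {Not @SentSomething} then
      {SendHeartbeat Node}    % nothing useful was sent: heartbeat explicitly
   end
   SentSomething := false     % reset for the next period
end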

bmejias commented 11 years ago

Hi Ozers,

Very interesting discussion. Please find my comments below.

On Fri, Feb 15, 2013 at 7:23 PM, Sébastien Doeraene <notifications@github.com> wrote:

I think the first version is better. It does not make sense to me to request heartbeats. We know we'll always need some, so assume the other node wants us to send some.

Failure detection works at the level of single language entities

Why? Why not make failure detection work at the level of nodes? When a node's status changes, then change the status of all entities associated with that node. We don't need an instance of the FD for every entity, do we?

I agree that the failure detection should be node-based instead of per language entity. It will save a lot of bandwidth. It is of course important to clearly define what a node is. I wouldn't do node = machine = IP-based, but node = process. So if you are talking to two Oz processes on the same machine, one of them could crash while the other is still running.
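
So a node identifier would have to carry more than the address; purely as an illustration, something like:

% Hypothetical node identity: host alone is not enough, since two Oz
% processes on the same machine must be told apart.
ThisNode = node(host:"192.168.1.7" port:9000 pid:4242)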

also am I missing anything?

I think your understanding of permFail is wrong. IIRC, permFail is never set by the FD. It can only be set explicitly through the Kill operation. The FD is there only to switch back and forth between 'ok' and 'tempFail'. It should be in Raphaël Collet's thesis. I'll try and find a pointer.

+1 for the FD just switching between ok and tempFail, and permFail only being set by the Kill operation. Question here: is the Kill operation performed on a node, or on a distributed entity? AFAIR, it could be performed on both, and it would affect all language entities associated with the node.

voilà, cheers Boriss

sjrd commented 11 years ago

Question here: is the Kill operation performed on a node, or on a distributed entity? AFAIR, it could be performed on both, and it would affect all language entities associated with the node.

I'm quite certain you can kill a language entity. I remember having read this in Raphaël's thesis. It might be the case that one can kill a node too, I don't know.

bmejias commented 11 years ago

On Sat, Feb 16, 2013 at 10:16 PM, Sébastien Doeraene <notifications@github.com> wrote:

Question here: is the Kill operation performed on a node, or on a distributed entity? AFAIR, it could be performed on both, and it would affect all language entities associated with the node.

I'm quite certain you can kill a language entity. I remember having read this in Raphaël's thesis. It might be the case that one can kill a node too, I don't know.

OK. So, if you have two entities, A and B, from node P, both entities A and B share the same failure stream, and you can monitor both. As an extra, you could get a variable associated with the node P from A or from B. All three streams can be monitored.

If you do {Kill A}, the permFail value will appear on the failure stream of A, B, and P.

Do I get this right?

cheers Boriss

sjrd commented 11 years ago

If you do {Kill A}, the permFail value will appear on the failure stream of A, B, and P.

No, if you do {Kill A}, permFail will appear on the fault stream of A, but not B. I don't think there exists such a thing as the fault stream of a node.

It's not because the FD is node-based that the fault streams are node-based too. The fault stream of an entity A is derived from two sources of information: the suspicion state of its node, and its own explicitly set state.

Internally, we have two values: InternalStateOfP, the suspicion state of A's node P as maintained by the FD, and InternalStateOfA, the state set explicitly on A itself (e.g. through Kill).

Given these two sources of information, the observable fault state of A (the state appearing at the end of its fault stream) is computed as follows:

case InternalStateOfA
of ok then InternalStateOfP
[] X then X
end
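
Or, wrapped as a function, with a couple of sample results spelled out:

% Same rule as a function. For example:
%   {ObservableState ok ok}       == ok
%   {ObservableState ok tempFail} == tempFail   % node suspected by the FD
%   {ObservableState permFail ok} == permFail   % explicit state wins
fun {ObservableState InternalStateOfA InternalStateOfP}
   case InternalStateOfA
   of ok then InternalStateOfP
   [] X then X
   end
end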

Does that make sense?