Open sjmackenzie opened 11 years ago
I think the first version is better. It does not make sense to me to request heartbeats. We know we'll always need some, so assume the other node wants us to send some.
Failure detection works at the level of single language entities
Why? Why not make failure detection work at the level of nodes? When a node's status changes, then change the status of all entities associated with that node. We don't need an instance of the FD for every entity, do we?
also am I missing anything?
I think your understanding of permFail is wrong. IIRC, permFail is never set by the FD. It can only be set explicitly through the Kill operation. The FD is there only to switch back and forth between 'ok' and 'tempFail'. It should be in Raphaël Collet's thesis. I'll try and find a pointer.
Oh and... One of the critical design goals of Mozart 2 was to be able to write the DSS entirely in Oz. Make sure you do ^^ If you have trouble figuring out how it can be done, let me know. But basically it is supported by the three following core mechanisms:
- TCP connections, in modules OS and/or Open
- Serialization, in module Pickle, but also directly the boot module Serializer
- The reflective layer (doc, work in progress: https://github.com/mozart/mozart2/wiki/Reflective-layer)
I understand the need for fast failure detection, but isn't this very aggressive heartbeating? Possibly too aggressive? Is it not possible to use actor messages as a heartbeat too?
It is possible to skip a heartbeat if a useful message was sent recently, and to treat any useful message received as a heartbeat too.
I do not know how many messages this saves, but IIRC it is indeed a common optimization.
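To make the piggybacking idea concrete, here is a minimal sketch in Python (not Oz, and not the DSS implementation). The class name `HeartbeatLink` and the `interval` and `timeout` parameters are assumptions made up for illustration; time is passed in explicitly so the logic is easy to check.

```python
class HeartbeatLink:
    """Traffic bookkeeping for one remote node (hypothetical helper)."""

    def __init__(self, interval, timeout):
        self.interval = interval    # longest silence we allow on our send side
        self.timeout = timeout      # silence from the peer before we suspect it
        self.last_sent = 0.0
        self.last_received = 0.0

    def on_send(self, now):
        # Any outgoing message, useful or heartbeat, resets the send timer.
        self.last_sent = now

    def on_receive(self, now):
        # Any incoming message is accepted as a heartbeat from the peer.
        self.last_received = now

    def heartbeat_due(self, now):
        # An explicit heartbeat is needed only if nothing was sent recently.
        return now - self.last_sent >= self.interval

    def state(self, now):
        # The detector only toggles between ok and tempFail.
        return "tempFail" if now - self.last_received > self.timeout else "ok"
```

With this shape, a busy link never sends an explicit heartbeat, and an idle link falls back to one heartbeat per `interval`.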
Hi Ozers,
Very interesting discussion. Please find my comments below.
On Fri, Feb 15, 2013 at 7:23 PM, Sébastien Doeraene < notifications@github.com> wrote:
I think the first version is better. It does not make sense to me to request heartbeats. We know we'll always need some, so assume the other node wants us to send some.
Failure detection works at the level of single language entities
Why? Why not make failure detection work at the level of nodes? When a node's status changes, then change the status of all entities associated with that node. We don't need an instance of the FD for every entity, do we?
I agree that failure detection should be node-based rather than per language entity. It will save a lot of bandwidth. It is of course important to clearly define what a node is. I wouldn't do node=machine=ip-based, but node = process. So if you are talking to two Oz processes on the same machine, one of them could crash while the other is still running.
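A rough Python sketch of what "node = process" could look like in a registry. Identifying a process by a `(host, pid)` pair is an assumption made here for illustration, as are the names `NodeRegistry`, `register` and `set_node_state`; the thread does not say how a node should be identified.

```python
from collections import defaultdict


class NodeRegistry:
    """One suspicion state per node, shared by all entities of that node."""

    def __init__(self):
        self.node_state = {}                 # node id -> "ok" | "tempFail"
        self.entities = defaultdict(set)     # node id -> names of its entities

    def register(self, entity, host, pid):
        # Assumption: a node is a process, so the pid is part of its identity.
        node = (host, pid)
        self.node_state.setdefault(node, "ok")
        self.entities[node].add(entity)
        return node

    def set_node_state(self, node, state):
        # One update per node; returns the entities whose status is affected.
        self.node_state[node] = state
        return sorted(self.entities[node])
```

Two processes on the same host get two distinct entries, so one can be suspected while the other stays ok.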
also am I missing anything?
I think your understanding of permFail is wrong. IIRC, permFail is never set by the FD. It can only be set explicitly through the Kill operation. The FD is there only to switch back and forth between 'ok' and 'tempFail'. It should be in Raphaël Collet's thesis. I'll try and find a pointer.
+1 for the FD just switching between ok and tempFail, with permFail only set by the kill operation. Question here: Is the kill operation performed on a node, or on a distributed entity? AFAIR, it could be performed on both, and it would affect all language entities associated with the node.
voilà, cheers Boriss
Oh and... One of the critical design goals of Mozart 2 was to be able to write the DSS entirely in Oz. Make sure you do ^^ If you have trouble figuring out how it can be done, let me know. But basically it is supported by the three following core mechanisms:
- TCP connections, in modules OS and/or Open
- Serialization, in module Pickle, but also directly the boot module Serializer
- The reflective layer (doc, work in progress: https://github.com/mozart/mozart2/wiki/Reflective-layer)
Reply to this email directly or view it on GitHub: https://github.com/mozart/mozart2-vm/issues/4#issuecomment-13620404
Question here: Is the kill operation performed on a node, or on a distributed entity? AFAIR, it could be performed on both, and it would affect all language entities associated with the node.
I'm quite certain you can kill a language entity. I remember having read this in Raphaël's thesis. It might be the case that one can kill a node too, I don't know.
On Sat, Feb 16, 2013 at 10:16 PM, Sébastien Doeraene < notifications@github.com> wrote:
Question here: Is the kill operation performed on a node, or on a distributed entity? AFAIR, it could be performed on both, and it would affect all language entities associated with the node.
I'm quite certain you can kill a language entity. I remember having read this in Raphaël's thesis. It might be the case that one can kill a node too, I don't know.
OK. So, if you have two entities, A and B, from node P, both entities A and B share the same failure stream, and you can monitor both. In addition, you could get a variable associated with the node P from A or from B. All three streams can be monitored.
If you do {Kill A}, the permFail value will appear on the failure stream of A, B, and P.
Do I get this right?
cheers Boriss
If you do {Kill A}, the permFail value will appear on the failure stream of A, B, and P.
No, if you do {Kill A}, permFail will appear on the fault stream of A, but not B. I don't think there exists such a thing as the fault stream of a node.
The FD being node-based does not mean that the fault streams are node-based too. The fault stream of an entity A is derived from two sources of information: the suspicion state of its node, and its own explicit state.
Internally, we have:
- InternalStateOfP is either tempFail or ok.
- InternalStateOfA is either ok, localFail or permFail.
Given these two sources of information, the observable fault state of A (the one appearing at the end of its fault stream) is computed as follows:
case InternalStateOfA
of ok then InternalStateOfP
[] X then X
end
Does that make sense?
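For anyone who wants to poke at the rule, here is the same computation as a small Python sketch, together with the {Kill A} case discussed earlier. The names `observable_state`, `entity_state` and `kill` are hypothetical stand-ins, not part of Mozart.

```python
def observable_state(internal_state_of_a, internal_state_of_p):
    # While the entity itself is ok, it shows the suspicion state of its node.
    if internal_state_of_a == "ok":
        return internal_state_of_p
    # Otherwise its own state (localFail or permFail) takes precedence.
    return internal_state_of_a


# Two entities A and B hosted on the same node P.
entity_state = {"A": "ok", "B": "ok"}


def kill(entity):
    # Stand-in for {Kill A}: only the killed entity's own state changes.
    entity_state[entity] = "permFail"
```

After `kill("A")`, A reports permFail whatever the state of P, while B keeps following P.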
Overall theory of a heartbeat:
Some points I want to clarify:
At least two approaches are available. Note that each approach uses asynchronous I/O (e.g. ZeroMQ or nanomsg):
1) use a single-phase protocol
2) use a two-phase protocol
I believe version two will be slower, as it has to wait for a round trip, and heartbeats could also be lost on the wire. Version one operates on the data at hand, so it should be faster at detecting failures and sending messages.
Please check the logic. Also, am I missing anything?