Open BenediktBurger opened 1 year ago
Do we want to differentiate between STATUS
requests out of "interest" for the status, and because of checking for life signs (HEALTHCHECK
? POKE
?) If yes, receiving a HEALTHCHECK
would signal to a Component that it is not sending heartbeats often enough, and it should increase the hearbeat frequency.
We will need a way to configure/coordinate desired heartbeat intervals. For later maybe: Probably some random jitter is a good idea so that we don't get "bursts" of heartbeats over our connections around every interval. There are known algorithms to deal with that.
I was previously lamenting that we lose message/response symmetry if we answer every message with a heartbeat. However, I think I found a way to deal with this conceptually: Heartbeats never "count" when tallying messages. With that idea I'm more comfortable with the idea of replying to every nonempty message with an empty one. Heartbeats also maybe won't be shown in sequence diagrams not specifically concerned with heartbeat mechanics, to not clutter up everything.
We could call it "Ping", if you desire a life sign.
If we have that fixed rule (respond empty), we can leave them out as they do not contain content and are easy to understand.
We will need a way to configure/coordinate desired heartbeat intervals.
I like the idea of the Zmq manual, which states, that the clients (not the server) know best their connection and heartbeat requirements. So my proposal is, that we have some lax heartbeat interval on the Coordinator (around 1 to 10 seconds), which is also the default of Components. Whoever needs a higher interval, may send heartbeats more often (and acquire thus replies as well).
An additional question: how empty shall an empty message be? As heartbeats are just between directly connected peers (that is a Coordinator and a Component or another Coordinator), the empty message could consist in a single, empty frame, without sender or recipient information: "||".
The Component understands: My connection peer (Coordinator) is still alive. The Coordinator receives the connection identity and knows, that the peer at the end of that connection is alive.
"PING": sounds good. Clients determine frequency: sounds good.
Empty frame heartbeat: I agree with your reasoning. Can we do logging well in that case? (It might be convenient to log heartbeats at debug level, for example)
I guess the exception in the message format for heartbeats does not unduly complicate things?
I guess the exception in the message format for heartbeats does not unduly complicate things?
You add a first check in the Coordinator:
identity, message = socket.recv_multipart()
handle_heartbeat(identity)
if message == [b""]:
return
In a Component you have:
while True:
msg = socket.recv_multipart()
if msg != [b""]:
break
Empty frame heartbeat: I agree with your reasoning. Can we do logging well in that case? (It might be convenient to log heartbeats at debug level, for example)
The Coordinator can log the identity (which has an entry in the directory) and a Component can log that it received a heartbeat of its Coordinator.
We drop that automatic response (which adds complexity in the message routing, as you have to decide, whether you will send an answer later, maybe an error message, or directly respond).
The idea:
If the answer is not an empty message, but a full fledged message with command PONG, you could PING some Component via Coordinators (e.g. from N1.CA to N2.CB) as well.
Instead of PONG, we could use a contentless message (e.g. V|Recipient|Sender|H
, in contrast to an empty message ||
).
A Coordinator expects a heartbeat interval of 700 ms. A Component uses a heartbeat interval of 1000 ms. Thus, after a while, the component will send a PING, which the Component answers. If there is no change, the same situation will recur again. If, on the other hand the Component says "I received a PING, so my heartbeat interval must be too long for the Coordinator's comfort, let's drop it by (e.g.) 10%", both will align naturally over time/a couple of iterations and avoid further PINGs. It's always the recipient of the PING that should adjust the interval, because the sender does not have a clear signal, as it does not know if the PING target is actually still alive (but the PING recipient knows that).
by @bilderbuchi in https://github.com/pymeasure/leco-protocol/pull/38/files#r1101914525
suggestion: After not having received a heartbeat from a peer for a chosen heartbeat interval, a Component should and a Coordinator shall send a PING and wait for a PONG response for 3 heartbeat intervals before considering a connection dead.
by @bilderbuchi in https://github.com/pymeasure/leco-protocol/pull/38/files#r1100683377
Regarding the 3 hearbeat_intervals:
I'd separate the "_heartbeatinterval" (time between two heartbeats) and the "_expirationtime", time when you get suspicious, that the other side is dead. Obviously, the second value should be larger the first value. Lastly we have a "_heartbeat_checkinterval", the time between two checks, whether a connection expired or not.
Did you mean, that the _expirationtime is 3 times _heartbeatinterval, @bilderbuchi ?
my idea is:
now = get_current_time()
last_message = get_time_of_last_message_of_Component()
if now > last_message + expiration_time + heartbeat_check_interval:
delete_connection()
elif now > last_message + expiration_time:
send_ping()
Isn't that too complicated? I'd rather not configure/keep track of 3 different time intervals, to be honest.
Assuming that your code snippet is part of an event loop, what about:
# "pinged_at" is 0 in the beginning
now = get_current_time()
last_message = get_time_of_last_message_of_Component()
pinged_since_last_contact = pinged_at > last_message
# this _accurately_ tolerates one heartbeat interval after the ping:
if pinged_since_last_contact and (now > pinged_at + heartbeat_interval):
delete_connection()
if now > last_message + heartbeat_interval and not pinged_since_last_contact:
send_ping()
pinged_at = now
I'd say forget about the 3 intervals. I suspect I based this on an now outdated notion of the reconnection process. What I wanted to achieve is that the heartbeat-ping mechanism leaves enough time between one Component realising the Coordinator is dead (restarting) for all Components being reconnected, before declaring failure. That could make for a smoother reconnection process. Let's revisit that when we have the first iteration done.
pinged_at + heartbeat_interval == last_message + 2 heartbeat_interval
, so we do not have to store that additional information.
Basically you set all three variables to the same value?
I would not set the expiration time to the heartbeat_interval: Maybe the heartbeat will arive immediately later.
We could set the _expirationtime to some factor (1.1 for not a lot or 2), so : now > last_message + heartbeat_interval * factor
(factor is going to be defined).
This reduces to my original proposal, but defining the variables in terms of heartbeat_interval:
if now > last_message + 3 * heartbeat_interval:
delelete_connection()
elif now > last_message + 2 * heartbeat_interval:
send_ping()
with the _heartbeat_checkinterval I did not want to check the heartbeats expiration between pinging and expiring again. However, we can to that check as often as the heartbeats.
pinged_at + heartbeat_interval == last_message + 2 heartbeat_interval
no, that's not accurate. A timeline assuming 1000 ms heartbeat interval (with exaggerated delays for demonstration):
I like @bilderbuchi's last proposal of the general protocol, except that I would, in general, be a bit more lenient with the heartbeats, and only sever the connection in case a few heartbeat intervals have elapsed, not just one ping sent without a pong response, but maybe 2 or 3.
Alternatively, I would give the Components the option, similar to what @bmoneke suggested, to announce (maybe in the SIGNIN
message), what their preferred heartbeat interval would be - maybe a particular Component knows that it will be unresponsive for longer intervals than the default heartbeat interval, and it does not have the option to just reduce its own heartbeat interval to accommodate the other side wanting to go faster, and thus speed up.
So, we could have default heartbeat interval X, but a Component might say that heartbeats should not be expected to come from it more often than every Y second, which may well be larger than X.
In the #35 Status handling, we could flesh out a status level where "the Coordinator becomes suspicious, but does not yet sever the connection outright".
- 2000: 2 heartbeat intervals have passed since the last message, but only half an interval since the PING. Using your logic, the connection will now be severed, but my logic gives a grace period of one heartbeat
This depends no, how often you check the expiration. I proposed to use the heartbeat interval in that message.
Maybe we should start with the goals: What do we want to achieve with the heartbeat?
@bklebel mentioned slow reacting Components, which influences the Actor design:
I thought about single threaded actors, which handle a message at a time. If the handling takes time (due to device communication), there won't be any heartbeats nor any responses to ping requests. The advantage is, that the actors are simple in design: read a message, handle it, respond, repeat.
Another proposal (from the cern middle ware paper) assumes a thread dedicated to message handling. This assures regular heartbeats, fast responses to pings etc. It requires also a queue (in the English sense) for commands and some way to store the a message information during handling etc.
If we assume the first, you can still write a actor of the second type (for example for a very slow instrument etc).
We could leave it up to the user and define it generally:
Then the users can set their intervals suiting their setup.
As we have the ping message, it is no problem, if the heartbeat rate is too low in one Component, as the other side can ping.
Just the expiration time has to be larger than the slowest component connected to that Coordinator.
I like @bilderbuchi's last proposal of the general protocol, except that I would, in general, be a bit more lenient with the heartbeats, and only sever the connection in case a few heartbeat intervals have elapsed, not just one ping sent without a pong response, but maybe 2 or 3.
Alternatively, I would give the Components the option, similar to what @bmoneke suggested, to announce (maybe in the
SIGNIN
message), what their preferred heartbeat interval would be - maybe a particular Component knows that it will be unresponsive for longer intervals than the default heartbeat interval, and it does not have the option to just reduce its own heartbeat interval to accommodate the other side wanting to go faster, and thus speed up. So, we could have default heartbeat interval X, but a Component might say that heartbeats should not be expected to come from it more often than every Y second, which may well be larger than X.In the #35 Status handling, we could flesh out a status level where "the Coordinator becomes suspicious, but does not yet sever the connection outright".
:+1: to all these points.
Also, I caution against too much configurability/parameters -- in this first iteration, we can still keep the logic simple, and not yet account for all use cases and scenarios. Later, we can adjust that and expand with more options if needed.
Also, I caution against too much configurability/parameters -- in this first iteration, we can still keep the logic simple, and not yet account for all use cases and scenarios. Later, we can adjust that and expand with more options if needed.
I fully agree. We could start with a default heartbeat interval for all, and introduce "personalised" heartbeat intervals later - if we are lenient enough with severing connections, we should not run into troubles too quickly, although we will detect failures a bit slower, would be fine with me.
slow reacting Components, which influences the Actor design
That will be another story altogether, I don't yet have a clear picture of how that will go. I would however be careful with having a dedicated thread to send heartbeats, this might lead to heartbeats being sent even though the thread which deals with hardware connections died and was not restarted, so we get inaccurate information. This could be prevented in a good implementation, but if it can be excluded by design, that might be better.
Let us discuss on heartbeats in this issue. Related to #4
Accepted ideas:
Some ideas to get heartbeats: