Closed BenediktBurger closed 1 year ago
As mermaid diagrams are not rendered in a PR, I collect the protocol definitions here.
When we set up CI (#13) we can enable that RTD renders docs for PRs, too (https://docs.readthedocs.io/en/stable/pull-requests.html). I think that should make it possible to at least inspect the results.
Don't you want to abbreviate, e.g. Coordinator Co1, Co2, and Components C1, C2,... (or CA, CB,...) -- saves a lot of typing and space in the diagrams?
Successful communication: Some communication.
The message "R:"C1.Component". S:"C1.Coordinator". Acknowledge.", I think, should not be an ACK, but the reply from C1.Component2. Otherwise, this communication is now over without C1 getting the reply it is actually interested in?
Also, is the namespace of a Coordinator not the same as its name? Or do you want to treat the namespaces differently?
A disconnect message has the same consequences as no message during heartbeat time. However, a disconnect makes the name available again for another Component.
I think the Coordinator should at least ask with GET_STATUS once, before it disconnects.
Any message serves as a "connect" message
I disagree. The first message has to be a CONNECT. Otherwise, we end up mixing commands and their replies, and/or the protocol just gets needlessly complicated. Also, a Component will not know the namespace yet. Also, a Component that just connected does not even know which other Components are available, as it did not receive the address list yet. etc etc
Name already used
That one has a funny hole. In the reply, we are using R:".Component", but this Component already exists, so this message will go to the wrong Component! I guess we need a different flow for establishing a connection.
The Component was known, but did not send a message in a long time
I already mentioned the problem with implicitly establishing connections.
Components should request a heartbeat (by sending one themselves) before the time expires.
I'm not sure. I'm OK with sending heartbeats out regularly, but I don't think one should get a reply back. We should check how other protocols handle this.
If you want to know if someone is alive (but are not sure) you should ask for GET_STATUS or a separate GET_ALIVE. The latter has the added benefit that a Component can now realise that its heartbeats have not been heard in time, and tweak the interval.
Message exchange in one Coordinator: As the first message, a local namespace (.) would work equally well:
You mean the first message in the shown exchange? Or the first after connection? (I assume the former)
We already talked about the additional ACKs, and message symmetry elsewhere, but I'm not through with my notifications, yet.
Should we allow "local mode", i.e. without specifying any namespace (recipient only, not sender!)?
This feels attractive for single Node setups, to not needlessly prefix the coordinator namespace all the time. Have we discarded the notion that a Coordinator strips its name from a namespace when sending locally?
If we allow local namespace-less addresses, we should be consistent:
ComponentB is local, Coordinator2.ComponentB is another, remote Component.
I think this should then be transparent and consistent for single node setups, even ones that grow into multinode later.
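To make the consistency rule concrete, here is a minimal sketch of how a Coordinator could interpret recipient addresses if namespace-less names were allowed. The function names `resolve` and `is_local` are hypothetical, and the rule that a missing or empty namespace means "local" is an assumption taken from this proposal, not something that has been decided.

```python
from typing import Tuple


def resolve(recipient: str, local_namespace: str) -> Tuple[str, str]:
    """Split a recipient address into (namespace, name).

    Assumption: a name without a namespace ("ComponentB") or with an empty
    one (".ComponentB") belongs to the Coordinator handling the message.
    """
    namespace, separator, name = recipient.rpartition(".")
    if not separator or not namespace:
        return local_namespace, name
    return namespace, name


def is_local(recipient: str, local_namespace: str) -> bool:
    """Return True if the recipient lives in the handling Coordinator's namespace."""
    return resolve(recipient, local_namespace)[0] == local_namespace
```

With this rule, a single-node setup can use short names throughout, and the same names keep working once a second Coordinator is added, because the full form `Coordinator1.ComponentB` resolves to the same pair.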
That one has a funny hole. In the reply, we are using R:".Component", but this Component already exists, so this message will go to the wrong Component! I guess we need a different flow for establishing a connection.
In the example communication, I did show only the frames actually sent. But that is not the whole truth:
A ROUTER socket (used in the Communicator) prepends every received message with an address (some bytes value, say "vioasdf"). If you send a message with the ROUTER socket, you have to prepend the data you want to send with that address, so that the zmq magic knows to whom to send the data. So actually you call send_multipart(["vioasdf", data_frame0, data_frame1...]).
The Coordinators keep a list of known Sender names and the corresponding addresses (the local part of the "address book").
Therefore, you can always respond to any connected peer, if you know its address. In particular, the Coordinator is able to tell the Component usurping the name "Component" that the name is already taken, because it knows the address of the usurper (from the message it received).
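The address handling described above can be sketched without any sockets. This is only an illustration: the class `AddressBook`, the frame layout `[identity, sender_name, *payload]`, and the error text are assumptions made for the example (the real header format is still to be defined). The returned list is what would be passed to `send_multipart` on the ROUTER socket.

```python
class AddressBook:
    """Local part of the address book: sender name -> ROUTER identity frame."""

    def __init__(self):
        self._identities = {}

    def lookup(self, name):
        """Return the identity registered for a name, or None."""
        return self._identities.get(name)

    def handle(self, frames):
        """Process one received multipart message.

        `frames` is [identity, sender_name, *payload], where `identity` is the
        frame the ROUTER socket prepended. Returns the frames of an error
        reply if the name is taken by another peer, else None.
        """
        identity, sender, *payload = frames
        known = self._identities.get(sender)
        if known is not None and known != identity:
            # Reply to the usurper via its own identity, not via the name,
            # so the existing Component never sees this message.
            return [identity, b"Coordinator", b"ERROR: name already taken"]
        self._identities[sender] = identity
        return None
```

The point of the sketch is that the reply is routed by the identity frame of the incoming message, so the name clash does not cause the error to reach the Component that already owns the name.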
Message exchange with two Coordinators.
Should Coordinators acknowledge to each other the reception of a message
IMO, no, ACKs should only (primarily?) be for messages that would otherwise not get a reply. The reception of the reply is the acknowledgement. If no reply comes, you know something went wrong, and can retry and/or notify upstream Components. E.g.
sequenceDiagram
CA ->> Coord1: R:"C2.CB". S:"C1.CA". Give me property A.
Coord1 ->> Coord2: R:"C2.CB". S:"C1.CA". Give me property A.
Coord2 ->> CB: R:"C2.CB". S:"C1.CA". Give me property A.
Note over CB: No response/timeout
CB -->> Coord2: <missing message>
Coord2 ->> Coord1: R:"C1.CA". S:"C2.CB". Error: C2.CB did not respond
Coord1 ->> CA: R:"C1.CA". S:"C2.CB". Error: C2.CB did not respond
In the example communication, I did show only the frames actually sent. But that is not the whole truth:
Ah, devil's in the details! All clear! The usurper could then react by reporting with another, mutated, name. To avoid a back and forth with _1, _2, _3 suffixes, the Coordinator could even reply with a suggestion it knows is still free: Why don't you call yourself "Component_42", instead?
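A minimal sketch of such a suggestion, assuming the Coordinator has the set of names currently in use at hand. The helper `suggest_free_name` and the `_<number>` suffix scheme are hypothetical, chosen only to match the `_1`, `_2`, `_3` example.

```python
import itertools


def suggest_free_name(requested, taken):
    """Return `requested` if it is free, else the first free `requested_<n>`."""
    if requested not in taken:
        return requested
    for number in itertools.count(1):
        candidate = f"{requested}_{number}"
        if candidate not in taken:
            return candidate
```

Because the Coordinator checks against its own list, the usurper gets a usable name in a single round trip instead of guessing suffixes one by one.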
Also, is the namespace of a Coordinator not the same as its name? Or do you want to treat the namespaces differently?
I thought, that we could name a Coordinator just "Coordinator", as it is unique in its namespace. Therefore you can always address your personal Coordinator if you do not supply any namespace, regardless of the namespace.
I think the Coordinator should at least ask with GET_STATUS once, before it disconnects.
You mean, instead of dropping a name, it sends an "are you still alive?" message, and if no reply arrives, the name is removed from the list? Good idea. So you give code a chance to respond if it forgot its heartbeat.
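Roughly, the bookkeeping for "ask once, then drop" could look like the sketch below. The class `HeartbeatTracker`, its method names, and the choice to wait one more timeout period after the probe are assumptions for illustration. Sending the actual GET_STATUS message is left to the caller.

```python
class HeartbeatTracker:
    """Track when each Component was last heard and who has been probed."""

    def __init__(self, timeout):
        self.timeout = timeout
        self._last_seen = {}  # name -> time of the last received message
        self._probed = {}     # name -> time the probe was sent

    def message_received(self, name, now):
        """Any message (heartbeat or otherwise) counts as a sign of life."""
        self._last_seen[name] = now
        self._probed.pop(name, None)

    def check(self, now):
        """Return (names_to_probe, names_to_remove) for the current time."""
        to_probe, to_remove = [], []
        for name, seen in list(self._last_seen.items()):
            if now - seen <= self.timeout:
                continue  # heard recently enough
            probed_at = self._probed.get(name)
            if probed_at is None:
                # First expiry: ask once instead of dropping the name.
                self._probed[name] = now
                to_probe.append(name)
            elif now - probed_at > self.timeout:
                # The probe went unanswered for a full timeout: drop the name.
                del self._last_seen[name]
                del self._probed[name]
                to_remove.append(name)
        return to_probe, to_remove
```

A Component that answers the probe (or sends anything else) is simply marked alive again, so a forgotten heartbeat does not cost it its name.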
I disagree. The first message has to be a CONNECT. Otherwise, we end up mixing commands and their replies, and/or the protocol just gets needlessly complicated.
Due to heartbeats and incoming messages (which you cannot control), you always risk receiving another message between sending a request and receiving its reply.
Also, a Component will not know the namespace yet.
Yes, but you can already send local messages.
Also, a Component that just connected does not even know which other Components are available, as it did not receive the address list yet. etc etc
But the user might know the name of the Component they want to connect to.
Another question:
The Component was known, but did not send a message in a long time
With that sentence, I meant that the Component did not send any heartbeat for some time.
You mean the first message in the shown exchange? Or the first after connection? (I assume the former)
I wanted to give an example of "local" communication without specifying the namespace.
Do we have to use the leading period in the address (even without namespace)? It feels weird/ugly, and doesn't help with zmq topic filtering iiuc.
No. The leading period is not necessary, we could decide to drop it altogether. For the data protocol (topic filtering) we should use the full name.
The usurper could then react by reporting with another, mutated, name. To avoid a back and forth with _1, _2, _3 suffixes, the Coordinator could even reply with a suggestion it knows is still free: Why don't you call yourself "Component_42", instead?
I did not think about that, as I thought, that humans give the names, but that is an idea.
Have we discarded the notion that a Coordinator strips its name from a namespace when sending locally?
I started an issue regarding that in #27. From the considerations given there, I prefer to always use the full name, and I used it in the examples, but that is not yet decided.
If no reply comes, you know something went wrong, and can retry and/or notify upstream Components. E.g
I would not put the burden of checking for an answer onto the Coordinator, as it does not know, whether an answer is required.
You mean, instead of dropping a name, it sends an "are you still alive?" message, and if no reply arrives, the name is removed from the list? Good idea. So you give code a chance to respond if it forgot its heartbeat.
Exactly.
But the user might know the name of the Component they want to connect to.
We are trying to specify the protocol, though, with as little as possible relying on user capability. ;-)
Due to heartbeats and incoming messages (which you cannot control), you always risk receiving another message between sending a request and receiving its reply.
Yeah, but then you have different conversation IDs for different "topics", and a GET is a different thing from a CONNECT, why mix it up. Also, your protocol state machine gets easier if it starts with one option, a CONNECT, not any message?!
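To illustrate the state machine argument, here is a small sketch of the Coordinator's view of one peer when the first message must be a CONNECT. The class `ComponentSession`, the command strings, and the return values are hypothetical, and this shows only the CONNECT-first variant, which is one of the two positions under discussion.

```python
from enum import Enum


class State(Enum):
    DISCONNECTED = "disconnected"
    CONNECTED = "connected"


class ComponentSession:
    """Coordinator-side state for a single peer (CONNECT-first variant)."""

    def __init__(self):
        self.state = State.DISCONNECTED

    def handle(self, command):
        """Return what the Coordinator would do with `command`."""
        if self.state is State.DISCONNECTED:
            # Only one entry point: anything but CONNECT is rejected.
            if command == "CONNECT":
                self.state = State.CONNECTED
                return "ACK"
            return "ERROR: not connected"
        if command == "DISCONNECT":
            self.state = State.DISCONNECTED
            return "ACK"
        # Every other command of a connected peer is routed as usual.
        return "ROUTE"
```

In the implicit-connect variant, the `DISCONNECTED` branch would instead have to register the sender and route the same message in one step, which is where commands and connection handling start to mix.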
Yes, but you can already send local messages.
I think we need to decide if we always use the full addresses or not, for this.
If a "connect" is necessary, how do we deal with a dying (and restarting) Coordinator? Without the "connect" message, everything would continue as usual.
I fear I don't understand. If the coordinator is "dead", how can everything continue as usual? Aren't all the connections dead? It did not send heartbeats. How does the CONNECT message from a Component come into play here? Also, if a Coordinator dies, we are in deep shit already, anyway, no? :D
I did not think about that, as I thought, that humans give the names, but that is an idea.
Thanks. Sure humans can do that, but thinking of pymeasure, people also leave their instrument names alone most of the time, and it will be nice if we automatically disambiguate.
No. The leading period is not necessary, we could decide to drop it altogether. For the data protocol (topic filtering) we should use the full name.
:+1:
I would not put the burden of checking for an answer onto the Coordinator, as it does not know, whether an answer is required.
Elsewhere we talked about that a message always requires a reply (even if it is null) - I thought that to be the original purpose of the ACK - a reply in case no data/content is expected.
oh man, multi-parallel processing of discussion points :sweat: time for dinner soon :grin:
I fear I don't understand. If the coordinator is "dead", how can everything continue as usual?
If a Coordinator is restarted (due to being an OS service etc.), all the Components reconnect automatically (in Zmq), without knowing, that they reconnected. Due to constant heartbeats, the Connector rebuilds its address book fast and can route messages easily. Maybe a few messages will get rejected, but not all.
If we require a new "connect" message, all Components have to take an action.
If a Coordinator is restarted (due to being an OS service etc.), all the Components reconnect automatically (in Zmq), without knowing, that they reconnected. Due to constant heartbeats, the Connector rebuilds its address book fast and can route messages easily. Maybe a few messages will get rejected, but not all.
If we require a new "connect" message, all Components have to take an action.
OK, I think we are maybe talking about two different "connect" events. You are talking (afaict) about the zmq connection, which automatically gets reconnected. I was talking about exchanging the necessary info for a Component to interoperate with a Coordinator in LECO -- the address book, avro schemas, the Node's namespace, handshake stuff, whatever might come later.
If the Component does not even realise that the connection was gone for a while, indeed, why would it need a new CONNECT?
However, at the first time it connects (also after it restarts), it needs some info (currently, mainly the address book and avro handshake), and that I would like to handle in a separate message exchange, not interspersed with regular control messages.
The Message Layer will define how the commands are encoded; here they are in plain English. How the Header is formatted will be defined in #33.
General notes:
Connection
address is for example protocol, host, and port.
Basic communication
basic communication (connect/disconnect, heartbeat)
Successful communication
Notes:
Different unsuccessful communication parts
Components should request a heartbeat (by sending one themselves) before the time expires.
Message exchange
Message exchange in one Coordinator
Notes:
Questions:
Message exchange with two Coordinators.
During the whole exchange, the conversation ID is the same.