pymeasure / leco-protocol

Design notes for a generic communication protocol to control experiments and measurement hardware
https://leco-laboratory-experiment-control-protocol.readthedocs.io
MIT License

Component-Coordinator Transport Layer Protocol #32

Closed BenediktBurger closed 1 year ago

BenediktBurger commented 1 year ago

As mermaid diagrams are not rendered in a PR, I collect the protocol definitions here.

The Message Layer will define how the commands are encoded; here they are given in plain English. How the Header is formatted will be defined in #33.

General notes:

Connection

erDiagram
    Component }|--|| Coordinator : "DEALER connects to ROUTER"
    Coordinator {
        string address
        string namespace
    }
    Component {
        string ID
    }

The address consists of, for example, protocol, host, and port.

Basic communication

basic communication (connect/disconnect, heartbeat)

Successful communication

sequenceDiagram
    Note over CA,Co1: Initial communication
    CA ->> Co1: || I connect
    Note right of Co1: Stores CA's address in its list
    Co1 ->> CA: CA||Welcome to namespace "Co1" and here are relevant infos
    Note left of CA: Stores "Co1" as its namespace.
    Note over CA,Co1: Some time later, a heartbeat
    CA ->> Co1: ||ping
    Note right of Co1: Updates heartbeat time.
    Co1 ->> CA: CA||pong
    Note left of CA: Updates heartbeat time.
    Note over CA,Co1: Some communication
    CA ->> Co1: Co2|CB||Some message for someone else.
    Note right of Co1: Updates heartbeat time.
    Co1 ->> CA: CA||pong
    Note right of Co1: Sends message to CB via Co2
    Note left of CA: Updates heartbeat time
    Note over CA,Co1: End of communication
    CA ->> Co1: || I disconnect from you.
    Co1 ->> CA: CA|| Acknowledge.
    Note right of Co1: Deletes CA from address list.

Notes:

Different unsuccessful communication parts

sequenceDiagram
    Note over CA,Co1: Name already used: zmq.connect raises error
    Note over CA,Co1: The CA was known, but did not send a message in a long time
    Co1 ->> CA: CA|| Are you still alive?
    Note left of CA: Does not respond.
    Note right of Co1: Deletes CA from address list.
    Note over CA,Co1: TBD: The CA was known, but did not send a message in a long time
    Note right of Co1: Deletes "CA" from address list.
    CA ->> Co1: R:"C1.CA2". S:"C1.CA". Some communication for someone else.
    Note right of Co1: Stores "CA" in its address list.
    Co1 ->> CA: R:"C1.CA". S:"C1.Co1". Acknowledge. The namespace is "C1".
    Note right of Co1: Handles the communication to CA 2
    Note left of CA: Updates "C1" as its namespace.
    Note over CA,Co1: Unknown recipient
    CA ->> Co1: Co1|CB|| Some message.
    Note right of Co1: Does not know CB.
    Co1 ->> CA: CA|| Error: I do not know "CB".

Components should request a heartbeat (by sending one themselves) before the time expires.
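
The rule above could be sketched on the Component side roughly as follows. This is a minimal sketch under assumptions: the class name `HeartbeatTimer`, the expiry time, and the idea of pinging after half the expiry time are illustrative and not part of the protocol definition.

```python
import time


class HeartbeatTimer:
    """Tracks when a Component last sent anything to its Coordinator."""

    def __init__(self, expiry_s, margin=0.5, clock=time.monotonic):
        # expiry_s: hypothetical time after which the Coordinator forgets us.
        # margin: fraction of expiry_s after which we send a ping ourselves.
        self.expiry_s = expiry_s
        self.margin = margin
        self.clock = clock
        self.last_sent = clock()

    def message_sent(self):
        # Any outgoing message counts as a sign of life.
        self.last_sent = self.clock()

    def ping_due(self):
        # True once the silent period reaches the chosen share of the expiry time.
        return self.clock() - self.last_sent >= self.expiry_s * self.margin


# Usage with a fake clock, so that the example runs instantly.
now = [0.0]
timer = HeartbeatTimer(expiry_s=10.0, clock=lambda: now[0])
now[0] = 4.0
early = timer.ping_due()       # 4 s of silence, below the 5 s threshold
now[0] = 6.0
late = timer.ping_due()        # 6 s of silence, ping is due
timer.message_sent()
after_send = timer.ping_due()  # the timer was reset by the outgoing message
```

Because every outgoing message resets the timer, a busy Component would never need to send a dedicated ping.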

Message exchange

Message exchange in one Coordinator

sequenceDiagram
    CA ->> Co1: Co1|CB|| Give me property A.
    Co1 ->> CA: CA|| Acknowledge.
    Co1 ->> CB: CB|| Give me property A. ||Co1|CA
    CB ->> Co1: Co1|CA|| Property A has value 5.
    Co1 ->> CB: CB|| Acknowledge.
    Co1 ->> CA: CA|| Property A has value 5. ||Co1|CB
    Note over CA,Co1: The first message would work equally well with a local namespace:
    CA ->> Co1: CB|| Give me property A.

Notes:

Questions:

Message exchange with two Coordinators.

sequenceDiagram
    CA ->> Co1: Co2|CB|| Give me property A.
    Co1 ->> CA: CA|| Acknowledge.
    Co1 ->> Co2: Co2|CB|| Give me property A.||Co1|CA
    Co2 ->> CB: CB|| Give me property A.||Co1|CA
    CB ->> Co2: Co1|CA|| Property A has value 5
    Co2 ->> CB: CB|| Acknowledge.
    Co2 ->> Co1: Co1|CA|| Property A has value 5||Co2|CB
    Co1 ->> CA: CA|| Property A has value 5||Co2|CB

During the whole exchange, the conversation ID is the same.
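
To illustrate the constant conversation ID, here is a minimal sketch. The `Message` class, its field names, and the dotted `namespace.name` addresses are assumptions for illustration only; the real header format is left to #33.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    receiver: str         # hypothetical full name, e.g. "Co2.CB"
    sender: str           # hypothetical full name, e.g. "Co1.CA"
    conversation_id: str  # stays the same for the whole exchange
    content: str


def make_request(sender, receiver, content):
    # A new conversation starts with a fresh ID.
    return Message(receiver=receiver, sender=sender,
                   conversation_id=uuid.uuid4().hex, content=content)


def make_reply(request, content):
    # Swap sender and receiver, but keep the conversation ID.
    return Message(receiver=request.sender, sender=request.receiver,
                   conversation_id=request.conversation_id, content=content)


request = make_request("Co1.CA", "Co2.CB", "Give me property A.")
# The Coordinators forward the message unchanged, so CB sees the same ID.
reply = make_reply(request, "Property A has value 5.")

# CA can match the reply to its open request by the conversation ID.
open_requests = {request.conversation_id: request}
matched = open_requests.pop(reply.conversation_id)
```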

bilderbuchi commented 1 year ago

As mermaid diagrams are not rendered in a PR, I collect the protocol definitions here.

When we set up CI (#13), we can enable RTD to render docs for PRs, too (https://docs.readthedocs.io/en/stable/pull-requests.html). I think that should make it possible to at least inspect the results.

bilderbuchi commented 1 year ago

Don't you want to abbreviate, e.g. Coordinator Co1, Co2, and Components C1, C2,... (or CA, CB,...) -- saves a lot of typing and space in the diagrams?

Successful communication: Some communication.

The message "R:"C1.Component". S:"C1.Coordinator". Acknowledge.", I think, should not be an ACK, but the reply from C1.Component2. Otherwise, this communication is now over without C1 getting the reply it is actually interested in?

Also, is the namespace of a Coordinator not the same as its name? Or do you want to treat the namespaces differently?

A disconnect message has the same consequences as no message during heartbeat time. However, a disconnect makes the name available again for another Component.

I think the Coordinator should at least ask with GET_STATUS once, before it disconnects.

Any message serves as a "connect" message

I disagree. The first message has to be a CONNECT. Otherwise, we end up mixing commands and their replies, and/or the protocol just gets needlessly complicated. Also, a Component will not know the namespace yet. Also, a Component that just connected does not even know which other Components are available, as it did not receive the address list yet. etc etc

bilderbuchi commented 1 year ago

Name already used

That one has a funny hole. In the reply, we are using R:".Component", but this Component already exists, so this message will go to the wrong Component! I guess we need a different flow for establishing a connection.

The Component was known, but did not send a message in a long time

I already mentioned the problem with implicitly establishing connections.

Components should request a heartbeat (by sending one themselves) before the time expires.

I'm not sure. I'm OK with sending heartbeats out regularly, but I don't think one should get a reply back. We should check how other protocols handle this. If you want to know if someone is alive (but are not sure) you should ask for GET_STATUS or a separate GET_ALIVE. The latter has the added benefit that a Component can now realise that its heartbeats have not been heard in time, and tweak the interval.

bilderbuchi commented 1 year ago

Message exchange in one Coordinator As first message would work equally a local namespace (.):

You mean the first message in the shown exchange? Or the first after connection? (I assume the former)

We already talked about the additional ACKs, and message symmetry elsewhere, but I'm not through with my notifications, yet.

Should we allow "local mode", i.e. without specifying any namespace (recipient only, not sender!)?

This feels attractive for single Node setups, to not needlessly prefix the coordinator namespace all the time. Have we discarded the notion that a Coordinator strips its name from a namespace when sending locally?

If we allow local namespace-less addresses, we should be consistent:

I think this should then be transparent and consistent for single node setups, even ones that grow into multinode later.

BenediktBurger commented 1 year ago

That one has a funny hole. In the reply, we are using R:".Component", but this Component already exists, so this message will go to the wrong Component! I guess we need a different flow for establishing a connection.

In the example communication, I did show only the frames actually sent. But that is not the whole truth: A ROUTER socket (used in the Coordinator) prepends every received message with an address (some bytes value, say "vioasdf"). If you send a message with the ROUTER socket, you have to prepend the data you want to send with that address, so that the zmq magic knows to whom to send the data. So you actually call send_multipart(["vioasdf", data_frame0, data_frame1...]). The Coordinators keep a list of known sender names and the corresponding addresses (the local part of the "address book"), so you can always respond to any connected peer whose address you know. The Coordinator is therefore able to tell the Component usurping the name "Component" that the name is already taken, because it knows the usurper's address (from the message it received).
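
The bookkeeping described here can be sketched without any sockets. This is only a sketch under assumptions: the `AddressBook` class is hypothetical, and the frame layout `[address, sender_name, payload]` is invented for illustration (the real header is defined in #33). Only the leading address frame reflects what a ROUTER socket actually delivers.

```python
class AddressBook:
    """Maps sender names to the address frame prepended by the ROUTER socket."""

    def __init__(self):
        self.addresses = {}  # name -> address bytes

    def handle(self, frames):
        # Assumed layout of a received message: [address, sender_name, payload].
        address, sender, payload = frames
        name = sender.decode()
        known = self.addresses.get(name)
        if known is not None and known != address:
            # Reply to the address the message came from (the usurper),
            # not to the address registered under that name.
            return [address, b"Error: name already taken: " + sender]
        self.addresses[name] = address
        return [address, b"Acknowledge."]


book = AddressBook()
first = book.handle([b"vioasdf", b"CA", b"ping"])
usurper = book.handle([b"qwertz", b"CA", b"ping"])
```

The returned lists are what would be passed to send_multipart on the ROUTER socket, so the error reaches the second peer while the first keeps the name.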

bilderbuchi commented 1 year ago

Message exchange with two Coordinators.

Should Coordinators acknowledge to each other the reception of a message

IMO, no, ACKs should only (primarily?) be for messages that would otherwise not get a reply. The reception of the reply is the acknowledgement. If no reply comes, you know something went wrong, and can retry and/or notify upstream Components. E.g.

sequenceDiagram
    CA ->> Coord1: R:"C2.CB". S:"C1.CA". Give me property A.
    Coord1 ->> Coord2: R:"C2.CB". S:"C1.CA". Give me property A.
    Coord2 ->> CB: R:"C2.CB". S:"C1.CA". Give me property A.
    Note over CB: No response/timeout
    CB -->> Coord2: <missing message>
    Coord2 ->> Coord1: R:"C1.CA". S:"C2.CB". Error: C2.CB did not respond
    Coord1 ->> CA: R:"C1.CA". S:"C2.CB". Error: C2.CB did not respond

bilderbuchi commented 1 year ago

In the example communication, I did show only the frames actually sent. But that is not the whole truth:

Ah, devil's in the details! All clear! The usurper could then react by reporting with another, mutated, name. To avoid a back and forth with _1, _2, _3 suffixes, the Coordinator could even reply with a suggestion it knows is still free: Why don't you call yourself "Component_42", instead?
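
A Coordinator-side name suggestion could look roughly like this. The function name and the `_N` suffix scheme are assumptions taken from the example above, not a decided part of the protocol.

```python
def suggest_free_name(requested, taken):
    """Return the requested name if it is free, else the first free 'name_N'."""
    if requested not in taken:
        return requested
    suffix = 1
    while f"{requested}_{suffix}" in taken:
        suffix += 1
    return f"{requested}_{suffix}"


taken_names = {"Component", "Component_1"}
suggestion = suggest_free_name("Component", taken_names)
unchanged = suggest_free_name("CB", taken_names)
```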

BenediktBurger commented 1 year ago

Also, is the namespace of a Coordinator not the same as its name? Or do you want to treat the namespaces differently?

I thought that we could name a Coordinator just "Coordinator", as it is unique in its namespace. That way you can always address your own Coordinator by omitting the namespace, whatever that namespace is.
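
Resolving such namespace-less addresses could be sketched as follows. The function `resolve` and the dotted `namespace.name` form are assumptions for illustration; whether full names are always required is still under discussion.

```python
def resolve(address, local_namespace):
    """Return (namespace, name); a missing namespace means the local one."""
    namespace, _, name = address.rpartition(".")
    if not namespace:
        # Covers both "CB" and ".CB" (leading period without namespace).
        return local_namespace, name
    return namespace, name


local = resolve("CB", "Co1")
dotted = resolve(".CB", "Co1")
remote = resolve("Co2.CB", "Co1")
own_coordinator = resolve("Coordinator", "Co1")
```

With this reading, "Coordinator" always reaches the Coordinator of the sender's own namespace.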

I think the Coordinator should at least ask with GET_STATUS once, before it disconnects.

You mean, instead of dropping a name, it sends an "are you still alive?" message, and if no reply arrives, it is removed from the list? Good idea. So you give code a chance to respond, if they forgot their heartbeat.
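
A minimal sketch of that "ask once, then drop" behaviour on the Coordinator side. The class `LivenessTracker` is hypothetical, and using the same expiry time for the grace period after the probe is an assumption.

```python
class LivenessTracker:
    """Probes silent Components once before removing them."""

    def __init__(self, expiry_s):
        self.expiry_s = expiry_s
        self.last_seen = {}  # name -> time of the last received message
        self.probed = {}     # name -> time the probe was sent

    def message_received(self, name, now):
        # Any message, including a heartbeat, proves the Component is alive.
        self.last_seen[name] = now
        self.probed.pop(name, None)

    def check(self, now):
        """Return (names to probe now, names removed now)."""
        to_probe, removed = [], []
        for name in list(self.last_seen):
            if name in self.probed:
                if now - self.probed[name] >= self.expiry_s:
                    # No answer to the probe: forget the Component.
                    del self.last_seen[name]
                    del self.probed[name]
                    removed.append(name)
            elif now - self.last_seen[name] >= self.expiry_s:
                # Silent for too long: ask once before dropping.
                self.probed[name] = now
                to_probe.append(name)
        return to_probe, removed


tracker = LivenessTracker(expiry_s=10)
tracker.message_received("CA", now=0)
tracker.message_received("CB", now=0)
first_check = tracker.check(now=12)    # both are silent, both get probed
tracker.message_received("CB", now=13)  # only CB answers the probe
second_check = tracker.check(now=22)   # CA is removed, CB stays
```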

I disagree. The first message has to be a CONNECT. Otherwise, we end up mixing commands and their replies, and/or the protocol just gets needlessly complicated.

Due to heartbeats and incoming messages (which you cannot control), you always run the risk of receiving another message between sending a request and receiving its reply.

Also, a Component will not know the namespace yet.

Yes, but you can already send local messages.

Also, a Component that just connected does not even know which other Components are available, as it did not receive the address list yet. etc etc

But the user might know the name of the Component they want to connect to.

Another question:

The Component was known, but did not send a message in a long time

With that sentence, I meant that the Component did not send any heartbeat for some time.

You mean the first message in the shown exchange? Or the first after connection? (I assume the former)

I wanted to give an example of "local" communication without specifying the namespace.

BenediktBurger commented 1 year ago

Do we have to use the leading period in the address (even without namespace)? It feels weird/ugly, and doesn't help with zmq topic filtering iiuc.

No. The leading period is not necessary, we could decide to drop it altogether. For the data protocol (topic filtering) we should use the full name.

The usurper could then react by reporting with another, mutated, name. To avoid a back and forth with _1, _2, _3 suffixes, the Coordinator could even reply with a suggestion it knows is still free: Why don't you call yourself "Component_42", instead?

I did not think about that, as I thought that humans give the names, but that is an idea.

BenediktBurger commented 1 year ago

Have we discarded the notion that a Coordinator strips its name from a namespace when sending locally?

I started an issue regarding that in #27. Based on the considerations given there, I prefer to always use the full name, and I used it in the examples, but that is not yet decided.

BenediktBurger commented 1 year ago

If no reply comes, you know something went wrong, and can retry and/or notify upstream Components. E.g

I would not put the burden of checking for an answer onto the Coordinator, as it does not know whether an answer is required.

bilderbuchi commented 1 year ago

You mean, instead of dropping a name, it sends an "are you still alive?" message, and if no reply arrives, it is removed from the list? Good idea. So you give code a chance to respond, if they forgot their heartbeat.

Exactly.

But the user might know the name of the Component they want to connect to.

We are trying to specify the protocol, though, with as little as possible relying on user capability. ;-)

Due to heartbeats and incoming messages (which you cannot control), you always run the risk of receiving another message between sending a request and receiving its reply.

Yeah, but then you have different conversation IDs for different "topics", and a GET is a different thing from a CONNECT, why mix it up. Also, your protocol state machine gets easier if it starts with one option, a CONNECT, not any message?!
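
To make the state machine argument concrete, here is a minimal sketch of a Coordinator-side session that only accepts a CONNECT as the first message. Apart from CONNECT, all command and reply strings are hypothetical placeholders.

```python
from enum import Enum, auto


class State(Enum):
    DISCONNECTED = auto()
    CONNECTED = auto()


class ComponentSession:
    """Coordinator-side view of a single Component."""

    def __init__(self):
        self.state = State.DISCONNECTED

    def handle(self, command):
        if self.state is State.DISCONNECTED:
            # Only one option is valid at the start: CONNECT.
            if command == "CONNECT":
                self.state = State.CONNECTED
                return "WELCOME"
            return "ERROR: not connected"
        if command == "CONNECT":
            return "ERROR: already connected"
        if command == "DISCONNECT":
            self.state = State.DISCONNECTED
            return "ACKNOWLEDGE"
        # Every other command is regular traffic and gets routed.
        return "FORWARD"


session = ComponentSession()
replies = [session.handle(c) for c in ["GET", "CONNECT", "GET", "DISCONNECT", "GET"]]
```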

Yes, but you can already send local messages.

I think we need to decide if we always use the full addresses or not, for this.

If a "connect" is necessary, how do we deal with a dying (and restarting) Coordinator? Without the "connect" message, everything would continue as usual.

I fear I don't understand. If the coordinator is "dead", how can everything continue as usual? Aren't all the connections dead? It did not send heartbeats. How does the CONNECT message from a Component come into play here? Also, if a Coordinator dies, we are in deep shit already, anyway, no? :D

I did not think about that, as I thought that humans give the names, but that is an idea.

Thanks. Sure humans can do that, but thinking of pymeasure, people also leave their instrument names alone most of the time, and it will be nice if we automatically disambiguate.

No. The leading period is not necessary, we could decide to drop it altogether. For the data protocol (topic filtering) we should use the full name.

:+1:

I would not put the burden of checking for an answer onto the Coordinator, as it does not know whether an answer is required.

Elsewhere we discussed that a message always requires a reply (even if it is null). I thought that was the original purpose of the ACK: a reply in case no data/content is expected.
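
If every message requires a reply, a Coordinator could track forwarded messages and report a missing reply upstream, as in the timeout diagram earlier in this thread. A minimal sketch, assuming a hypothetical `PendingReplies` helper, an arbitrary timeout, and that open requests are keyed by conversation ID:

```python
class PendingReplies:
    """Tracks forwarded messages that still wait for a reply."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.open = {}  # conversation_id -> (sender, receiver, deadline)

    def forwarded(self, conversation_id, sender, receiver, now):
        self.open[conversation_id] = (sender, receiver, now + self.timeout_s)

    def reply_received(self, conversation_id):
        # The reply itself serves as the acknowledgement.
        self.open.pop(conversation_id, None)

    def expire(self, now):
        """Return (recipient, error text) for every overdue conversation."""
        errors = []
        for cid, (sender, receiver, deadline) in list(self.open.items()):
            if now >= deadline:
                del self.open[cid]
                errors.append((sender, f"Error: {receiver} did not respond"))
        return errors


pending = PendingReplies(timeout_s=5)
pending.forwarded("conv-1", sender="C1.CA", receiver="C2.CB", now=0)
pending.forwarded("conv-2", sender="C1.CA", receiver="C2.CC", now=0)
pending.reply_received("conv-2")
errors = pending.expire(now=6)
```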

bilderbuchi commented 1 year ago

oh man, multi-parallel processing of discussion points :sweat: time for dinner soon :grin:

BenediktBurger commented 1 year ago

I fear I don't understand. If the coordinator is "dead", how can everything continue as usual?

If a Coordinator is restarted (due to being an OS service etc.), all the Components reconnect automatically (in zmq), without knowing that they reconnected. Due to constant heartbeats, the Coordinator rebuilds its address book fast and can route messages easily. Maybe a few messages will get rejected, but not all.

If we require a new "connect" message, all Components have to take an action.

bilderbuchi commented 1 year ago

If a Coordinator is restarted (due to being an OS service etc.), all the Components reconnect automatically (in zmq), without knowing that they reconnected. Due to constant heartbeats, the Coordinator rebuilds its address book fast and can route messages easily. Maybe a few messages will get rejected, but not all.

If we require a new "connect" message, all Components have to take an action.

OK, I think we are maybe talking about two different "connect" events. You are talking (afaict) about the zmq connection, which automatically gets reconnected. I was talking about exchanging the necessary info for a Component to interoperate with a Coordinator in LECO -- the address book, avro schemas, the Node's namespace, handshake stuff, whatever might come later.

If the Component does not even realise that the connection was gone for a while, indeed, why would it need a new CONNECT? However, at the first time it connects (also after it restarts), it needs some info (currently, mainly the address book and avro handshake), and that I would like to handle in a separate message exchange, not interspersed with regular control messages.