bilderbuchi commented 1 year ago

Let us start an informal list of "command verbs"/message types that we need within ECP. This way, we get a feeling for the set of commands we need and how they apply to different Components. I have used the term "message type" here to also refer to the various commands of the control protocol, as the boundaries can be diffuse, see e.g. KNOWN_COMPS. I'm open to restructuring that if needed.

This should not be considered formally specified yet, but serve as a basis for what we later add to the specification or to find out how well Avro fits our needs. Let's put the command verb/message type in CAPS, and any arguments between <angle brackets>, and specify if a return message is required (disregarding transport-layer acknowledgements). This is not the specific syntax yet, so don't get hung up on separator choices, etc.

Feel free to edit the comment as needed.

Housekeeping

REQ_STATUS: Request current status of a Component. Reply: STATUS <status description>
ERROR <error level, error description>: Message detailing an error, error categories TBD. Can also be a reply to a message.
LOG <log level, log message>: Log message including log level (levels TBD).
(HEARTBEAT): Every message serves as heartbeat. If an explicit heartbeat is desired, we can send an empty message.

Routing

SIGN_IN: Announce the presence of the sending Component, and request registration with the targeted Coordinator. Reply: Outcome of the sign-in attempt.
SIGN_OUT: Request the deregistration of the sending Component from the targeted Coordinator. Reply: If accepted, confirm deregistration. This will be the last message from the Coordinator to that Component in that connection's lifetime.
LIST_KNOWN_COMPS: Instruct a Coordinator to send the list of Components it knows (both Node-local and distributed). Reply: see next.
KNOWN_COMPS <Coordinatorname> <list of component IDs>: This might also be sent out upon Component connection, so it's not strictly a "command" or a direct reply to a command.

Control

ACKNOWLEDGE: Acknowledge a received message or correct execution of a command.
GET <parametername>: Request the named Parameter's most recent value from an Actor. Reply: The value (and possibly the name).
SET <parametername> <value>: Set the named Parameter of an Actor to the passed value.
CALL <actionname> [args]: Call the named action of an Actor, using 0 or more arguments. TBD how to deal with (lack of) return values.
LIST_PARAMS/LIST_ACTIONS might not be necessary if we use Avro as the "schema" of an entity is part of the connection handshake.
START_POLLING <interval_ms> [1+ parameter names]: Command an Actor to fetch (and publish) fresh values for these parameters at the given interval. The interval is given first, as a variable number of arguments follows.
STOP_POLLING [1+ parameter names]: Stop polling these parameters.

Data

DATA [Dict with 1+ parameter-value pairs]: Published Parameter values. TBC if message payload should include Component ID.
LOCK/UNLOCK/FORCE_UNLOCK: Reserve a resource (Driver or part of a Driver, e.g. pymeasure instrument channel) or release the lock (only from originally locking party). Anybody may use FORCE_UNLOCK (in case the original party died).
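
To get a feeling for the collection above, the verbs can be gathered in a Python enumeration. This is only an illustrative sketch: the names `MessageType`, `EXPECTS_REPLY` and `format_message` are made up here, and the space separator is a placeholder, not proposed syntax.

```python
from enum import Enum


class MessageType(Enum):
    """Command verbs / message types from the informal list (not a spec)."""

    # Housekeeping
    REQ_STATUS = "REQ_STATUS"
    STATUS = "STATUS"
    ERROR = "ERROR"
    LOG = "LOG"
    # Routing
    SIGN_IN = "SIGN_IN"
    SIGN_OUT = "SIGN_OUT"
    LIST_KNOWN_COMPS = "LIST_KNOWN_COMPS"
    KNOWN_COMPS = "KNOWN_COMPS"
    # Control
    ACKNOWLEDGE = "ACKNOWLEDGE"
    GET = "GET"
    SET = "SET"
    CALL = "CALL"
    START_POLLING = "START_POLLING"
    STOP_POLLING = "STOP_POLLING"
    # Data
    DATA = "DATA"
    # Locking (listed at the end of the collection)
    LOCK = "LOCK"
    UNLOCK = "UNLOCK"
    FORCE_UNLOCK = "FORCE_UNLOCK"


# Verbs for which the list explicitly names a reply (CALL is still TBD).
EXPECTS_REPLY = {
    MessageType.REQ_STATUS,
    MessageType.SIGN_IN,
    MessageType.SIGN_OUT,
    MessageType.LIST_KNOWN_COMPS,
    MessageType.GET,
}


def format_message(verb: MessageType, *args: object) -> str:
    """Render a verb and its arguments; the separator is a placeholder."""
    return " ".join([verb.value, *(str(arg) for arg in args)])
```

For example, `format_message(MessageType.START_POLLING, 500, "temperature", "pressure")` renders the interval first and the variable number of parameter names afterwards.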

BenediktBurger commented 1 year ago

A few questions:

How do we mark an answer of, for example Get? Previously I used Set as the answer to Get. Another option is to keep the name (and distinguish via conversation ID / reply reference), or to have distinctive commands like Get and Get_Reply.
do we want to just get / set just one value or list of values?
do we want to accept args for get/set? For example for a forced update (with cached values). If we get/set just one variable, args are easy to implement, I'd say.
what do you intend with the log message? I think logging should use the data protocol, because a Component does not know where to send the logs specifically, and via the data protocol a logging facility could subscribe. I'll open a separate issue.

A few ideas / answers:

Regarding Call return value: We should always return something, be it void/None...
regarding data: the first frame (component ID) is used for filtering and is sent, therefore the payload does not need that information again.
regarding list of known components: we could do it with a get command. Similarly the status.
I like to keep names consistent, therefore I would start all commands that try to get something with GET (GET_STATUS, GET_COMPONENTS...) instead of using REQ or something else.

BenediktBurger commented 1 year ago

Should we interpret an empty message as acknowledgment? At least as a "message received and also heartbeat" acknowledgment. That would be good for a "reply to every message, at least with a heartbeat" heartbeat pattern, see #4.

Remember: zmq messages arrive either complete or not at all, so if we get the header frame, we know that we have all frames. Therefore, there is no possibility to interpret a partially received message as an acknowledgment.

bilderbuchi commented 1 year ago

* How do we mark an answer of, for example Get? Previously I used Set as the answer to Get. Another option is to keep the name (and distinguish via conversation ID / reply reference), or to have distinctive commands like Get and Get_Reply.

We could use the DATA <name> <value> message type, just over the control channel?

* do we want to just get / set just one value or list of values?

More than one would certainly be useful! At least for SET it would be a dict. We just have one message over the network, the Actor asks the Driver for a number of updates, and sends back the reply.
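
A minimal sketch of such a multi-value GET/SET on the Actor side, assuming a driver whose Parameters are plain attributes. `FakeDriver`, `handle_get` and `handle_set` are hypothetical names for illustration only, not an existing API.

```python
from typing import Any, Dict, List


class FakeDriver:
    """Stand-in for an instrument driver; the attribute names are made up."""

    def __init__(self) -> None:
        self.temperature = -5.0
        self.setpoint = 20.0


def handle_set(driver: Any, values: Dict[str, Any]) -> None:
    """Apply one SET message carrying a dict of name-value pairs."""
    for name, value in values.items():
        setattr(driver, name, value)


def handle_get(driver: Any, names: List[str]) -> Dict[str, Any]:
    """Answer one GET message for several Parameters with a single dict."""
    return {name: getattr(driver, name) for name in names}
```

One network message then covers several Parameters, and the reply to a GET has the same dict shape as the payload of a SET.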

* do we want to accept args for get/set? For example for a forced update (with cached values). If we get/set just one variable, args are easy to implement, I'd say.

That might be useful, we'll have to find a good way to specify the format. This also depends on the API of Actor -- if we are dealing only with the one argument from_cache (or force, or whatever) it's probably easier to have an additional command GET_FRESH or somesuch.

* what do you intend with the log message? I think logging should use the data protocol, because a Component does not know where to send the logs specifically, and via the data protocol a logging facility could subscribe. I'll open a separate issue.

Yeah, logs should be emitted/published into some logging stream. (Bonus points, iiuc: if nobody's subscribed, no message is emitted.) Whether that's on the data protocol or a separate one, I can't decide now.

A few ideas / answers:

* Regarding Call return value: We should always return something, be it void/None...

Seems reasonable. An ACK might be in order. We'll probably also need a NULL value/indicator (maybe look in Avro, first).

* regarding data: the first frame (component ID) is used for filtering **and** is sent, therefore the payload does not need that information again.

Yeah, however we might also mutate that info in the header (add/remove Coordinator info), and it could be attractive to keep the payload alone meaningful on its own (i.e. without the message metadata) -- on their way through the system/code, at some point the metadata might be stripped off.

* regarding list of known components: we could do it with a get command. Similarly the status.

I'd keep this separated as this is for separate concerns. Why should we mix Parameter updates with housekeeping updates? What's the attraction of having one less message type/command, but "multiplexing/overloading" another?

* I like to keep names consistent, therefore I would start all commands that try to get something with GET (GET_STATUS, GET_COMPONENTS...) instead of using REQ or something else.

yeah, why not.

Should we interpret an empty message as acknowledgment? At least as a "message received and also heartbeat" acknowledgment.

I don't think so; we should have a distinct ACK-type message. Otherwise, if you send a command, and receive an empty message -- was that the regular Component heartbeat? An acknowledgement of a command? Or do you prefer to resolve this from message metadata (reply-reference/conversation-id)?

bilderbuchi commented 1 year ago

Added START/STOP_POLLING

BenediktBurger commented 1 year ago

Here again some notable differences of the DataProtocol (PUB-SUB) in regard to the Control Protocol:

The Data Protocol Coordinators (let's call them Proxy), do not do anything with the messages, except passing on (equally they pass on subscription/unsubscription requests).
We could invent new Proxies which pass on messages while altering their header, but I do not want to write that code if there is a really good and reliable solution (call zmq.proxy() with two sockets as parameters); besides, I do not see the benefit.
Messages are only sent, if someone subscribed (you understood correctly)
Only the first frame is meaningful as some sending topic, the rest is payload.

Therefore, we have to treat the protocols differently (in the control protocol, we separate header from payload; in the data protocol, it is one message).
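
The filtering described in this list can be mimicked in plain Python, without zmq: ZeroMQ SUB sockets match each subscription as a byte prefix of the first frame. The function name `is_delivered` and the topic values are made up for this sketch.

```python
from typing import Iterable, List


def is_delivered(subscriptions: Iterable[bytes], frames: List[bytes]) -> bool:
    """Return True if a multipart message would reach this subscriber.

    Only the first frame (the topic) is inspected; every following frame
    is opaque payload and is passed on untouched.
    """
    topic = frames[0]
    return any(topic.startswith(prefix) for prefix in subscriptions)
```

A subscriber without any subscription receives nothing, and an empty prefix `b""` matches every topic.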

BenediktBurger commented 1 year ago

I'd keep this separated as this is for separate concerns. Why should we mix Parameter updates with housekeeping updates? What's the attraction of having one less message type/command, but "multiplexing/overloading" another?

I see the Components list and the Status as a property of the Coordinator/Component. If I want a property, I call GET.

Null/None

Avro has "null"

I don't think so; we should have a distinct ACK-type message. Otherwise, if you send a command, and receive an empty message -- was that the regular Component heartbeat? An acknowledgement of a command? Or do you prefer to resolve this from message metadata (reply-reference/conversation-id)?

The idea is that the "message received" answer is just a heartbeat (the other side knows you're still alive). A command or somesuch should be acknowledged by a specific message.

We could say, that the Component may send a "ping" (content may be "null", not empty) message to the Coordinator, which responds with its typical answer (empty frame). That makes it easy: If a message does not contain content, do not respond. If it contains a message, either respond with some answer to the question, or answer an empty message. This prevents an infinite heartbeat chain.
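
The reply rule from the previous paragraph, as a minimal sketch. `reply_to` and the `handler` callback are hypothetical names; the handler is assumed to return `None` when it has no specific answer (as for a ping).

```python
from typing import Callable, Optional


def reply_to(
    content: bytes, handler: Callable[[bytes], Optional[bytes]]
) -> Optional[bytes]:
    """Decide what to send back for a received message.

    Returns None if nothing must be sent, b"" for a bare heartbeat reply,
    or the handler's answer.
    """
    if content == b"":
        # Empty message: a heartbeat. Never respond, so no endless chain.
        return None
    answer = handler(content)
    # Content without a specific answer still gets an empty reply.
    return answer if answer is not None else b""
```

Feeding the empty reply back into `reply_to` yields `None`, which is what stops the heartbeat chain.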

bilderbuchi commented 1 year ago

I see the Components list and the Status as a property of the Coordinator/Component. If I want a property, I call GET.

Well, but (so far), a GET was for a "Parameter" (as in "a property (in the English, not the Pythonic sense) of the Driver represented by an Actor."), not for any property/quantity. What if a Driver implements a property called known_components? Do we want to have to forbid all Parameter names that collide with our housekeeping fields? These concerns are separate, and should stay separate. Right now I don't see an overriding advantage of munging these two together. Do you?

If you want to absolutely use GET to get housekeeping properties, we can do GET_PARAM for the driver's properties, but that's the same outcome as just using a specific command for the housekeeping stuff :shrug:

BenediktBurger commented 1 year ago

Sorry, I missed all the "Actors" and always thought about any Component.

bilderbuchi commented 1 year ago

A command or somesuch should be acknowledged by a specific message.

Why is that? Isn't the expected reply acknowledgement (enough) for a specific message?

The idea is that the "message received" answer is just a heartbeat (the other side knows you're still alive). A command or somesuch should be acknowledged by a specific message.

We could say, that the Component may send a "ping" (content may be "null", not empty) message to the Coordinator, which responds with its typical answer (empty frame). That makes it easy: If a message does not contain content, do not respond. If it contains a message, either respond with some answer to the question, or answer an empty message. This prevents an infinite heartbeat chain.

So, to understand what you wrote correctly (and maybe a little in "devil's advocate" style):

Every "normal" command message should trigger _two_ replies (the first one empty, the second one the "actual" reply).
The Component must not answer the first of those replies.
The Component answers the second of those replies with an empty message back (as it is not a question, so there's no "answer to the question").
A ping does not follow that pattern -- it is a message with content, but the reply is an empty message, not to be answered again.

Do we really need a ratio "normal content": acks of 1:1? A "standard" command exchange causes 4 messages? Are the response times so bad that we want our Components to have an ACK before they get the proper reply half a second or second or so later?

bilderbuchi commented 1 year ago

Here again some notable differences of the DataProtocol (PUB-SUB) in regard to the Control Protocol:

* The Data Protocol Coordinators (let's call them Proxy), do not do anything with the messages, except passing on (equally they pass on subscription/unsubscription requests).

* We could invent new Proxies which pass on messages while altering their header, but I do not want to write that code if there is a really good and reliable solution (call `zmq.proxy()` with two sockets as parameters); besides, I do not see the benefit.

* Messages are only sent, if someone subscribed (you understood correctly)

* Only the first frame is meaningful as some sending topic, the rest is payload.

Therefore, we have to treat the protocols differently (in the control protocol, we separate header from payload; in the data protocol, it is one message).

thanks for that, that's illuminating. I agree with the notion of not reinventing stuff! Could/should the Data Protocol Proxies live inside a Coordinator, or do you want to create a separate Component for that? The former seems reasonable from my point of view, fewer addresses/entities to keep track of, and the Coordinator could set up the details with the Component when establishing the control connection.

bilderbuchi commented 1 year ago

Do we really need a ratio "normal content": acks of 1:1? A "standard" command exchange causes 4 messages?

Actually, it's more than 4 when the coordinators are involved! If I computed correctly, 8 messages with 1 coordinator, and 12 messages with 2 Coordinators (inter-Node), just to send "Hey, C2.CompA, give me the temperature" -> "It's -5 degrees". :confused:

I guess we'll need the high water marks :sweat_smile:
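
One counting model that reproduces these numbers, as a sketch: it assumes the request and the reply each cross every hop (one hop more than there are Coordinators on the path) and that every content message is answered by one empty acknowledgement. The function name is made up.

```python
def messages_per_exchange(coordinators: int, acknowledge_each: bool = True) -> int:
    """Count messages on the wire for one request/reply exchange."""
    hops = coordinators + 1
    # The request and the reply each travel over every hop once.
    content_messages = 2 * hops
    # With acks, every content message triggers one empty message back.
    return content_messages * (2 if acknowledge_each else 1)


for n in (0, 1, 2):
    print(n, "Coordinator(s):", messages_per_exchange(n), "messages")
```

Without the per-message acknowledgement, the same model gives half as many messages for each case.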

BenediktBurger commented 1 year ago

That is the reason I went for the ping pong heartbeat: https://zguide.zeromq.org/docs/chapter4/#Heartbeating-for-Paranoid-Pirate

We could (to reduce data transfer) make these heartbeats without any frames (even without names!). Or we just send heartbeats, if explicitly requested. So an actor, which did not get any message in some time, contacts its Coordinator, asking, whether it is still alive.

BenediktBurger commented 1 year ago

Could/should the Data Protocol Proxies live inside a Coordinator, or do you want to create a separate Component for that? The former seems reasonable from my point of view, fewer addresses/entities to keep track of, and the Coordinator could set up the details with the Component when establishing the control connection.

I thought to keep them separate, as you have to connect the Proxies differently, than the Control Coordinators. Also, you need different addresses (at least ports) anyways.

bilderbuchi commented 1 year ago

I thought to keep them separate, as you have to connect the Proxies differently, than the Control Coordinators. Also, you need different addresses (at least ports) anyways.

Yeah, but that could be part of the CONNECT reply/handshaking/setup, no? "Here's the connection details to attach your Data and Log connections to my ports". One Component will need multiple connections, anyway - do we want to centralise those in the Coordinator (so every Component has effectively 3 connections to a Coordinator), or have 3 separate "central" Components (that could make for some beautiful spiderwebs :D)?

BenediktBurger commented 1 year ago

As the Data protocol is inherently different (just one way), the connection between its Proxies has to be different from the Coordinator connection. Let's discuss it in a separate issue.

Whether these different parts end up in one piece of software or not, the protocols remain separate and the question does not slow down the protocol definition.

BenediktBurger commented 1 year ago

I propose to rename CONNECT / DISCONNECT to SIGNIN / SIGNOUT, in order to differentiate them better from the actual socket connection. That makes it easier in the documentation to differentiate between a connected and a signed in Component.

For example, as a requirement to Message handling, we can state, that the Components must be signed in (which requires to be connected as well). If we stated that they have to be connected, it could be misunderstood, that it is sufficient to do a socket connect.

bilderbuchi commented 1 year ago

I propose to rename CONNECT / DISCONNECT to SIGNIN / SIGNOUT, in order to differentiate them better from the actual socket connection. That makes it easier in the documentation to differentiate between a connected and a signed in Component.

For example, as a requirement to Message handling, we can state, that the Components must be signed in (which requires to be connected as well). If we stated that they have to be connected, it could be misunderstood, that it is sufficient to do a socket connect.

Good call!

bklebel commented 1 year ago

According to the latest state of the PR #38 (and my latest comments therein, e.g. this comment), I think we should boil down the Control Messages to

SIGN-IN
SIGN-OUT
SEND_DIRECTORY - a request for the Directory of a Coordinator, subject to change of the message type signature; possibly we would prefer sth like "GET" or "REQ" over "SEND" (please). What do you think? Or was this the CO_TELL_ALL?
CO_UPDATE - a Coordinator sending its local and global Directory

In the comments to #38, relating to the discussion in #44, I proposed to simplify the messages so that Coordinators do not announce individual SIGN_IN/OUT actions, but simply send their Directory to all connected Coordinators when it changes.

If a sign-in/-out happened within the local Namespace, the respective Coordinator sends the now updated local Directory (and its current global Directory) to all connected remote Coordinators: CO_UPDATE
If a SEND_DIRECTORY request is received by a Coordinator, they reply with their Directories in the CO_UPDATE message.
Coordinators should only consider changes to their own global Directories which come from Coordinators of the corresponding Namespace, i.e. if Co1, Co2 and Co3 (with Namespaces N1, N2, and N3, respectively) are present in the leco network, and Co1 sends a CO_UPDATE message to Co3 which contains updates within the N2 Namespace which Co3 does not have yet, Co3 should, rather than trusting Co1 blindly, at least ask Co2 again for its Directory.
Whenever a Coordinator receives a CO_UPDATE message, it updates its Directories, global and local, and possibly starts to sign-in to new remote Coordinators, as discussed
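
The trust rule from this list could look roughly like the following sketch. The directory layout (a dict mapping Namespace to a list of Component names) and the function name `apply_co_update` are assumptions made for illustration, not part of the proposal.

```python
from typing import Dict, List


def apply_co_update(
    global_directory: Dict[str, List[str]],
    sender_namespace: str,
    update: Dict[str, List[str]],
) -> List[str]:
    """Merge a CO_UPDATE and return the Namespaces that need re-checking.

    Entries for the sender's own Namespace are accepted directly. News about
    other Namespaces is not trusted blindly; those Namespaces are returned so
    that their own Coordinator can be asked for its Directory.
    """
    to_verify = []
    for namespace, components in update.items():
        if namespace == sender_namespace:
            global_directory[namespace] = list(components)
        elif global_directory.get(namespace) != list(components):
            to_verify.append(namespace)
    return to_verify
```

In the example from the list, Co3 would accept Co1's entries for N1 and get `["N2"]` back as the Namespace to ask Co2 about.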

bilderbuchi commented 1 year ago

Let's avoid discussing the propagation of sign-in/-out in the issue collecting message types -> #46.

BenediktBurger commented 1 year ago

Regarding message types (and their encoding), we can take our cue from CoAP: https://en.wikipedia.org/wiki/Constrained_Application_Protocol#Request/response_code_(8_bits) or from HTTP status codes: https://en.wikipedia.org/wiki/List_of_HTTP_status_codes

BenediktBurger commented 1 year ago

LOCK / UNLOCK a resource (or part of it), see #14

pymeasure / leco-protocol

Message type collection #29
