pymeasure / leco-protocol

Design notes for a generic communication protocol to control experiments and measurement hardware
https://leco-laboratory-experiment-control-protocol.readthedocs.io
MIT License

Coordinator sign in procedure #44

Closed BenediktBurger closed 1 year ago

BenediktBurger commented 1 year ago

Signing in one Coordinator to another is more complex than signing in a Component; let us discuss it here.

This discussion also has consequences for the style of the Directory.

Initial situation:

Goal:

BenediktBurger commented 1 year ago

My initial proposal:

A Coordinator Co1 joining a network follows a few steps:

  1. It signs in to one Coordinator Co2 of the Network.
  2. It sends a CO_TELL_ALL message to Co2, to tell all other Coordinators about Co1's address.
  3. Co2 tells all the Coordinators signed in (Co3, Co4...) about Co1 with a CO_NEW message.
  4. These other Coordinators (Co3, Co4...) sign in to Co1.
  5. All Coordinators are connected to all others.
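
The five steps above can be sketched as a small in-memory simulation. This is only an illustration under assumptions: the `Coordinator` class, its method names, and the dictionary of peers standing in for sockets are hypothetical; only the message names CO_TELL_ALL and CO_NEW come from the proposal.

```python
class Coordinator:
    """Hypothetical stand-in for a Coordinator; a dict replaces the sockets."""

    def __init__(self, name):
        self.name = name
        self.peers = {}  # name -> Coordinator this one is signed in to

    def sign_in(self, other):
        # Steps 1 and 4: a sign-in is modelled as a two-way link.
        self.peers[other.name] = other
        other.peers[self.name] = self

    def co_tell_all(self, newcomer):
        # Step 3: on CO_TELL_ALL, send CO_NEW to all signed-in Coordinators.
        for peer in list(self.peers.values()):
            if peer is not newcomer:
                peer.co_new(newcomer)

    def co_new(self, newcomer):
        # Step 4: a Coordinator told about the newcomer signs in to it.
        if newcomer.name not in self.peers:
            self.sign_in(newcomer)


def join_network(co1, co2):
    co1.sign_in(co2)      # step 1
    co2.co_tell_all(co1)  # step 2: Co1 asks Co2 to announce it


# An existing, fully connected Network of three Coordinators.
co2, co3, co4 = Coordinator("Co2"), Coordinator("Co3"), Coordinator("Co4")
co2.sign_in(co3)
co2.sign_in(co4)
co3.sign_in(co4)

co1 = Coordinator("Co1")
join_network(co1, co2)
print(sorted(co1.peers))  # ['Co2', 'Co3', 'Co4']  (step 5)
```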

Two Coordinators shall follow a more thorough sign-in/sign-out procedure than Components (an address is, for example, host and port). The sign-in might happen because a CO_NEW message arrived or at startup. The sign-out might happen because the Coordinator shuts down.

sequenceDiagram
    participant r1 as ROUTER
    participant d1 as DEALER
    participant r2 as ROUTER
    participant d2 as DEALER
    Note over r1,d1: N1 Coordinator<br>at address1
    Note over r2,d2: N2 Coordinator<br>at address2
    Note over r1,d2: Sign in between two Coordinators
    Note right of r1: shall connect<br>to address2
    activate d1
    Note left of d1: created with<br> name "temp-NS"
    d1-->>r2: connect to address2
    d1->>r2: CO_SIGNIN<br>N1, address1,<br>ref:temp-NS
    par
        d1->>r2: GET local Directory
    and
        Note right of r2: stores N1 identity
        activate d2
        Note left of d2: created with<br>name "N1"
        d2-->>r1: connect to address1
        d2->>r1: CO_SIGNIN<br>N2, address2<br>your ref:temp-NS
        Note right of r1: stores N2 identity
        Note left of d1: name changed<br>from "temp-NS"<br>to "N2"
        d2->>r1: GET local Directory
    end
    d2->>r1: Here is my<br>local Directory
    Note right of r1: Updates<br>global Directory
    d1->>r2: Here is my<br>local Directory
    Note right of r2: Updates<br>global Directory
    Note over r1,d2: Sign out between two Coordinators
    Note right of r1: shall sign out from N2
    d1->>r2: CO_SIGNOUT
    Note right of r2: removes N1 identity
    d2->>-r1: CO_SIGNOUT
    Note right of r1: removes N2 identity
    deactivate d1

Advantage:

Reason for the reference (ref):

BenediktBurger commented 1 year ago

Alternative, such that the reference is not needed anymore: The Co2 responds "illegally" from its ROUTER to the other DEALER socket, in order to identify the DEALER socket with the namespace.

Here is the initial part:

sequenceDiagram
    participant r1 as ROUTER
    participant d1 as DEALER
    participant r2 as ROUTER
    participant d2 as DEALER
    Note over r1,d1: N1 Coordinator<br>at address1
    Note over r2,d2: N2 Coordinator<br>at address2
    Note over r1,d2: Sign in between two Coordinators
    Note right of r1: shall connect<br>to address2
    activate d1
    Note left of d1: created with<br> name "temp-NS"
    d1-->>r2: connect to address2
    d1->>r2: CO_SIGNIN<br>N1, address1
    Note right of r2: stores N1 identity
    Note right of r2: Normally illegal<br>response
    r2->>d1: ACK: Namespace is N2
    Note left of d1: stores N2 as <br> DEALER name

    activate d2
    Note left of d2: created with<br>name "N1"
    d2-->>r1: connect to address1
    d2->>r1: CO_SIGNIN<br>N2, address2
    Note right of r1: stores N2 identity
    Note left of d1: name changed<br>from "temp-NS"<br>to "N2"

    deactivate d2
    deactivate d1

BenediktBurger commented 1 year ago

@bilderbuchi suggested in https://github.com/pymeasure/leco-protocol/pull/38#discussion_r1100593181 :

A Coordinator `Co1` joining a network follows a few steps:
1. It signs in to one Coordinator `Co2` of the Network.
2. If successful, `Co2` sends a list of all Coordinators (and their addresses) that it knows (could be part of sign-in)
3. `Co1` signs in to all Coordinators on this list that it does not know yet (step 1)
4. All Coordinators are connected to all others.

BenediktBurger commented 1 year ago

@bklebel suggested in https://github.com/pymeasure/leco-protocol/pull/38#discussion_r1102355317 :

A Coordinator `Co1` joining a network follows a few steps:
1. `Co1` signs in to one Coordinator `Co2` of the Network.
2. After a successful sign-in handshake, `Co1` requests a list of all other Coordinators known to `Co2` (`Co3`, `Co4`, ...).
3. `Co1` signs in to all the Coordinators on the list of `Co2`
4. `Co1` requests the lists of the connected Coordinators from all now connected Coordinators (except from `Co2`, it already has this one)
5. `Co1` compares the lists, and notifies Coordinators which have an incomplete set of the missing Coordinators
6. Coordinators with previously incomplete sets do sign ins to the new Coordinators they have been told about, until all Coordinators are connected to all others

In this case, if some Coordinator somehow drops out and those lists get out of sync, the whole system brings itself back into sync at every sign-in of a new Coordinator. The only new message type would be "hey, you are missing a few more connections". For clarity, I separated the request for the list of connected Coordinators from the initial SIGNIN, although that could be part of the SIGNIN flow.
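
The comparison in step 5 could look roughly like the following sketch. The function name and the data layout (a mapping from Coordinator name to the set of Coordinators it reported) are assumptions made for illustration, not part of the proposal.

```python
def missing_coordinators(reported):
    """Return, per Coordinator, the Coordinators it is not yet connected to.

    `reported` maps a Coordinator name to the set of Coordinator names it
    listed as connected (hypothetical layout of the collected lists).
    """
    everyone = set(reported)
    for peers in reported.values():
        everyone |= peers
    missing = {}
    for name, peers in reported.items():
        lacking = everyone - peers - {name}
        if lacking:  # only Coordinators with an incomplete set get notified
            missing[name] = lacking
    return missing


# Lists as collected by Co1 after it signed in to Co2, Co3 and Co4.
lists = {
    "Co2": {"Co1", "Co3", "Co4"},
    "Co3": {"Co1", "Co2"},  # Co3 lost its link to Co4
    "Co4": {"Co1", "Co2"},  # and Co4 lost its link to Co3
}
print(missing_coordinators(lists))
```

Each entry of the result corresponds to one "hey, you are missing a few more connections" message that `Co1` would send.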

We would need a separate mapping from Coordinator full name to Coordinator (router) address.

Remember that we have a separate DEALER for every other Coordinator's connected ROUTER socket. Yes, we need (in the implementation) to keep track of which DEALER socket belongs to which remote Coordinator's ROUTER address, but we need to do that anyways.

BenediktBurger commented 1 year ago

> track of which DEALER socket belongs to which remote Coordinator's ROUTER address, but we need to do that anyways.

We do not need (in principle) to keep track of addresses. As long as I know which DEALER socket leads to which Namespace, I do not need the address of that Namespace's Coordinator (only for the initial connect).

I see, however, the benefit of your proposals. So (using both of your proposals as a base):

  1. Co1 signs in to Co2 of the Network
  2. Co2 also signs in to Co1.
  3. Both Coordinators request the local Directory of the other one, containing Components and Coordinators (including addresses).
  4. Both Coordinators repeat steps 1-3 for each Coordinator they are not yet signed in with.

The advantage of this (note that step 4 applies to both Coordinators) is that two Networks may be joined if a single Coordinator of one Network signs in to a Coordinator of the other Network. Another advantage: as all other Coordinators update their Coordinator list during sign-in, the Network gets "healed" from missing links.
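
How a single sign-in can join two Networks may be easier to see in a small simulation. The sketch below is an assumption-laden model: the `Coordinator` class is hypothetical, a dictionary of peers replaces the sockets, and the Directory exchange is reduced to reading the other side's peer list.

```python
class Coordinator:
    """Hypothetical in-memory model; a dict of peers replaces the sockets."""

    def __init__(self, name):
        self.name = name
        self.peers = {}  # name -> Coordinator we are signed in to

    def sign_in(self, other):
        if other is self or other.name in self.peers:
            return  # already signed in: nothing to do, the recursion stops
        self.peers[other.name] = other  # step 1
        other.peers[self.name] = self   # step 2: the mirrored sign-in
        # Step 3: each side looks at the Coordinators the other one knows.
        for first, second in ((self, other), (other, self)):
            # Step 4: repeat the procedure for every unknown Coordinator.
            for candidate in list(second.peers.values()):
                first.sign_in(candidate)


# Two separate Networks, each fully connected.
a1, a2 = Coordinator("A1"), Coordinator("A2")
b1, b2 = Coordinator("B1"), Coordinator("B2")
a1.sign_in(a2)
b1.sign_in(b2)

a1.sign_in(b1)  # one sign-in between the two Networks

everyone = [a1, a2, b1, b2]
print(all(len(co.peers) == len(everyone) - 1 for co in everyone))  # True
```

The recursion ends because every call either returns immediately or adds a link that did not exist before, and the number of possible links is finite.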

Example:

It works even in another sequence:

BenediktBurger commented 1 year ago

If we store the address (in our local directory), the CO_SIGNIN message (which contains Namespace in the sender name and the address) is sufficient to identify the corresponding DEALER socket:

  1. Create a DEALER socket with a temporary name. Store this pair of temporary namespace and address.
  2. Sign in to the other Coordinator, which responds with its own CO_SIGNIN message containing its own address.
  3. Compare the address in the CO_SIGNIN with the addresses in the namespace list. Change the namespace of the entry with the matching address.

One difficulty: if the address used for the connection differs from the address the Coordinator sends via CO_SIGNIN (e.g. full name like "machine.company.com" vs. "machine" vs. "123.03.5.12"), this no longer works without doing name resolution and comparing the IP addresses (as bytes, not as strings, due to zeros).
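
A rough sketch of such a comparison follows. It assumes IPv4 only, addresses written as "host:port", and reads the remark about zeros as referring to leading zeros in the dotted notation; the function name, the port number, and the fake name table are hypothetical. The resolver is injectable so that the example can be tried without a network.

```python
import socket


def address_key(address, resolve=socket.gethostbyname):
    """Return (packed IPv4 bytes, port) for a "host:port" string.

    `resolve` maps a host name to a dotted IPv4 string.
    """
    host, _, port = address.rpartition(":")
    dotted = resolve(host)
    # Parse each part as a decimal number, so "03" and "3" give the same byte.
    packed = bytes(int(part) for part in dotted.split("."))
    return packed, int(port)


# Hypothetical name table standing in for real name resolution.
FAKE_DNS = {"machine.company.com": "123.3.5.12", "machine": "123.3.5.12"}


def fake_resolve(host):
    return FAKE_DNS.get(host, host)  # unknown hosts are taken as IP strings


connected_to = address_key("machine:12300", fake_resolve)
announced = address_key("123.03.5.12:12300", fake_resolve)
print(connected_to == announced)  # True, although the strings differ
```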

bilderbuchi commented 1 year ago

@bklebel remember, when designing these flows, that they should be as composable as possible, with few decisions/branches, and needing little/no state. One process can trigger another (or itself again). This should result in a cleaner, slimmer list of processes, and there is less state carried around.

In your case, the steps 1., 3., 6. are quite similar - ideally they would just trigger another instance of the same sign-in procedure, and everything would shake out as desired automatically, without many checks for "unknown this", "rest of that", etc. That's what I aimed at with my proposal.

In that vein, I like Benedikt's last cleaned up proposal, which is basically a two-sided version of this: (tweaked the wording to make it clear that the same process just starts again with another set of participants)

  1. Co1 signs in to Co2 of the Network
  2. Co2 also signs in to Co1.
  3. Both Coordinators request the local Directory of the other one, containing Components and Coordinators (including addresses).
  4. Both Coordinators sign in to each Coordinator they are not yet signed in with.

bklebel commented 1 year ago

@bilderbuchi yes, that was in principle the idea behind it; in trying to make it clearer, I muddied it. I like your last proposal here. What was missing in your original proposal (I think) was the part where, in one sign-in process, the Coordinator which starts the conversation also tells the other about their local directory (possibly assuming that the directory of a newly started Coordinator is empty).

@bmoneke I like the "illegal" answer to the DEALER socket. In general, I think it would be a good idea to say that a Coordinator may only start a conversation using its DEALER socket, but if there is just a bit of back-and-forth in that one conversation, it should best go across this one DEALER-ROUTER connection. Regarding the implementation: if we handled it otherwise, the Coordinator would have to filter the incoming messages on the ROUTER socket for those related to this one conversation, which would be more complicated than using the one DEALER connection we have anyway, to which we can answer very simply.

BenediktBurger commented 1 year ago

> 4. Both Coordinators sign in to each Coordinator they are not yet signed in with

What this wording misses is the repetition of steps 3 and 4, unless we include both steps in the definition of "sign in".

An additional idea: we want to add "Coordinator heartbeats", in which Coordinators announce their local Directory regularly (every fraction of an hour). On reception of that message, the receiving Coordinator shall connect to all unknown Coordinators. We can use the same message after a sign-in.

So the following change:

  1. Both Coordinators request the local directory of the other one.
  2. Both Coordinators handle the received "local directory message", see xy.

Then the message handling states: Check whether each Coordinator in the local Directory is known. If not, sign in.

BenediktBurger commented 1 year ago

> I like the "illegal" answer to the DEALER socket.

In fact, the DEALER response is more difficult to handle:

The first proposal does not require any additional logic:

  1. The Coordinator reads all the messages at the ROUTER socket.
  2. If a message is for the Coordinator itself: handle it.
  3. If it is a CO_SIGNIN: use it for the DEALER renaming.

Proposal with answer to dealer socket:

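I might have a solution: keep a list of DEALER sockets that are waiting for an answer, and check each of them for an arrived message.

handle_router()  # first handle the messages at the ROUTER socket
for sock in open_sockets:  # DEALER sockets still requiring an answer
    if sock.poll(0):  # timeout 0, assuming a pyzmq-style poll: do not block
        handle_dealer_message(sock)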

If we do not wait for any dealer socket, that for loop ends immediately.

Let's do it via the Dealer!
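
To try the idea without any ZeroMQ sockets, the loop can be exercised with stand-in objects. Everything in this sketch is hypothetical: `FakeDealer` only mimics a `poll`/`recv` pair, and removing a socket from the waiting list once it has been answered is an assumption about how the list would be maintained.

```python
class FakeDealer:
    """Stand-in for a DEALER socket; only mimics poll() and recv()."""

    def __init__(self, name):
        self.name = name
        self.inbox = []

    def poll(self, timeout=0):
        return len(self.inbox)  # truthy if a message is waiting

    def recv(self):
        return self.inbox.pop(0)


def check_waiting_dealers(waiting, handle):
    """Poll each DEALER that awaits an answer; drop it once answered."""
    for sock in list(waiting):  # copy: the list is modified inside the loop
        if sock.poll(0):
            handle(sock, sock.recv())
            waiting.remove(sock)


handled = []


def record(sock, message):
    handled.append((sock.name, message))


dealer = FakeDealer("temp-NS")
waiting = [dealer]

check_waiting_dealers(waiting, record)  # nothing arrived yet: no effect
dealer.inbox.append(b"ACK")             # the answer arrives
check_waiting_dealers(waiting, record)  # handled, list is empty again
print(handled, waiting)  # [('temp-NS', b'ACK')] []
```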

bilderbuchi commented 1 year ago

> What was missing in your original proposal (I think) was the part where, in one sign-in process, the Coordinator which starts the conversation also tells the other about their local directory (possibly assuming that the directory of a newly started Coordinator is empty).

Indeed, you're right.

> What this wording misses is the repetition of steps 3 and 4, unless we include both steps in the definition of "sign in".

They are part of the "sign in" -- the list of enumerated steps defines the process (the way I understood it).

BenediktBurger commented 1 year ago

I updated my version in the PR according to this discussion.

BenediktBurger commented 1 year ago

I have to modify the sign-in procedure, as I ran into timing issues (the CO_SIGNIN message arrives at the ROUTER before the acknowledgment arrives at the DEALER, such that several connections are established...)

sequenceDiagram
    participant r1 as ROUTER
    participant d1 as DEALER
    participant r2 as ROUTER
    participant d2 as DEALER
    Note over r1,d1: N1 Coordinator<br>at address1
    Note over r2,d2: N2 Coordinator<br>at address2
    Note over r1,d2: Sign in between two Coordinators
    Note right of r1: shall connect<br>to address2
    activate d1
    Note left of d1: created with<br> name "temp-NS"
    d1-->>r2: connect to address2
    d1->>r2: V|COORDINATOR|N1.COORDINATOR|H|<br>CO_SIGNIN
    Note right of r2: stores N1 identity
    r2->>d1: V|N1.COORDINATOR|N2.COORDINATOR|H|ACK
    Note left of d1: DEALER name <br>set to "N2"
    d1->>r2: V|N1.COORDINATOR|N2.COORDINATOR|H|<br>Here is my local directory
    Note right of r2: Updates global <br>Directory and signs <br>in to all unknown<br>Coordinators,<br>also N1
    Note over d1,r2: Mirror of above sign in
    activate d2
    Note left of d2: created with<br>name "N1"
    d2-->>r1: connect to address1
    d2->>r1: V|COORDINATOR|N2.COORDINATOR|H|<br>CO_SIGNIN
    Note right of r1: stores N2 identity
    r1->>d2: V|N2.COORDINATOR|N1.COORDINATOR|H|ACK
    Note left of d2: Name is already "N1"
    d2->>r1: V|N2.COORDINATOR|N1.COORDINATOR|H|<br>Here is my local directory
    Note right of r1: Updates global <br>Directory and signs <br>in to all unknown<br>Coordinators
    Note over r1,d2: Sign out between two Coordinators
    Note right of r1: shall sign out from N2
    d1->>r2: CO_SIGNOUT
    Note right of r2: removes N1 identity
    d2->>-r1: CO_SIGNOUT
    Note right of r1: removes N2 identity
    deactivate d1

Now we have a strict sequence (no concurrency) and symmetry again. We use the directory exchange to trigger the connection to the other Coordinator.

BenediktBurger commented 1 year ago

We could even use the normal "SIGNIN"/"SIGNOUT" commands (with filtering for the Component name==COORDINATOR)!
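
Such a filter could be as small as the sketch below. It assumes that sender names are byte strings of the form `N1.COORDINATOR`, as in the diagram above, with a bare `COORDINATOR` used while the namespace is not yet known; both function names and the two sets are hypothetical.

```python
def is_coordinator(sender: bytes) -> bool:
    """Return True if the sender's Component name is COORDINATOR."""
    # The Component name is the part after the last dot; a bare
    # b"COORDINATOR" has no namespace part and is returned unchanged.
    component = sender.rpartition(b".")[2]
    return component == b"COORDINATOR"


def handle_sign_in(sender, components, coordinators):
    """Store the sender of a SIGNIN in the matching set."""
    target = coordinators if is_coordinator(sender) else components
    target.add(sender)


components, coordinators = set(), set()
handle_sign_in(b"N1.COORDINATOR", components, coordinators)
handle_sign_in(b"N2.actor", components, coordinators)
print(coordinators, components)  # {b'N1.COORDINATOR'} {b'N2.actor'}
```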

BenediktBurger commented 1 year ago

For this reason, it is good to have a test environment: I encountered the problem while updating my implementation according to the PR.

bilderbuchi commented 1 year ago

> For this reason, it is good to have a test environment: I encountered the problem while updating my implementation according to the PR.

Agreed, I like it! Ideally, these tests make it into a test suite, so that we can keep checking assumptions etc.

BenediktBurger commented 1 year ago

By test environment, I meant an actual implementation (in contrast to a production environment). I have a script that starts a second Coordinator connecting to a first one, in order to test it quite easily.

However, I'm writing the appropriate tests as well, to ensure that it works properly and to catch errors.

This error, however, manifested (due to timing) in the implementation (connecting two Coordinators) and not in unit tests.

I'm already using the Coordinators / Actors in the lab (keeping them in sync with the points of LECO that have been decided upon).

BenediktBurger commented 1 year ago

Done by #38