BUG: URSYS Message Behavior #30

Closed benloh closed 1 year ago

In GitLab by @daveseah on Mar 10, 2021, 15:15

This is an omnibus issue covering the following:

[ ] When two apps use UR.HandleMessage('NET:MESSAGE', data=>{}), issuing UR.CallMessage('NET:MESSAGE',{}) from either app enters an infinite loop (the transaction never completes)
[ ] When two apps use UR.HandleMessage('*:MESSAGE',data=>{}), issuing UR.RaiseMessage('*:MESSAGE', {}) from one app does not send the message across network, but it does reach the handler defined local to the issuer.
[ ] When two apps use UR.HandleMessage('NET:MESSAGE',data=>{}), issuing UR.RaiseMessage('NET:MESSAGE',{}) unexpectedly causes the handler local to the issuer to fire as well. It shouldn't fire, because NET:MESSAGE means that the handler should process only incoming network messages, not LOCAL messages with the same name.
[ ] The behavior of the * channel is not well-defined for BOTH Handlers AND Issuers (and doesn't seem to work)
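
To pin down the expected behavior in the third item, here is a toy model of the rule "a NET: handler only fires for packets that arrived over the network". `ToyMessager`, `handleMessage`, `dispatch`, and the `fromNet` flag are all made up for illustration; this is not the URSYS messager code, and it deliberately says nothing about the `*` channel.

```javascript
// Toy model of the intended NET: handler rule. Not URSYS code.
class ToyMessager {
  constructor() {
    // full message name (including channel prefix) -> array of handlers
    this.handlers = new Map();
  }

  handleMessage(mesgName, fn) {
    if (!this.handlers.has(mesgName)) this.handlers.set(mesgName, []);
    this.handlers.get(mesgName).push(fn);
  }

  // fromNet is true only when the packet arrived over the network.
  // Returns the number of handlers that were invoked.
  dispatch(mesgName, data, { fromNet = false } = {}) {
    const isNetName = mesgName.startsWith('NET:');
    // NET: handlers must ignore messages raised inside the same app
    if (isNetName && !fromNet) return 0;
    const fns = this.handlers.get(mesgName) || [];
    fns.forEach(fn => fn(data));
    return fns.length;
  }
}
```

With this rule, a `NET:MESSAGE` raised by the issuing app never reaches that app's own `NET:MESSAGE` handler; only the copy delivered to the other app (with `fromNet` set) does.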

Background

In GEMSTEP, we expanded the message broker system to use channels, which let us further distinguish between types of messages and route them more efficiently. This forces implementers to add a "scope" to their messages and also keeps a single message namespace from becoming polluted.

In the past, URSYS made the distinction between local and network messages by using two APIs for messaging. Using NetSend() would send messages to remotes if any handlers existed, whereas LocalSend() would only send to handlers local to the issuing app.

As a result, there were six distinct commands:

LocalSend('MESG')           NetSend('MESG')
LocalSignal('MESG')         NetSignal('MESG')
LocalCall('MESG')           NetCall('MESG')

Changing to channel syntax reduces these to:

SendMessage('LOCAL:MESG')   SendMessage('NET:MESG')
RaiseMessage('LOCAL:MESG')  RaiseMessage('NET:MESG')
CallMessage('LOCAL:MESG')   CallMessage('NET:MESG')
---
Note: for local messages, you can leave off the `LOCAL:` prefix

In addition to LOCAL: and NET: channels, we have a utility channel * which is supposed to mean "send the message to all channels", but this has never been used in code prior to GEMSTEP.
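
As a rough illustration of the naming rules above, a message name could be decoded like this. `decodeChannel` and its return shape are hypothetical, not the actual URSYS implementation, and the handling of `*` only reflects its intended meaning ("all channels"), which this issue notes is not well-defined in practice.

```javascript
// Hypothetical helper: split 'CHANNEL:NAME' into its parts.
// Illustrates the naming rules only; not taken from URSYS.
const KNOWN_CHANNELS = ['LOCAL', 'NET', '*'];

function decodeChannel(mesgName) {
  if (typeof mesgName !== 'string' || mesgName.length === 0) {
    throw new Error('message name must be a non-empty string');
  }
  const parts = mesgName.split(':');
  if (parts.length > 2) {
    throw new Error(`too many ':' separators in '${mesgName}'`);
  }
  // A name without a prefix is treated as LOCAL
  const [channel, name] = parts.length === 1 ? ['LOCAL', parts[0]] : parts;
  if (!KNOWN_CHANNELS.includes(channel)) {
    throw new Error(`unknown channel '${channel}'`);
  }
  if (name.length === 0) {
    throw new Error(`missing message name in '${mesgName}'`);
  }
  return {
    channel,
    name,
    toLocal: channel === 'LOCAL' || channel === '*', // deliver to local handlers
    toNet: channel === 'NET' || channel === '*' // deliver across the network
  };
}
```

Under these assumptions, `decodeChannel('MESG')` and `decodeChannel('LOCAL:MESG')` give the same result, and `decodeChannel('*:MESG')` sets both `toLocal` and `toNet`.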

Future channels are server specific, such as SRV: to replace the special NET:SRV_ message prefix that is used for all current server-implemented services. This will be more important when we have multi-server, cross-network messaging in place for later phases of GEMSTEP development. Since URSYS is "addressless", having server-specific channels will help with in-band message groups.

In GitLab by @daveseah on Mar 11, 2021, 22:14

After digging through a ton of code, I think the problem is that remote-to-remote call chains were never used by ANY apps. PLAE/iSTEP was the first incarnation to use the system, and it and the NetCreate and MEME codebases all happened to use only the server-only version of the call. So this feature may have been broken for an indeterminate length of time.

In any case, we don't need to fix it now, so I'm punting this until it actually needs to happen. The input system doesn't rely on it even in the prototype stage, because we can let the server broker the data as it has before. Remote-to-remote calls might have been useful for getting targeted state updates from a specific client, but for now we will use a more general model.

In GitLab by @daveseah on Mar 11, 2021, 22:40

TECHNICAL DISCOVERY NOTES

This applies only to CallMessage. SendMessage and RaiseMessage are fine.

When an app calls another remote, the following happens:

The origin examines the name of the message, and if it's a NET:* message it initiates a network packet transaction that sends the packet to the message broker (server).
The broker checks to see if it's a reserved service message beginning with NET:SRV_, and if it isn't then it looks up all the handlers that have registered the message.
The origin packet is cloned and sent to the remote handlers as a forwarded call.
The broker waits for all packets to return, freezing the thread until they do.
The remote receives the packet, and is supposed to (1) execute the associated handler of the same name (including the NET: prefix) and then (2) do a transactionReturn() on the packet.
When all transactions are received back from the remotes, the broker unfreezes the original receiving thread. It updates the data payload and issues a transactionReturn() again.
The origin receives the packet back, and is able to detect that it is a returning transaction by looking up its hash in the transaction dictionary. It calls transactionComplete() on the returned packet, and invokes the stored function object that invokes the original callback function, passing along the data payload.
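
The origin's side of this flow (the first and last steps) can be sketched as follows. This is a simplified model under my own assumptions, not the real packet class: `startCall`, `receivePacket`, the `hash` format, and the `sendToBroker` callback are hypothetical names, and a Promise's `resolve` stands in for the "stored function object".

```javascript
// Hypothetical origin-side bookkeeping for a call transaction.
// A simplified stand-in for the transaction dictionary described above.
const transactions = new Map(); // hash -> function that completes the call
let pktCounter = 0;

// Start a call: remember how to complete it, then send the packet out.
function startCall(uaddr, mesgName, data, sendToBroker) {
  pktCounter += 1;
  const hash = `${uaddr}:${pktCounter}`; // assumed hash format
  return new Promise(resolve => {
    transactions.set(hash, resolve);
    sendToBroker({ hash, mesgName, data });
  });
}

// Inspect an incoming packet. Returns true if it was a returning
// transaction, false if it should be treated as a new request.
function receivePacket(pkt) {
  const complete = transactions.get(pkt.hash);
  if (!complete) return false;
  transactions.delete(pkt.hash);
  complete(pkt.data); // pass the returned payload to the original caller
  return true;
}
```

The point of the sketch is the lookup in `receivePacket`: a packet is only recognized as a return if its hash is still in the dictionary, which is the kind of detection that appears to misfire on the remote side.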

The broken part is the correct detection and handling of a forwarded packet by the remote. Several flags are set to help the message routers determine what to do, but the addition of CHANNELS has complicated this and appears to have broken the logic.

Solution: rediagram the packet flow and fix the flagging system. Things to remember:

server's m_RouteMessage(pkt) is a re-entrant handler that "sleeps" waiting for the forwarding promises to resolve. This is unlocked by a subsequent incoming packet (the returning one) that also calls m_RouteMessage().
The data structure that makes it possible is the transaction hash table, which uses a hash (the uaddr+pktid) to store the resolve() function for each outgoing packet. As each packet is returned, the hash is the same and so the resolve() function can be invoked.
On the client side, client-network handles routing. Incoming packets are inspected to see if they are new requests or returning transactions. Ultimately, the remote request also has to issue a Call via class-endpoint, which calls class-messager. The message router has to correctly set a number of flags to trigger the correct invocation. However, the implementation of channels seems to have broken the lookup, so that's where I'd look.
The destination is inferred from the message name and whether it's a new message or a returning transaction. New messages from the NET for calls are reflected back after calling NET:MESSAGE. However, NET:MESSAGE is not technically callable in the local context now; I suspect that's where it's messing up.
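
For the rediagramming, the broker side can be modeled roughly like this. It is a sketch under assumptions, not the actual m_RouteMessage() code: `forwardAndWait`, `onReturn`, the `key` field, and the `remote.send()` interface are hypothetical, and I'm assuming the hash for each forwarded clone is built from the remote's uaddr plus the packet id.

```javascript
// Hypothetical broker-side forwarding with a transaction hash table.
const pending = new Map(); // 'uaddr:pktid' -> resolve() for one forwarded packet

// Clone the packet to every remote and "sleep" until all clones return.
function forwardAndWait(pkt, remotes) {
  const promises = remotes.map(remote => new Promise(resolve => {
    const key = `${remote.uaddr}:${pkt.id}`; // assumed hash format
    pending.set(key, resolve);
    remote.send({ ...pkt, data: { ...pkt.data }, key }); // forwarded clone
  }));
  // Merge every returned payload into the original data payload
  return Promise.all(promises).then(results =>
    Object.assign({}, pkt.data, ...results)
  );
}

// Called for each incoming packet; wakes the matching forward if any.
function onReturn(returned) {
  const resolve = pending.get(returned.key);
  if (!resolve) return false; // not a returning forward
  pending.delete(returned.key);
  resolve(returned.data);
  return true;
}
```

Here `Promise.all` plays the role of the sleeping re-entrant handler: nothing is returned to the origin until `onReturn` has been called once for every forwarded clone.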

In GitLab by @daveseah on Mar 15, 2021, 10:03

CURRENT BEST PRACTICE for URSYS MESSAGING

I thought it would be useful to explain why we haven't needed the remote-to-remote call lately. It's due to a change in how we structure the relation between data, gui update, and gui entry.

PROBLEM ONE: DATA MANAGEMENT

In the earliest versions of STEP, our first JavaScript prototypes were designed to talk to the server via a websocket, but each service needed its own API method and custom coding on both clients and server. This was slow to implement and more difficult to understand because each feature used its own logic.

UNISYS, the precursor to URSYS, was a messaging system that allowed us to define messages on-the-fly, defining a standard use pattern of (message, callback) for defining a receiver and (message, data) for requests. This made it much easier for developers to pass data to and from any computer on the network.

However, the connection between data updates and gui updates ("data binding") was still fragile because of the lack of an established pattern to follow.

PROBLEM TWO: SYNCHRONIZING GUI WITH DATA MANAGEMENT

We are now using React for our front end, so called because it is "reactive" to changes in data flow. It is a popular framework, which makes it a good choice for making the STEP codebase accessible to a broad pool of developers.

There are two complications we have seen:

React documentation is biased toward examples that demonstrate "the React way" of dataflow and lifecycle. However, React by itself is a poor host for realtime simulation. Our data and lifecycle needs are quite different.
React examples often don't make the distinction between application state and GUI state, which are different entities. Conflating them leads to bugs and data synchronization conflicts.

CURRENT PRACTICE

Current STEP now follows these practices using URSYS:

Data is centralized and managed on the client - Our data modules own their data, and provide accessor methods. They MUST be dependency-free, so any module can import them without consequence.
Data modules are responsible for synchronization - When data in a data module changes through one of its accessor methods, the module keeps the data locally as a cache and also writes to the server as needed.
Our Code is Source of Truth - Our data modules are the source of truth; React mirrors this data for GUI state, not the other way around. All important data AND lifecycle is controlled by our code, and React has been put into a supporting role.
GUI Rendering - The data modules use URSYS messages to broadcast data updates whenever the data changes. React components that are dependent on this data can handle the data broadcast and update React state. This then drives a GUI re-render. React components only hold COMPONENT STATE, which is DERIVED from our data models.
GUI Changes to Data - When a React component handles a UI event, it may indicate that the user wants to change the underlying data. Instead of trying to change the data directly, these components use the appropriate data module accessor instead. React components then rely on the GUI Rendering cycle described above to effect the overall update resulting from the data change.
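
The practices above can be sketched in plain JavaScript, without React or URSYS. Everything here is a stand-in that I made up for illustration: the in-memory `handleMessage`/`raiseMessage` bus replaces URSYS messaging, `makeAgentData` is a hypothetical data module, `AGENTS_UPDATED` is an invented message name, and `makeAgentListView` mimics a React component that only mirrors broadcast data. Passing the messaging and server functions in as arguments is my way of keeping the module free of imports in this sketch.

```javascript
// Tiny in-memory bus standing in for URSYS messaging (assumption).
const handlers = new Map();
const handleMessage = (name, fn) =>
  handlers.set(name, [...(handlers.get(name) || []), fn]);
const raiseMessage = (name, data) =>
  (handlers.get(name) || []).forEach(fn => fn(data));

// Hypothetical data module: owns the data and exposes accessor methods.
function makeAgentData({ raiseMessage, writeToServer }) {
  const agents = new Map(); // local cache; the source of truth on the client
  function getAgents() {
    // hand out copies so callers cannot change the data directly
    return [...agents.values()].map(agent => ({ ...agent }));
  }
  function setAgent(id, props) {
    agents.set(id, { ...props, id });
    writeToServer({ ...props, id }); // synchronize as needed
    raiseMessage('AGENTS_UPDATED', { agents: getAgents() }); // broadcast change
  }
  return { getAgents, setAgent };
}

// Stand-in for a React component: holds only state derived from broadcasts.
function makeAgentListView(handleMessage) {
  const view = { state: { agents: [] }, renderCount: 0 };
  handleMessage('AGENTS_UPDATED', data => {
    view.state = { agents: data.agents }; // mirror the data module
    view.renderCount += 1; // a real component would re-render here
  });
  return view;
}
```

A UI event handler in the view would call `setAgent()` rather than editing `view.state`, and then rely on the `AGENTS_UPDATED` broadcast to bring the displayed state up to date.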

RAMIFICATION for URSYS

Because we're using this model, we no longer need to use remote-to-remote message calls as we did with some early versions of STEP. All of our operations now go to the server, which broadcasts data changes to all subscriber apps on the network, and each subscriber then uses that data to change React internal state so the GUI rerenders.
