Actor Model Architecture

After watching the walkthrough video and reviewing the code, I can't help but be reminded of the actor model, which has already addressed similar challenges to the ones we're facing with the current go-kademlia architecture. Further, there's already an established language and common terminology in this area.

In the past, I have worked extensively with protoactor-go (website), which draws inspiration from Akka.NET and Microsoft Orleans. While all three libraries offer more functionality than what we currently need, I believe we can still leverage their concepts and even some of their code. Interestingly rust-libp2p has also converged on a very similar pattern with its Hierarchical State Machines, albeit from a different starting point. There are several important concepts employed by all three actor libraries that I believe could be relevant for us. For example, the actor hierarchies, lifecycle events, behaviours (different from rust-libp2p behaviours), mailboxes, middlewares (easy tracing), supervision, and probably more.

The biggest difference in following the actor pattern would be to have multiple event queues instead of a single one because we'd work with a hierarchy of actors and each actor would have exactly one event queue (mailbox) to which all senders enqueue their messages. I understand why the current implementation uses only a single event queue. We want to have sequential message processing to ensure deterministic tests. By introducing a hierarchical model sequential message processing would only happen on a per actor basis. Communication between actors happens asynchronously via messages.

Looking at the SimpleQuery I think the control flow is not super obvious because state manipulation is distributed across a few different places like requestError, handleResponse, and newRequest. SimpleQuery is a natural contender to be its own actor.

The current examples show how nodes interact with each other using the SimpleQuery implementation. However, I believe there must be another layer above this where the orchestration of queries and other tasks are handled. In the examples, the top-level functions craft queries and enqueue them into schedulers. In the future, I believe, we'll have, let's say, a DHT struct that handles pushing new queries to the scheduler, performing cleanup tasks on shutdown, or generally orchestrating the several actions that are running. Following this assumption, I believe there's a hierarchy that would naturally translate to a hierarchy of actors.

I can't help but feel we'd miss out if we didn't make use of what has already been built in the past. Though, the biggest disadvantage I can see is that we wouldn't have a single event queue - which seems to be a hard requirement? Perhaps there's a middle ground? From my experience, testing in protoactor-go (with a myriad of actors) was very well possible and also fine. Flakiness was not a problem. However, crafting the sequence of events was not always straightforward. This is super subjective but for me, the actor model really resonates with the way I think and makes high asynchronicity bearable.

Note: I'm not advocating for pulling in protoactor-go as a dependency. I just want us not to miss out on what has already been developed there and start a discussion around this.

I am not familiar with the proto actor model, but as long as it is doing the same tasks that the scheduler is doing I am happy to switch to it. I am willing to use well adopted standard where relevant.

If it is doing the same job as the scheduler, it should be easy to implement a new scheduler mode making use of the proto actor model. It is then trivial to compare functionality and performance with the SimpleScheduler.

The main requirement of scheduling is that everything can be run by a single go routine, in a sequential and deterministic manner. We require the minimal amount of async work (essentially only networking tasks), such that a test or a simulation is always reproducible, deterministic, easy to use and reason about.

Please do not consider the SimpleQuery module as finished, I am still working on it.

Meeting Recording CID where the model was discussed: bafybeibxpi2pxqsk7cq3bk4kzle3dh6wnlu3bour72aq5rycv6bdkkdi4u

Feedback from @guillaumemichel copied from the DHT Project Planning Notion page.

It would be easy to implement messages rather than functions to be given to the event queue. For this a new Action module must be defined in https://github.com/plprobelab/go-kademlia/tree/main/events/action. This new MessageAction would be a struct, containing a state (message type and value), and a Run(ctx) function to execute the action to handle this message (just like the Update function in bubbletea). This works for a limited number of message types (because each type requires its own handling), so we cannot create new message types, which is not great for the composable DHT. However, if we keep the current BasicAction that is basically a function, it allows advanced users designing composable DHT RPCs to add their own custom BasicActions to the event queue. So we are not limiting anyone, and only advanced users use the “function” (and not “message” that is more user friendly) action. If a composable DHT feature becomes a standard, we can decide to add it to the MessageAction , so that it can be queued as a “message” and not a “function”.

Feedback from @marcopolo copied from the DHT Project Planning Notion page.

I agree with iand’s point at the end “How complicated is Kademlia really? Do we need the complexity of this actor framework to do what we want?”

I think both the callback model and actor model are equivalent, my question is which one is clearer and more straightforward? Which one is easier to program, debug, and understand?

I’m a bit worried that many actor patterns have an associated framework attached to them. I would push against bringing in any dependency unless it’s worth its weight in gold.

The callback model seems to cover what we want without introducing more unnecessary abstraction.

The system that lets me see what’s happening with fewer jumps across files (or even worse, dependencies) is better.

Things to identify before we decide to go one way or the other (Actor Model or No Actor Model):

What are the pros and cons of the Actor Model Architecture
What do we miss if we don’t go with the Actor Model Arch?
Is that something that we can do later? Or is it we do it now or never?
How much effort is needed to plug in the Actor Model to Gui’s codebase?

I've spent some time on a state machine approach, modeled after the hierarchical state machine design used in rust-libp2p (where it is used as the primary control flow throughout the library, not just Kademlia). The rust-libp2p repo provides the following brief rationale for this:

Using hierarchical state machines is a deliberate choice throughout the rust-libp2p code base. It makes reasoning about control and data flow simple. It works well with Rust's Future model. It allows fine-grain control e.g. on the order child state machines are polled.

The above comes with downsides. It feels more verbose. The mix of control flow (loop, return, break, continue) in poll functions together with the asynchronous and thus decoupled communication via events can be very hard to understand. Both are a form of complexity that we are trading for correctness and performance which aligns with Rust's and rust-libp2p's goals.

The architecture pattern of hierarchical state machines should be used wherever possible.

It continues with advice on the importance of bounding behaviour and resources, including the following:

Bound the number of tasks being spawned. As an example, say we spawn one task per incoming request received from a socket. If the number of pending requests is not bounded by some limit, a misbehaving or malicious remote peer can send requests at a higher rate than the local node can respond at. This results in unbounded growth in the number of requests, and thus unbounded growth in the number of tasks and used memory.

As an experiment I implemented the Kademlia query subsystem as a hierarchical state machine in PR 65.

After spending time on this I believe the state machine approach is simpler and clearer than the existing implementations and also provides the predictable behaviour that we're looking for. It shares a similar architecture to the Actor Model but is declarative in the types that are used which makes the transitions between states explicit.

Based on this I've refined the implementation and promoted it from an example to the main code line in PR 74.

I also wrote a rationale document which I am copying here for visibility:

State Machine Implementation Rationale

The main non-functional design goal of this new Kademlia implementation is to create a more predictable, maintainable and testable codebase that can provide an extensible platform for new DHT capabilities. As OSS it is important that new contributors are attracted to the project so ease of understanding, straightforward debugging and consistency of design are also high priority.

The hierarchical state machine approach addresses these goals in the following ways:

Determinism - The state machine follows a deterministic flow, meaning that given a particular input or set of conditions, it will always produce the same output. This predictability makes it easier to isolate and reproduce issues, as bugs are more likely to occur consistently within specific states or transitions. The events in the state machines can be linearised, in the sense that the same sequence of events will produce the same sequence of state transitions, regardless of the timing of event arrival. When combined with a determistic clock implementation this enables accurate simulation of time-dependent behaviours, such as timeouts.
Transparency - Each state in the machine is assigned a specific responsibility, focusing in on a particular aspect of the protocol's behaviour. This focused scope narrows down the search for potential problems to specific areas of the codebase, reducing the debugging space and making it more manageable. The states are represented as simple data values that can be easily inspected or logged which provides developers with clear visibility into the state transitions and the current state of the protocol during its execution. Developers can track the sequence of events leading to an issue, enabling them to identify root causes and simplifying resolution.
Testability - The state machines can be unit tested by testing each state and transition separately. This approach helps ensure that each component functions correctly, leading to a more reliable and robust protocol implementation. When debugging a specific state, developers can isolate that state's context, ignoring the complexities of other parts of the code. This isolation allows for focused testing and simplifies the understanding of how individual states contribute to the overall behavior. A state machine can be placed in a specific initial state and tested for expected behaviour under various valid and invalid inputs.
Extensibility - A state machine is inherently easier to extend due to its modular and hierarchical structure. Each state in the machine represents a specific behaviour or functionality, making it simpler to add new states and transitions to accommodate additional features or enhancements. The clear separation of concerns, where each state focuses on a specific aspect of the software's behaviour, makes it easier to identify the appropriate location for incorporating new features, reducing the risk of unintended interactions with existing states.

Consistency and Understandability - The hierarchical nature of the state machines promotes readability and understandability of the code. New states are logically integrated into the existing structure, making it easier for developers to comprehend the software's behaviour as a whole. The transitions between states can be drawn to help visualise behaviours. The consistent use of the state machine pattern across components ensures uniformity in behavior and interface, allowing developers to understand the system more quickly and reducing ambiguity.

In addition, the state machine model enforces a structured and controlled execution flow, making it easier to impose limits on resource consumption:

Memory - As the state machines transition between states they can release memory allocated for states that are no longer active.
Concurrency - The state machines provide bounds on concurrency, offering full control over the number of in-flight operations.
CPU - The finite number of states and combinations ensure that the work to be undertaken is bounded and known ahead of time.
Network - Network connections are opened, closed, and reused appropriately based on the state transitions. This controlled socket management can help prevent leaks or exhaustion, leading to more efficient utilization of network resources.

probe-lab / zikade

Actor Model Architecture #38