probe-lab / zikade

A Go implementation of the libp2p Kademlia DHT specification
Other
10 stars 3 forks source link

Actor Model Architecture #38

Closed dennis-tra closed 11 months ago

dennis-tra commented 1 year ago

After watching the walkthrough video and reviewing the code, I can't help but be reminded of the actor model, which has already addressed similar challenges to the ones we're facing with the current go-kademlia architecture. Further, there's already an established language and common terminology in this area.

In the past, I have worked extensively with protoactor-go (website), which draws inspiration from Akka.NET and Microsoft Orleans. While all three libraries offer more functionality than what we currently need, I believe we can still leverage their concepts and even some of their code. Interestingly rust-libp2p has also converged on a very similar pattern with its Hierarchical State Machines, albeit from a different starting point. There are several important concepts employed by all three actor libraries that I believe could be relevant for us. For example, the actor hierarchies, lifecycle events, behaviours (different from rust-libp2p behaviours), mailboxes, middlewares (easy tracing), supervision, and probably more.

The biggest difference in following the actor pattern would be to have multiple event queues instead of a single one because we'd work with a hierarchy of actors and each actor would have exactly one event queue (mailbox) to which all senders enqueue their messages. I understand why the current implementation uses only a single event queue. We want to have sequential message processing to ensure deterministic tests. By introducing a hierarchical model sequential message processing would only happen on a per actor basis. Communication between actors happens asynchronously via messages.

Looking at the SimpleQuery I think the control flow is not super obvious because state manipulation is distributed across a few different places like requestError, handleResponse, and newRequest. SimpleQuery is a natural contender to be its own actor.

The current examples show how nodes interact with each other using the SimpleQuery implementation. However, I believe there must be another layer above this where the orchestration of queries and other tasks are handled. In the examples, the top-level functions craft queries and enqueue them into schedulers. In the future, I believe, we'll have, let's say, a DHT struct that handles pushing new queries to the scheduler, performing cleanup tasks on shutdown, or generally orchestrating the several actions that are running. Following this assumption, I believe there's a hierarchy that would naturally translate to a hierarchy of actors.

I can't help but feel we'd miss out if we didn't make use of what has already been built in the past. Though, the biggest disadvantage I can see is that we wouldn't have a single event queue - which seems to be a hard requirement? Perhaps there's a middle ground? From my experience, testing in protoactor-go (with a myriad of actors) was very well possible and also fine. Flakiness was not a problem. However, crafting the sequence of events was not always straightforward. This is super subjective but for me, the actor model really resonates with the way I think and makes high asynchronicity bearable.

Note: I'm not advocating for pulling in protoactor-go as a dependency. I just want us not to miss out on what has already been developed there and start a discussion around this.

guillaumemichel commented 1 year ago

I am not familiar with the proto actor model, but as long as it is doing the same tasks that the scheduler is doing I am happy to switch to it. I am willing to use well adopted standard where relevant.

If it is doing the same job as the scheduler, it should be easy to implement a new scheduler mode making use of the proto actor model. It is then trivial to compare functionality and performance with the SimpleScheduler.

The main requirement of scheduling is that everything can be run by a single go routine, in a sequential and deterministic manner. We require the minimal amount of async work (essentially only networking tasks), such that a test or a simulation is always reproducible, deterministic, easy to use and reason about.

Please do not consider the SimpleQuery module as finished, I am still working on it.

yiannisbot commented 1 year ago

Meeting Recording CID where the model was discussed: bafybeibxpi2pxqsk7cq3bk4kzle3dh6wnlu3bour72aq5rycv6bdkkdi4u

Feedback from @guillaumemichel copied from the DHT Project Planning Notion page.

It would be easy to implement messages rather than functions to be given to the event queue. For this a new Action module must be defined in https://github.com/plprobelab/go-kademlia/tree/main/events/action. This new MessageAction would be a struct, containing a state (message type and value), and a Run(ctx) function to execute the action to handle this message (just like the Update function in bubbletea). This works for a limited number of message types (because each type requires its own handling), so we cannot create new message types, which is not great for the composable DHT. However, if we keep the current BasicAction that is basically a function, it allows advanced users designing composable DHT RPCs to add their own custom BasicActions to the event queue. So we are not limiting anyone, and only advanced users use the “function” (and not “message” that is more user friendly) action. If a composable DHT feature becomes a standard, we can decide to add it to the MessageAction , so that it can be queued as a “message” and not a “function”.

Feedback from @marcopolo copied from the DHT Project Planning Notion page.

  • I agree with iand’s point at the end “How complicated is Kademlia really? Do we need the complexity of this actor framework to do what we want?”
  • I think both the callback model and actor model are equivalent, my question is which one is clearer and more straightforward? Which one is easier to program, debug, and understand?
    • I’m a bit worried that many actor patterns have an associated framework attached to them. I would push against bringing in any dependency unless it’s worth its weight in gold.
    • The callback model seems to cover what we want without introducing more unnecessary abstraction.
  • The system that lets me see what’s happening with fewer jumps across files (or even worse, dependencies) is better.
yiannisbot commented 1 year ago

Things to identify before we decide to go one way or the other (Actor Model or No Actor Model):

iand commented 1 year ago

I've spent some time on a state machine approach, modeled after the hierarchical state machine design used in rust-libp2p (where it is used as the primary control flow throughout the library, not just Kademlia). The rust-libp2p repo provides the following brief rationale for this:

Using hierarchical state machines is a deliberate choice throughout the rust-libp2p code base. It makes reasoning about control and data flow simple. It works well with Rust's Future model. It allows fine-grain control e.g. on the order child state machines are polled.

The above comes with downsides. It feels more verbose. The mix of control flow (loop, return, break, continue) in poll functions together with the asynchronous and thus decoupled communication via events can be very hard to understand. Both are a form of complexity that we are trading for correctness and performance which aligns with Rust's and rust-libp2p's goals.

The architecture pattern of hierarchical state machines should be used wherever possible.

It continues with advice on the importance of bounding behaviour and resources, including the following:

Bound the number of tasks being spawned. As an example, say we spawn one task per incoming request received from a socket. If the number of pending requests is not bounded by some limit, a misbehaving or malicious remote peer can send requests at a higher rate than the local node can respond at. This results in unbounded growth in the number of requests, and thus unbounded growth in the number of tasks and used memory.

As an experiment I implemented the Kademlia query subsystem as a hierarchical state machine in PR 65.

After spending time on this I believe the state machine approach is simpler and clearer than the existing implementations and also provides the predictable behaviour that we're looking for. It shares a similar architecture to the Actor Model but is declarative in the types that are used which makes the transitions between states explicit.

Based on this I've refined the implementation and promoted it from an example to the main code line in PR 74.

I also wrote a rationale document which I am copying here for visibility:

State Machine Implementation Rationale

The main non-functional design goal of this new Kademlia implementation is to create a more predictable, maintainable and testable codebase that can provide an extensible platform for new DHT capabilities. As OSS it is important that new contributors are attracted to the project so ease of understanding, straightforward debugging and consistency of design are also high priority.

The hierarchical state machine approach addresses these goals in the following ways:

In addition, the state machine model enforces a structured and controlled execution flow, making it easier to impose limits on resource consumption: