Refactor node's public API to facilitate moving state into parallel tasks

mzabaluev commented 5 months ago

The top-down design dictated by traits like SuzukaFullNode where the "god object" is borrowed immutably as &self by async methods does not work well with background tasks that ideally need to work on moved, rather than shared, bits of state. You end up needing Arc<Mutex<_>> for anything mutable even if conceptually it's only used in one task.

Consider refactoring to an API that will have two phases:

Initializing the node through a non-async constructor method or builder API.
The async method to run the node consuming the state, something like pub async fn run(self) -> Result<(), NodeError>.

_Originally posted by @mzabaluev in https://github.com/movementlabsxyz/movement/pull/58#discussion_r1601170509_

Design discussion

The design guidelines have been refined since the original comment; see #340

Implementation PRs

[x] #308
[ ] #404

mzabaluev commented 5 months ago

An alternative approach is to construct two objects: a handle for invoking public methods on, and a future to drive the background implementation.

Here's an example in #58. This also demonstrates how implementation generics can be offloaded to the anonymous future rather than cluttering the signature of the handle object.

In case when an object produces a stream of events and the architecture does not call for multiple subscribers consuming these events, the second returned object could be a stream.

musitdev commented 5 months ago

I agree with you, and this pattern of using Arc<Mutex<_>> to manage concurrent access can lead to at best contention and at worth deadlock when more and more functionalities are added. From my experience, what works well is a sort of supervisor loop that start task and manage data sharing using actor model or event driven pattern. The idea is to dissociate the way you access the data (you access them by moving, cloning, locking depending on the context) from how you process them.

l-monninger commented 5 months ago

@mzabaluev

An alternative approach is to construct two objects: a handle for invoking public methods on, and a future to drive the background implementation.

The advantages of this are presumably:

Signatures for public methods can remain & self because receiver-based tasks are being managed elsewhere.
Overall separation of concerns.

The disadvantages are:

Readability.
API disunity.

Additionally, any state in the background implementation should really only be accessed through the public methods. So, we are kind of disintegrating something that should be integrated.

For me, ideally, these concerns are pushed down from the top-level implementation. I certainly don't care if we're doing the above somewhere. But, to the extent we can wrap things up in a nice bow, I still encourage us to do so.

Ultimately, though, for me the ideal is still to have an API that manages this complexity and looks something like this w.r.t. background tasks: https://github.com/a16z/helios?tab=readme-ov-file#using-helios-as-a-library

@musitdev

The idea is to dissociate the way you access the data (you access them by moving, cloning, locking depending on the context) from how you process them.

I also don't mind a bit of an overuse of synchronization primitives in early development .But, I agree, there should be better patterns we can leverage here. @mzabaluev and I have talked about this being an actor pattern already, though an impure one. Can you sketch out how you would modify the client to implement a purer actor pattern. Diagrams help.

mzabaluev commented 5 months ago

Ultimately, though, for me the ideal is still to have an API that manages this complexity and looks something like this w.r.t. background tasks: https://github.com/a16z/helios?tab=readme-ov-file#using-helios-as-a-library

The API design for a protocol client is often simpler: client operations can often be fully driven as request-response flows, conveniently represented as async methods. In protocols where background state-keeping and processing is needed, patterns like the one proposed above are observed. I have actually pulled it from hyper.

Additionally, any state in the background implementation should really only be accessed through the public methods. So, we are kind of disintegrating something that should be integrated.

For the actor pattern as proposed, all interface methods should remain with the actor's handle, which can be borrowed as &self recipient since it only contains the sender end of the command channel of the actor. If necessary, the handle can be Clone (and the command channel is MPSC) to be replicated among multiple invokers of the methods/commands. If a method requires a return value, this can be done with passing a oneshot sender with the corresponding command and waiting for the value to be returned in the method's async implementation, as recommended in tokio docs.

The background counterpart contains the recipient end of the command channel and all mutable state of the agent. It does not need more complex public interface than a Future impl to schedule it in a task. If the actor can be modelled as producing a stream of events, it can be a Stream instead. Broadcasting and/or pub-sub to propagate events can be handled by the service framework.

(I should work myself up to produce a diagram for this some time later.)

mzabaluev commented 5 months ago

Appropriately, our own unit tests demonstrate the shortcomings of the current approach: when you have a god object to serve external requests and run the service tasks, it needs to be cloned and it needs to have interior mutability.

musitdev commented 5 months ago

I agree with your comment and the event (pubsub) / actor model for the API. The thing I would like to add is we should avoid forcing everything to be async. For example, some actor task like Tx execution should be executed using thread. Future can be a way to manage both, but in fact task are async (network related) or sync (processing task) not both. So when a task is defined, we should have the possibility to define it async or sync depending on it is CPU or Network bound. Async function has the disadvantage to propagate and force everything to be async. This lead to complex issue where CPU bounded task are run with spawn_blocking everywhere, and we end with a thousand thread running when the load increase with no way to manage them. If we refactor the code, we should provide an efficient sync execution part and all CPU task shouldn't be async.

mzabaluev commented 5 months ago

where CPU bounded task are run with spawn_blocking everywhere, and we end with a thousand thread running when the load increase with no way to manage them.

Unless you also multiply runtime instances, all blocking tasks should be scheduled on the same thread pool (which may present its own problems). But I agree that we should not make components async unless there is a benefit to it. Tokio's channels support blocking operations and therefore can be used for message-passing between synchronous and asynchronous tasks.

musitdev commented 5 months ago

Yes, using channel is a pattern to mix async/sync. If we change the archi, one thing that we should do is to integrate our logic in automate state machine so ease the dev and evolution. The pattern is the state of the machine change when receiving event and emit a message on change.

Some logic in node dev can be very complex and hard to implement because they involve a lot of separate components or process. For example, the block processing logic. Currently, it is spread in separate part: read_blocks_from_da manage the Da availability, the execution and the commitment sending but not the error that can occurs. The commiment logic (accepted, rejected) is not implemented but won't be in read_blocks_from_da. So to manage the block logic (Da, execution, commitment process and errors that can occurs) the process is going to be implemented in several places where it can be hard to link together all the logic. What I propose is to define all the block logic in one place that receive event from the other process (Da, Execution, storage, Settlement, Error) and manage all state transition. These state transition will generate task in the other module (Execution, storage, Settlement, ..). This way the complex part, the block logic, is done in one place and the other module are more like actor that process block logic messages and return event after the processing.

mzabaluev commented 3 months ago

A guide that describes the pattern I want to change to: Actors with Tokio.

l-monninger commented 1 month ago

@mzabaluev Can this be close in favor of #340 and follow-ons?

movementlabsxyz / movement

Refactor node's public API to facilitate moving state into parallel tasks #63

Design discussion

Implementation PRs