paritytech / substrate

Substrate: The platform for blockchain innovators
Apache License 2.0
8.39k stars 2.65k forks source link

Move Aura, Babe, GrandPa and the informant inside of the Service #3097

Open tomaka opened 5 years ago

tomaka commented 5 years ago

The situation right now is that the Service contains (as in, "owns") the import queue, transaction queue, network, client, RPC endpoint, and offchain workers, but Aura/Babe, GrandPa and the informant are built on top and plug themselves into the service with hooks.

This design doesn't necessarily make sense to me, and I think that we should just put everything inside the service.

The only main change that would have to be done is refactor the service creation process to have an additional step. Instead of creating the service, then Aura/Babe/GrandPa, we should have an intermediary state that Aura/Babe/GrandPa are built upon, and then only build the service incorporating Aura/Babe/GrandPa

dvc94ch commented 5 years ago

I think the reason it was done this way was because of the idea of pluggable consensus so the service can't own aura/babe/grandpa unless there is a trait they can implement.

tomaka commented 5 years ago

the service can't own aura/babe/grandpa unless there is a trait they can implement.

This is exactly the case for the import queue and the network specialization. They are owned by the service and being interfaced through a trait.

dvc94ch commented 5 years ago

So what would those traits look like and would it work for all consensus algorithms?

tomaka commented 5 years ago

Right now, everything the consensus and finality need is passed through Arcs at initialization, and all they do is spawn a few tasks in the background. So that trait could even be Any. It would however be nice to require Future, for #2536.

There's obviously not much point in adding a few Box<Any> in the service if everything is already done in the background. However, after that issue we will unlock some potential additional refactoring and Arcs removal.

Depending on the general feeling, one thing that might be nice is to make Aura/Babe/Grandpa produce events such as "generate blocks", "finalize block", "broadcast message", etc. instead of directly performing the actions itself. But that's for later.

rphmeier commented 5 years ago

Depending on the general feeling, one thing that might be nice is to make Aura/Babe/Grandpa produce events such as "generate blocks", "finalize block", "broadcast message", etc. instead of directly performing the actions itself. But that's for later.

For finality in particular ordering and atomicity is really important. I don't see an event-based architecture going anywhere.

A phased construction is reasonable, but it shouldn't involve direct references to specific consensus algorithms within the generic service code.

IMO, the abstractions we have in pluggable consensus are fine as-is and shouldn't interfere with any reordering of the pipeline.

tomaka commented 5 years ago

For finality in particular ordering and atomicity is really important. I don't see an event-based architecture going anywhere.

Ordering and atomicity are exactly what I'm trying to enforce by having a stream of events.

Right the ordering of operations within Substrate is totally "unspecified"/undocumented and is probably in two or three people's heads right now.

What I'd like to move towards is actually making it clear in the API and documentation in which ways it is legal to interface with GrandPa/Aura/Babe, and in which way GrandPa/Aura/Babe interface with the rest of the codebase. But it's difficult to do that if the code consists in threads and Arcs shared towards the code base that can be called at any time.

I'm always reluctant to say out loud the word "event", because people immediately assume that the events are loose and unordered. That has not to be the case. Maybe "message" or "action" is more appropriate.

tomaka commented 5 years ago

@rphmeier

What I'd like to aim for in the very long term is an API similar to this for Aura and Babe:

/// Implemented on `Aura` and `Babe`.
pub trait ConsensusEngine {
    type Block;

    /// Runs the consensus engine until the next event.
    /// Can be interrupted at any time by dropping the future, in case we need to call one of the
    /// other methods of this trait.
    /// (for the sake of the example, assume that this method is valid to write. In practice, we probably
    /// need some trait trickery to make this work).
    fn next_event(&mut self) -> impl Future<ConsensusEvent<Self::Block>>;

    /// Called when we receive a block from the network.
    /// Would replace the current `Verifier` trait.
    fn verify_block(&mut self, origin: BlockOrigin, header: Header) -> Result<(), Error>;

    /// Can be used to stop authoring blocks if we are offline.
    fn set_authoring_blocked(&mut self, blocked: bool);

    // something like this? I don't know enough about the keystore to know if this is right
    // would be called by the RPC if the user changes key
    // fn set_signing_key(&mut self, some_key_index: ???);

    // ... any other way to update the behaviour of the consensus while it is ...
    // ... running could be added here ...
}

/// Event produced by the consensus engine.
pub enum ConsensusEvent<Block: BlockT> {
    /// A block has been generated and has been imported by the client.
    BlockProduced(Block::Hash),

    // ... any other event that the service should react to ...
}

Does that make sense?

The only way the consensus engine would communicate with the outside is through this trait and through the Client which would be passed via an Arc. I'm also considering passing the &Client by parameter to every function, so that all the consensus engine would store is purely state.

This would make it possible to completely remove all Arcs and Mutexes from the consensus engines. It would also make it possible to create a consensus engine with just Aura::new or Babe::new, and generally considerably simplify the service creation.

If this general direction looks good, I'd move slowly towards this goal and do something similar for finalization engines.

rphmeier commented 5 years ago

This would make it possible to completely remove all Arcs and Mutexes from the consensus engines

I don't see how it can reduce sharing in the case of block verification vs other consensus work. All you've done is unify the background worker future and synchronous verification. It's also not obvious how parallel verification queues are supposed to work with this. Note that ImportQueue is more important than basic_queue::Verifier, which is only an auxiliary trait for a basic_queue.

Is the plan also to make networking occur via events?

In general, it seems like you are only looking as far as Aura and Babe. What about GRANDPA? What about every other consensus engine? I'd rather not make the same mistake as parity-ethereum by trying to codify engine behavior to the point where it becomes impossible to build anything new on the framework.

gnunicorn commented 5 years ago

I agree with the approach of the general direction, but I am unsure about the interface design for the simple fact that we don't have much experience with plugging consensus yet: we only have two actual authoring and one finality gadget - we haven't even ported rhododendron yet, nor do we have the POW finished. Both come with very different approaches, with RHD providing both authoring and finality at the same time and requiring networking access, for POW I am not even sure what would be needed in terms of API.

I agree that what we currently have isn't pretty but also agree with @rphmeier that it is "fine as is", as it would for sure allow all of the above to be implement. I am afraid that by defining an API this early on, we restrict this more than we need or want out of a lack of experience. I would postpone this for now.

tomaka commented 5 years ago

I'd rather not make the same mistake as parity-ethereum by trying to codify engine behavior to the point where it becomes impossible to build anything new on the framework.

I think that the way to avoid that is exactly to provide one single entry point for all interactions between the consensus engine and the outside. This entry point can then be modified depending on our needs. In my opinion, the problem of parity-ethereum is that we assume that the consensus behaves in a very precise way. I'd like to clearly write down exactly (through this trait) what we require this behaviour to do, and we can then decide to loosen it or tighten it. Right now we already require the consensus to behave a specific way, except that it is implicit instead of explicit.

It's also not obvious how parallel verification queues are supposed to work with this.

True. I generally prefer designing things "for now" rather than for future use case.

But to answer more precisely, we could slightly modify the trait. Instead of having a synchronous verify method call, we could replace it with start_verifying (which returns nothing and only adds the block to a queue), and add a VerificationFinished variant in ConsensusEvent.

This moves the import queue within the consensus engine, but I guess that they are already quite tied together.

tomaka commented 5 years ago

I am afraid that by defining an API this early on, we restrict this more than we need or want out of a lack of experience. I would postpone this for now.

trying to codify engine behavior to the point where it becomes impossible to build anything new on the framework.

Just to insist on these points: we already have requirements. They are just implicit.

Just because we write this behaviour requirement down as a trait doesn't mean that this trait shouldn't change in the future.

The objective is adding documentation, clarity, and predictability to the code.

rphmeier commented 5 years ago

The current requirements are less than what is being set out here -- for instance, you are going from "can do any arbitrary work in the background" to "can do work only in terms of predefined events". When these events are insufficient for a downstream crate to function, there are 2 options.

  1. make a change to Substrate (indicating that this approach is fragile).
  2. mix Substrate events with just doing arbitrary work in the background. Now you also run into trickier issues w.r.t. reasoning about ordering of events.

And of course it is possible to change code, but something as pervasive as a consensus engine has the potential to become so entrenched that it's not practical to do at all. Take a look at parity-ethereum.


About the event-based approach in general: Consensus engines right now can base internal state on when something has actually been done (due to invariants of functions that they call internally and synchronously). With the event-returning approach, you introduce an inherent race that this API at least doesn't address: you need another hook for when that event has been processed. It also means that engines have to introduce an internal state of "I dispatched an event but don't know if it's been enacted yet".

rphmeier commented 5 years ago

The assumption we are currently making is that consensus engines are free to set their own assumptions. They should be drawing on core pieces of Substrate infrastructure, and beyond that, defining interfaces with clear invariants in the engine's crate.

This is not the same assumption as a central ConsensusEngine trait would make, since it has to decide how networking, block import, DB access, etc. work for every possible consensus engine.

rphmeier commented 5 years ago

I generally prefer designing things "for now" rather than for future use case.

It's not just arbitrary future use-case I'm talking about here: Verification queues are concrete items on the Substrate/Polkadot engineering roadmaps that we've already thought through where they would fit in with the current design. Removing that and ignoring the previous design thinking is just a step backwards.

tomaka commented 5 years ago

The assumption we are currently making is that consensus engines are free to set their own assumptions.

Then what about something more like this?

pub trait ConsensusEngine: ConsensusEngine {
    /// Can be used to stop authoring blocks if we are offline.
    fn set_authoring_blocked(&mut self, blocked: bool);
}

pub trait ConsensusEngineSerialVerify: ConsensusEngine {
    fn verify_block(&mut self, origin: BlockOrigin, header: Header) -> Result<(), Error>;
}

pub trait ConsensusEngineParallelVerify: ConsensusEngine {
    fn start_verify(&mut self, origin: BlockOrigin, header: Header);
    fn next_verified(&mut self) -> ???;
}

pub trait ConsensusEngineWithNetwork: ConsensusEngine {
    fn on_network_message(&mut self, ...);
}

/// Required only if we put an RPC server on top of the service.
pub trait ConsensusEngineRpc {
    // fn set_signing_key(&mut self, some_key_index: ???);
}

The plugging would be done by the user when creating the service. Some components would require the consensus to implement some trait, while some other components would require other traits.

The important point is not that this precise trait written above should be adopted. The important point is that we should move from passing Arcs and Mutexes everywhere at initialization to having one single point of communication between the consensus and the outside.

The trait described above works right now for Aura and Babe, which is why I drafted it. But again, I'm convinced that whatever design we come up with, we can modify the structure of the trait in order to remain generic.

Verification queues are concrete items on the Substrate/Polkadot engineering roadmaps

It'd be great to see that engineering roadmap if possible, in order to take more informed decisions.

rphmeier commented 5 years ago

Some components would require the consensus to implement some trait, while some other components would require other traits.

Are we in a situation where components are expecting something of the consensus engine? Typically it's the opposite situation: the consensus engine is expecting certain behaviors of other components. That's where traits are useful.

The important point is that we should move from passing Arcs and Mutexes everywhere at initialization to having one single point of communication between the consensus and the outside.

Why is this desirable? What seems more important is that the consensus engine communicates via interfaces, not whether those interfaces are implemented via Arc'd types.

My point is that a trait ConsensusEngine is similar to trait ComputerProgram...keeping it simple (i.e. fn run) is the only reasonable approach. I agree that abstracting over instantiation is difficult which is why we haven't attempted to do that.

burdges commented 5 years ago

Ain't clear if this is relevant, but..

We'll want a consensus engines that interact more with themselves across epochs and/or subdivide epochs for constant block time, so vaguely like the delay babe imposes upon the key, but more complex. As an example, this one spreads the full slot assignment process across like five epochs: https://github.com/w3f/research/blob/master/docs/polkadot/BABE/sortition/index.md#secondary-randomness Avoiding slashing might require reliable secret storage across epochs.

We also want VDFs to be dedicated threads that interact with the consensus engine. And collators and bridges could've strange consensus engines too.