paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.com/

Elastic Scaling: Streamlined Block Production #5190

Open eskimor opened 3 months ago

eskimor commented 3 months ago

For maximum throughput in elastic scaling we need to streamline the block production, otherwise best throughput can only be achieved by restricting the collator set to size 1, which would impact liveness and decentralization.

Problem Statement

Blocks are built in sequence. You can only start building the next block once you know the state of the chain after the previous block. This implies in the current implementation that in order to start building you need to have downloaded and fully imported the previous block. Hence, you can only start building the next block after:

  1. The previous block was fully built.
  2. Was downloaded.
  3. Was executed/imported again on the local collator.

This adds a significant overhead, essentially halving the potential execution throughput.

Proposed Solution

  1. During block production, the block author will stream the hashes of all the transactions it puts into the block (a rough message sketch follows this list). The streaming is necessary because we can't know in advance how many transactions will fit.
  2. Other nodes check their mempool for the referenced transactions. If any are missing, we wait for them to arrive.
  3. We build the block based on the stream of incoming hashes.
  4. Now we know the new state roughly at the same time as the original author built it: barely any additional delay is introduced.
  5. We receive the block announcement from the block author and compare the state root with the one at which we arrived. If it matches, we have everything we need. If any of the above failed, we just fall back to normal block import and download the block as usual.
  6. We successfully doubled the throughput and reduced bandwidth requirements.
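
A rough sketch of what this streaming could look like on the wire, purely as an illustration: the message names, the plain 32-byte hash type and the handler below are assumptions for this sketch, not an existing polkadot-sdk API.

```rust
// Illustrative sketch only; names and types are invented for this example.
type Hash = [u8; 32];

/// Messages the block author streams while it is building the block.
enum BlockStreamMessage {
    /// The author just pushed this transaction into the block (step 1).
    /// Receivers look the hash up in their own mempool (step 2) and apply the
    /// same transaction to their local copy of the block (step 3).
    TransactionIncluded(Hash),
    /// The author finished the block. Receivers compare `state_root` with the
    /// root they computed themselves (step 5).
    End { state_root: Hash },
}

/// What a receiving collator does with each message, including the fallback.
fn on_stream_message(msg: BlockStreamMessage, local_state_root: Hash) {
    match msg {
        BlockStreamMessage::TransactionIncluded(hash) => {
            // Look up `hash` in the local mempool; if it is missing, wait for
            // the transaction to arrive before applying it.
            let _ = hash;
        }
        BlockStreamMessage::End { state_root } => {
            if state_root == local_state_root {
                // We already hold an equivalent block: no download, no re-import.
            } else {
                // Something diverged: fall back to normal block import.
            }
        }
    }
}

fn main() {
    on_stream_message(BlockStreamMessage::TransactionIncluded([1u8; 32]), [0u8; 32]);
    on_stream_message(BlockStreamMessage::End { state_root: [0u8; 32] }, [0u8; 32]);
}
```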

Possible future enhancements

A list of hashes should be small enough, but if we wanted, more compact representations of the included transactions should certainly be feasible.

Alternatives

If we find extrinsics are small enough or that we just don't care about bandwidth at this point, we could stream the extrinsics directly, instead of their hashes.

Considerations

By sending the list of hashes of transactions before building the block, a block author reveals itself earlier. This is not a problem with Aura (as the author is known anyway), but might be unfortunate in other protocols like SASSAFRAS. For SASSAFRAS, using the proposed anonymizing technique by picking a random node as the relayer should do the trick.

sandreim commented 3 months ago

Blocks are built in sequence. You can only start building the next block once you know the state of the chain after the previous block. This implies in the current implementation that in order to start building you need to have downloaded and fully imported the previous block. Hence, you can only start building the next block after:

  1. The previous block was fully built.
  2. Was downloaded.
  3. Was executed/imported again on the local collator.

This adds a significant overhead, essentially halving the potential execution throughput.

Yeah, the block authorship is decoupled from the block import, so if the previous block import doesn't finish, we will build on its parent and create a fork.

Proposed Solution

  1. Before block production starts, the block author will gossip a list of hashes of all the transactions it wants to put into the block.

We don't really know how many fit in a block until we have actually applied them. So we can't fully rely on a snapshot; we might include too few, or a bit too many. I would consider pushing the transactions in batches and then gossiping the hashes of what went into each batch.

  1. Other nodes, check their mempool if all the transactions are present. If no, we wait a bit and try again.

We might, for example, never learn about a transaction in good time, so all the work is wasted and so is the next slot. I propose to add a protocol for fetching transactions. The process should start as soon as we notice one is missing.
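
For illustration, a minimal shape such an on-demand fetch protocol could take, under the assumption of a simple request/response pair keyed by hash; none of these names exist in polkadot-sdk.

```rust
// Illustrative sketch only; a hypothetical request/response pair for fetching
// transactions that were referenced in the stream but are missing locally.
type Hash = [u8; 32];

struct TransactionsRequest {
    /// Hashes seen in the stream but not present in the local pool. The
    /// request should be sent as soon as the first one is found missing.
    missing: Vec<Hash>,
}

struct TransactionsResponse {
    /// The encoded transactions, in the same order as the requested hashes.
    transactions: Vec<Vec<u8>>,
}

fn main() {
    let request = TransactionsRequest { missing: vec![[7u8; 32]] };
    let response = TransactionsResponse { transactions: vec![vec![0xde, 0xad]] };
    assert_eq!(request.missing.len(), response.transactions.len());
}
```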

  1. Eventually we should have all the referenced transactions and can start building the very same/equivalent block in parallel with the actual block author.

I don't think we should be building the complete block, we just need to do the storage changes by applying the extrinsic. We do however want to do all the checks on the block announcement and maybe fast import it since we have the storage changes.

  1. Now we know the new state roughly at the same time as the original author built it: Barely any additional delay introduced.

We need to wait for the announcement of the previous block, so that actually depends on the network latency.

  2. We receive the block announcement from the block author and compare the state root with the one at which we arrived. If it matches, we have everything we need. If any of the above failed, we just fallback to normal block import and download the block as usual in that phase.

  3. We successfully doubled the throughput and reduced bandwidth requirements.

Not sure what you mean by reduced bandwidth requirements. We actually raise it by additional gossip.

Possible future enhancements

A list of hashes should be small enough, but if we wanted, more compact representations of the included transactions should certainly be feasible.

I don't think this should be a problem, especially when we know who the block author is (AURA), or even with SASSAFRAS anonymisation tech when talking to the block author. In any case it should be at most a couple of hundred Kbytes.

Additionally, we can stack another optimisation on top of this proposal. We can change the AURA slot mapping such that one collator has N consecutive slots to build. N would likely be fixed, but ideally it should be dynamic, matching the number of cores assigned to the parachain at an RCB.
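
As a hedged illustration of that slot-mapping change: Aura-style round robin assigns one author per slot, and the sketch below simply groups N consecutive slots per author. The function name and exact formula are assumptions, not the actual Aura implementation.

```rust
// Illustrative sketch only: give each author `consecutive_slots` slots in a
// row instead of the usual one-slot-per-author round robin.
fn author_index(slot: u64, num_authorities: u64, consecutive_slots: u64) -> u64 {
    (slot / consecutive_slots) % num_authorities
}

fn main() {
    // With 3 authorities and N = 4 consecutive slots:
    // slots 0..=3 -> author 0, slots 4..=7 -> author 1, slots 8..=11 -> author 2, ...
    assert_eq!(author_index(0, 3, 4), 0);
    assert_eq!(author_index(3, 3, 4), 0);
    assert_eq!(author_index(4, 3, 4), 1);
    assert_eq!(author_index(11, 3, 4), 2);
    assert_eq!(author_index(12, 3, 4), 0);
}
```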

eskimor commented 3 months ago

We don't really know how many fit in a block until we have actually applied them. So we can't fully rely on a snapshot; we might include too few, or a bit too many. I would consider pushing the transactions in batches and then gossiping the hashes of what went into each batch.

That's a very good point, so likely not a single message but rather a "streaming" protocol where we send hashes of transactions as we put them into the block. We will likely need direct connections to each collator then as gossiping will likely mess up the ordering, which adds complexity (buffering) and adds to latency. I would leave that decision to the implementing engineer though: Whatever works. In any case, this just increased the complexity/effort of the project.

-> Updated the ticket accordingly.

We might, for example, never learn about a transaction in good time, so all the work is wasted and so is the next slot. I propose to add a protocol for fetching transactions. The process should start as soon as we notice one is missing.

As long as we don't have data on mempool performance, I would consider this premature optimization. Relying on the mempool is certainly the easiest first step and is likely sufficient in practice. By adding another protocol we might only require more bandwidth (same transaction transmitted twice), while in the worst case not improving latency at all (by the time the request round trip is done the gossip message arrived too).

I don't think we should be building the complete block, we just need to do the storage changes by applying the extrinsic. We do however want to do all the checks on the block announcement and maybe fast import it since we have the storage changes.

Pretty sure that's almost the same thing as building the block.

Polkadot-Forum commented 3 months ago

This issue has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/elastic-scaling-mvp-launched/9392/1

alindima commented 3 months ago

By sending the list of hashes of transactions before building the block, a block author reveals itself earlier. This is not a problem with Aura (as the author is known anyway), but might be unfortunate in other protocols like SASSAFRAS. For SASSAFRAS, using the proposed anonymizing technique by picking a random node as the relayer should do the trick.

Additionally, the next block author needs to know who is the current block author (before the block is announced). Otherwise, a malicious collator could trick the next author into building with the wrong transactions, if there's no way of checking that the collator who is streaming the transactions has the right to do so. This is not an issue for Aura, but I wonder how we can solve this for SASSAFRAS.

bkchr commented 1 month ago

Ty for writing this up @eskimor. Generally it would have been nice to mention @skunert and me, who came up with this.

Some things to consider:

  1. The block producer will also need to send transactions and not just hashes. For example, the inherents are produced by the block producer and need to be sent to the other nodes, so that they have exactly the same ones. Some optimizations, like maybe building some of them locally, could be done. (A small sketch of this distinction follows.)
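
As a purely illustrative refinement of the hash stream, the item type below distinguishes inherents (sent in full, since only the author can construct them, e.g. the timestamp) from pool transactions (sent by hash); the names are invented for this sketch.

```rust
// Illustrative sketch only; names are invented for this example.
type Hash = [u8; 32];

enum StreamedItem {
    /// Inherent extrinsic, sent as its full encoded bytes because the other
    /// collators cannot reproduce it bit-for-bit themselves (e.g. timestamp).
    Inherent(Vec<u8>),
    /// Regular transaction, expected to already be in the receiver's pool.
    TransactionHash(Hash),
}

fn main() {
    let _items = vec![
        StreamedItem::Inherent(vec![0x00, 0x01]),
        StreamedItem::TransactionHash([9u8; 32]),
    ];
}
```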

5. We receive the block announcement from the block author and compare the state root with the one at which we arrived. If it matches, we have everything we need. If any of the above failed, we just fallback to normal block import and download the block as usual in that phase.

The block producer will send the final header over this protocol as well, or better, it only needs to send the Seal to finish the block. No need to wait for the block announcement. If the block is not the same, it will be discarded anyway or there are some "deeper issues".

As long as we don't have data on mempool performance, I would consider this premature optimization. Relying on the mempool is certainly the easiest first step and is likely sufficient in practice. By adding another protocol we might only require more bandwidth (same transaction transmitted twice), while in the worst case not improving latency at all (by the time the request round trip is done the gossip message arrived too).

Yeah I agree here. Our transaction networking protocol is also rather shitty and could be improved by using a set reconciliation protocol, which should be the best we can do here.

Additionally, the next block author needs to know who is the current block author (before the block is announced). Otherwise, a malicious collator could trick the next author into building with the wrong transactions, if there's no way of checking that the collator who is streaming the transactions has the right to do so. This is not an issue for Aura, but I wonder how we can solve this for SASSAFRAS.

Generally not really an issue. The consensus protocol needs to send some kind of signature at the beginning of the block production to prove to the others that it is the eligible block author. We can probably reuse what we will use for the runtime to prove this already.

michalkucharczyk commented 1 week ago

The block producer will send the final header over this protocol as well, or better, it only needs to send the Seal to finish the block. No need to wait for the block announcement. If the block is not the same, it will be discarded anyway or there are some "deeper issues".

Theoretically we don't even need the Seal or the block header. All we need is the set of extrinsics (with its root hash: eh) and the resulting state root (sh). We can do all the work speculatively (building block N [almost] in parallel with the other validator, and N+1 which is the new one). If both hashes (sh, eh) match the ones in the announced header of block N, then we can immediately announce N+1.

This would require some rework in the transaction pool and block builder: we would need to work on some entities that are not real blocks yet, but it could simplify the protocol -- all we would need is just the set of ready_at transactions, which is known when the block starts being built. This set of transactions would need to be sync'ed over the network with the next block builder.
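
A small sketch of the comparison described above, assuming the speculative builder has computed the extrinsics root (eh) and state root (sh) locally. The header fields mirror what a Substrate header carries, but the types here are simplified placeholders.

```rust
// Illustrative sketch only; simplified placeholder types, not real Substrate types.
type Hash = [u8; 32];

struct AnnouncedHeader {
    extrinsics_root: Hash, // "eh" above
    state_root: Hash,      // "sh" above
}

/// If both roots match, the speculatively built block N is equivalent to the
/// announced one, and the speculatively built N+1 can be announced immediately.
fn speculative_block_matches(local_eh: Hash, local_sh: Hash, header: &AnnouncedHeader) -> bool {
    header.extrinsics_root == local_eh && header.state_root == local_sh
}

fn main() {
    let header = AnnouncedHeader { extrinsics_root: [1u8; 32], state_root: [2u8; 32] };
    assert!(speculative_block_matches([1u8; 32], [2u8; 32], &header));
    assert!(!speculative_block_matches([1u8; 32], [3u8; 32], &header));
}
```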

bkchr commented 1 week ago

Theoretically we don't even need Seal or block header. All we need is set of extrinsics (with root hash: eh) and the resulting state (also sh). We can do all the work speculatively (building block N [almost] in parallel with other validator, and N+1 which is the new one). If both hashes (sh,eh) match in the announced header of block N, then we can immedietaly announce N+1.

How do you want to know the resulting state before having applied the transactions? How do you want to apply transactions without having the input state? The seal is for sure required, because you want to ensure that only the eligible block author produced the block and not some random node that just wants to spam you.

michalkucharczyk commented 1 week ago

How do you want to know the resulting state before having applied the transactions?

The N-1 state is known on both builders. Also the set of transactions is known on both sides. So N can be built in parallel on both sides. Once we know N, we can speculatively build N+1 (w/o waiting for any network traffic).

Once the header is known, we can decide if N+1 is legit (it should be if everything goes fine).

bkchr commented 1 week ago

What you are describing is exactly what this issue is about. I mean you probably assume that the set of transactions is known on both sides, and this is not true. You cannot be sure that every node has the same view on the tx pool as you. Just imagine some tx that changes its priority based on its "birth" block, or some node that just hasn't seen a tx yet. Another thing you are also forgetting is that transactions can panic; they should not, but they can.

In the end what we want to do here is to call i_will_apply_tx(hash) before calling apply.
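
A minimal sketch of that ordering; `i_will_apply_tx` is the name used in the comment above, while the trait and the rest of the signatures are invented for this illustration.

```rust
// Illustrative sketch only; everything except the name `i_will_apply_tx` is invented.
type Hash = [u8; 32];

trait BlockStream {
    /// Fire-and-forget announcement to the other collators that this
    /// transaction is about to be applied to the block under construction.
    fn i_will_apply_tx(&mut self, hash: Hash);
}

struct LoggingStream;

impl BlockStream for LoggingStream {
    fn i_will_apply_tx(&mut self, hash: Hash) {
        println!("announcing inclusion of tx {:02x?}", &hash[..4]);
    }
}

fn apply_tx(stream: &mut impl BlockStream, hash: Hash) {
    // Announce first, then do the actual (elided) application of the extrinsic.
    stream.i_will_apply_tx(hash);
    // ... apply the transaction to the block being built ...
}

fn main() {
    let mut stream = LoggingStream;
    apply_tx(&mut stream, [5u8; 32]);
}
```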

michalkucharczyk commented 1 week ago

I know.

Sorry I did not make myself clear enough.

Also the set of transactions is known on both sides.

By this I meant that ready_at set was sync'ed on both sides at the beginning.

michalkucharczyk commented 1 week ago

Another thing you are also forgetting is that transactions can panic, they should not, but they can.

They will panic "deterministically", right? Also the number of transactions that fit into the block shall be exactly the same.

bkchr commented 1 week ago

By this I meant that ready_at set was sync'ed on both sides at the beginning.

What is the difference to just announcing in between what you will apply? I mean it is just "fire and forget", while syncing requires some sort of back and forth between all the nodes.

michalkucharczyk commented 1 week ago

I don't have exact numbers when it comes to timings. So maybe syncing will be more effective if the amount of data to be sent is huge, and txpools are not synced for some reason?

But also, this sync could be part of a new transaction protocol, which could include freezing/agreeing on a set of transactions in the view, so streaming of the ready set is not really needed as a protocol on top of that.

bkchr commented 1 week ago

But also, this sync could be part of a new transaction protocol, which could include freezing/agreeing on a set of transactions in the view, so streaming of the ready set is not really needed as a protocol on top of that.

Goes a little bit into the direction of: https://github.com/paritytech/polkadot-sdk/issues/5869

I don't have exact numbers when it comes to timings. So maybe syncing will be more effective if the amount of data to be sent is huge, and txpools are not synced for some reason?

I mean yes, I would cheat and assume that every node has the transactions and otherwise need to fall back to the transaction pool to somehow get them. Or the node just needs to wait for the final block to arrive. Inherents for example still need to be streamed, because they may depend on the exact time when they were created (like timestamp).

sandreim commented 1 week ago

In principle we just need a fast method for the next block author to fetch any missing transactions from the current block builder. This can be done on demand by requesting individual txs or batches. Perhaps a more efficient solution is for the block author to announce a speculative list of transactions it thinks fit in the next block. Then, the next block author can request the missing txs from the current block author, so we don't serialize multiple request RTT latencies. If there is more space, more txs can be added and streamed, but this will eat into the next slot proportionally to the latency between these authors. So the latency between the current block author and the next one determines the maximum authoring duration limit for the next block.

The diagram below shows how this would work with 3 collators with very little latency (10 ms): [diagram: tx_streaming.drawio]
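
To make the latency argument concrete, a tiny sketch of the budget calculation; the 10 ms figure matches the low-latency setup in the diagram, while the 2-second slot duration is just an assumed example value.

```rust
use std::time::Duration;

// Illustrative sketch only: whatever one-way latency it takes for the streamed
// transactions to reach the next author eats directly into that author's slot.
fn max_authoring_duration(slot_duration: Duration, one_way_latency: Duration) -> Duration {
    slot_duration.saturating_sub(one_way_latency)
}

fn main() {
    let budget = max_authoring_duration(Duration::from_secs(2), Duration::from_millis(10));
    assert_eq!(budget, Duration::from_millis(1990));
}
```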

sandreim commented 1 week ago

For this to work, all of the collators in the set need to do this, not just the current and next block author; otherwise it doesn't work, because you cannot catch up in good time to have the post state at N-1 in order to be able to build N.

alindima commented 1 week ago

A way of not having the overhead of all collators doing this would be to:

sandreim commented 1 week ago
  • when the third collator gets the first block, also retrieve the storage post-state of the block, so that it can build the second block while importing the first one.

For the post state, we might need a lot more bandwidth and this is only available after you have executed the block.

bkchr commented 1 week ago

Sending out transaction hashes to all block builders should not be that expensive. For the first version I would also strictly assume that everyone has all the transactions. If not, it should fall back to the transaction pool. The transaction pool syncing right now is really dumb and doesn't do anything fancy; we had the idea to upgrade to some set reconciliation algorithm. There exist tons in Bitcoin etc. that we can just use. If we have this, the likelihood of transactions not being available should be quite low.

bkchr commented 1 week ago

Even if the tx were missing, it could for example be fetched from a different node. The probability that only the block author has this transaction goes towards 0.

sandreim commented 1 week ago

Even if the tx were missing, it could for example be fetched from a different node. The probability that only the block author has this transaction goes towards 0.

Yeah, that makes sense. I would still like to get some data about how fast transactions propagate. I would expect that the inconsistencies between tx pools increase with the number of nodes (collators + RPCs).

sandreim commented 1 week ago
  • when the third collator gets the first block, also retrieve the storage post-state of the block, so that it can build the second block while importing the first one.

For the post state, we might need a lot more bandwidth and this is only available after you have executed the block.

Actually, we could stream the post state (keys, values) as the block is built. There would be N checkpoints. At each checkpoint the author sends the diff since the last checkpoint to the next block author. The next block author can start building his own block as soon as the entire post state has been streamed. So this means that only the network latency will eat into his slot - no need to execute, but we do import the block in parallel. For this to work well we must also require collators to have a good internet connection.
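
Purely as an illustration, one checkpoint of such a post-state stream could look like the sketch below; all names are invented, and as the following comments point out, the size of these key/value diffs is exactly the contested cost.

```rust
// Illustrative sketch only; names are invented for this example.

/// One checkpoint of the proposed post-state streaming: the key/value changes
/// accumulated since the previous checkpoint, which the next block author
/// applies on top of its copy of the parent state.
struct StateCheckpoint {
    /// Index of this checkpoint within the block being built (0..N).
    index: u32,
    /// Storage keys written since the last checkpoint; `None` means deleted.
    changes: Vec<(Vec<u8>, Option<Vec<u8>>)>,
}

fn main() {
    let checkpoint = StateCheckpoint {
        index: 0,
        changes: vec![
            (b"some:storage:key".to_vec(), Some(vec![42])), // written value
            (b"other:storage:key".to_vec(), None),          // deleted key
        ],
    };
    assert_eq!(checkpoint.changes.len(), 2);
    let _ = checkpoint.index;
}
```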

bkchr commented 1 week ago

Streaming the state makes no sense. A transaction has a size of some bytes and can do modifications of multiple megabytes. Really not sure why you want to do this.

sandreim commented 1 week ago

Streaming the state makes no sense. A transaction has a size of some bytes and can do modifications of multiple megabytes.

Agreed, what you say is a downside, which I also mentioned before. But this is something that can be effectively improved by throwing money at it - increased bandwidth requirements for elastic-scaling collators. I wouldn't drop this idea without getting some hard data and measuring the actual size of these modifications.

In theory you can create such a transaction, but in practice this is most likely an outlier. Such a transaction would need to write garbage; otherwise the data needs to come from pre-existing state, which will need to be put into the PoV, effectively placing a bound on the state modification sizes.

Really not sure why you want to do this.

Reasons why I would do it:

sandreim commented 4 days ago

Reasons why I would do it:

  • avoid more complexity: we don't need to touch the tx pool logic at all

Actually, this is not true. It gets complicated because we also need to stream the transaction hashes, so the next block author does not put the same transactions in its block. Transactions would fail, but authorship time would be lost.