thrumdev / blobs

Blobchains on Polkadot and Kusama
https://docs-site-pi.vercel.app
Apache License 2.0

Design: Horizontally Scaling Data Availability Across Cores #4

Open rphmeier opened 1 year ago

rphmeier commented 1 year ago

Problem Statement:

A single Polkadot core provides 5MB of data per parachain block at a frequency of 6s, for a total of 0.83MB/s data throughput per core. To be truly competitive as a data availability service, Sugondat must utilize multiple cores.

Design Goals:

  1. Scale to use as many cores as possible
  2. Use the existing Parachain framework to access cores rather than waiting for CoreJam
  3. Light-client friendly observing and fetching
  4. Censorship Resistant Blob Submission
  5. User fee payment per blob
  6. Nice to have: hide load balancing from the user
  7. Nice to have: support auto scaling
  8. Nice to have: fast paths for data fetching (to avoid hitting the Polkadot validator set)

The nice-to-have goals are not necessary for any initial design but are worth thinking about now. Furthermore, we should be comfortable using a centralized architecture in an initial rollout, as long as there is a clear path to a decentralized architecture.


Approach 1 draft: Coordinated Worker model

In this model, we would have N Worker chains and 1 Aggregator chain. Each worker would simply make blobs available by bundling them in blocks. A single aggregator chain would scrape the worker chains and construct, in each of its blocks, some kind of indexed data structure (such as an NMT) containing a reference to an application identifier, the blob hash, and a reference to the worker chain ID and block hash. Users would submit blobs to a p2p mempool spanning all the worker chains, with some mechanism ensuring that each blob is only posted to a single worker chain.
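
For concreteness, here is a minimal sketch of what one entry of that indexed structure might look like, assuming an NMT keyed by application identifier. Names and types are illustrative, not taken from the codebase.

```rust
/// 32-byte hash type used throughout the sketch.
type Hash = [u8; 32];

/// One leaf per blob in the aggregator's per-block index. Namespacing by
/// application identifier lets a light client prove inclusion (or absence)
/// of a given application's blobs.
struct IndexedBlob {
    /// Application identifier, used as the NMT namespace.
    app_id: u32,
    /// Hash of the blob itself.
    blob_hash: Hash,
    /// Worker chain (ParaId) that made the blob available.
    worker_para_id: u32,
    /// Worker chain block containing the blob.
    worker_block_hash: Hash,
}
```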

Compared against design goals:

With respect to goal (1), this scales up to the point where the aggregator chain or worker chain collation is the bottleneck. This depends on how expensive the scraping is.

Light clients, for goal (3), would need to follow only the aggregator chain, and then fetch block bodies from the various worker chains. Bridge-based light clients only need the data availability commitment, which is guaranteed by bridging over the aggregator chain alone. The worker approach also assists with goal (8): in the fast path, the light client can fetch the block body from any full node of the worker chain. In the case of a Polkadot DA fallback, the light client needs the candidate hash to make the ChunkFetchingRequest to relay chain validators, and it can only get the candidate hash by scraping relay chain block bodies.
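
Put another way, a light client effectively chooses between two fetch paths per blob. A rough sketch, with hypothetical names not taken from the codebase:

```rust
/// 32-byte hash type used in the sketch.
type Hash = [u8; 32];

/// How a light client could obtain a blob's data after locating it via the
/// aggregator chain's index.
enum BlobFetchPath {
    /// Fast path: ask any full node of the worker chain for the referenced
    /// block body.
    WorkerFullNode { worker_para_id: u32, block_hash: Hash },
    /// Fallback: recover from Polkadot's availability layer, which requires
    /// the candidate hash (scraped from relay chain block bodies) in order to
    /// request erasure-coded chunks from relay chain validators.
    PolkadotAvailability { candidate_hash: Hash },
}
```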

Goals (4), (5), and (6) highlight the difficulty of this approach:

  1. User balances must be global across all workers
  2. Collators for each worker chain should be coordinated to avoid submitting blobs to multiple workers simultaneously

This implies that the worker chains should be advanced by the same pre-consensus process and collator set which is shared among all workers. This pre-consensus process would be responsible for determining what goes into each worker chain block at each moment, and would need to adapt to Polkadot faults such as inactive backing groups.


Approach 2 draft: Uncoordinated Worker Model

The uncoordinated worker model is similar to the coordinated worker model, but sidesteps the need for a pre-consensus by having workers operate independently without any consensus-level mechanism to avoid blobs being posted twice. Users would keep balances on each worker chain and submit blobs only to a specific worker chain. An aggregator chain would still track the blobs published on each worker chain for easy indexing, but load-balancing would become a user-level concern.

rphmeier commented 1 year ago

The aggregator chain seems a necessity to me simply due to goal (3), but I may be overlooking approaches that don't involve spending a core on aggregation alone.

pepyakin commented 1 year ago

Do we actually need to spend a core solely for aggregation? Maybe instead, all cores are occupied by workers, but one of them also performs the additional duty of aggregation? This way, it could start with one core and gradually scale to more cores.

Scaling-by-worker approaches can indeed help with scaling up. Well, kind of, because it's not clear how to scale them down, which in turn affects the decision to scale up. Specifically, if you cannot easily remove worker chains, then the decision to scale up should not be taken lightly, because an all-time-high (ATH) workload spike might be (and often will be) temporary.

Can we easily remove chains? What does that even mean? I guess removing a worker chain is always connected with adding one. Who makes such a decision, and when? I think we should have at least a vague idea about these questions.

Another question: when technology allows, do we want to migrate to a single chain, or still rely on a multi-core solution? If we merge the workers, what would that look like? What happens to the data of those chains, and won't it be an eternal burden for the chain?

It's also worth thinking about who our users are. It seems to me that our users are never the end-users, and as such we may be able to afford to put some burden on them.

Answering some of those questions may inform our design better and/or maybe unlock new design-space directions.

rphmeier commented 1 year ago

Thanks for the questions! I've tried to answer them below and give rationale for the ones I punted on.

(we discussed some of this on the call today, but writing it down for posterity)

Do we actually need to spend a core solely for aggregation

Probably not. The aggregation procedure is fairly light. Let's assume a very basic aggregation approach, where each worker's head-data on the relay chain contains a commitment to a light hash chain of Vec<(blob hash, namespace)>. This will be ~32 + (N * 64) bytes per worker chain block with N blobs.

  1. accept a relay-chain state proof of each worker ParaId's head-data
  2. accept each new hash-chain item for each worker (the aggregator collator would require these)
  3. aggregate all new hash-chain items into an NMT
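
For illustration, a minimal sketch of that per-worker hash chain, assuming 32-byte blob hashes and namespaces and a caller-supplied 256-bit hash function. Names are illustrative, not from the codebase.

```rust
/// 32-byte hash type used in the sketch.
type Hash = [u8; 32];

/// One item per blob in a worker chain block: 64 bytes.
struct HashChainItem {
    blob_hash: Hash,
    namespace: Hash,
}

/// Fold a block's items into the worker's running hash-chain head. The head
/// (32 bytes) is what the worker commits to in its head-data; the aggregator
/// re-derives it from the revealed items (step 2) before inserting them into
/// its NMT (step 3).
fn extend_hash_chain(
    mut head: Hash,
    items: &[HashChainItem],
    hash_256: impl Fn(&[u8]) -> Hash,
) -> Hash {
    for item in items {
        let mut buf = Vec::with_capacity(96);
        buf.extend_from_slice(&head);
        buf.extend_from_slice(&item.blob_hash);
        buf.extend_from_slice(&item.namespace);
        head = hash_256(&buf);
    }
    head
}
```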

Within a single core, we'd likely be able to handle hundreds of worker chain aggregations before hitting the data limit, and that assumes this very simple protocol, which is unlikely to be the most optimized.

I believe we should optimize for usability, however, expressed in two ways:

  1. There should be a single chain to follow for DA aggregations, which should change only rarely with what are effectively hard forks of all hosted rollups.
  2. The latency of aggregation should be low.

Taking (1) and (2) together implies that we should have a single chain which is scheduled as frequently as possible, though with CoreJam eventually it need not take up the entire core.


Specifically, if you cannot easily remove worker chains then the decision of scaling up should not be taken lightly because the ATH workload spike might be (and often will be) temporary

I'd like to punt on fully solving this problem for now, but we should formalize what it means for a worker chain to be "up":

  1. The worker chain's ParaId is in some set maintained by the aggregator
  2. The worker chain is scheduled and progressing

When the worker chain is up, all blobs should be aggregated. When it is down, no blobs should be aggregated.

With this in mind, it should be possible to design some onboarding and offboarding procedures for worker chains to ensure an agreement between the aggregator and worker chain about whether it is up or down.
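
One hypothetical way to make that agreement explicit is a small per-ParaId status tracked by the aggregator, with onboarding and offboarding as explicit transitions (illustrative only, not from the codebase):

```rust
/// Lifecycle of a worker chain as seen by the aggregator, keyed by ParaId.
enum WorkerStatus {
    /// Added to the aggregator's set but not yet aggregated from.
    Onboarding,
    /// Scheduled and progressing: all of its blobs are aggregated.
    Up,
    /// Being wound down: no new blobs are aggregated.
    Offboarding,
    /// Removed from the set.
    Down,
}
```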


Who and when makes such a decision? I think we should have at least a vague idea about those.

I don't have a good answer to this and would prefer to decouple it via a call and a permissioned origin, whatever that may be. Sudo, multisig, token governance, and deferring to DOT governance via XCM are a few options that come to mind, and several of them are driven more by legal considerations than technical ones.


when technology allows, do we want to migrate to a single chain or still rely on multiple cores solution? If we will merge the workers, how would that look like? What happens with the data of those chains and won't it be an eternal burden for the chain?

TBH it is unclear that technology will allow this. Elastic scaling allows a single parachain to be pushed to its sequential limit, but a multi-worker system without some kind of pre-consensus is already parallelized and would not benefit from such a change. I would also prefer to punt on this. Any upgrade would essentially be a hard fork of all hosted rollups, so at the point of doing such an upgrade we could effectively restart from scratch (with balances imported from the formerly many chains).


Also worth thinking of who are our users? It seems to me that our users are never the end-users and as such we may be able to afford to put some burden on them.

Yes, some technical burden on sequencers / library authors is fine (this may just be us), but concerns like latency & throughput do directly affect the end users.

Maybe it's OK to make them responsible for picking the least congested chain for their blob, because the end user in this case would be a sequencer

While I agree it's OK, it is quite difficult for a sequencer to pick the worker chain that is going to be the least congested without maintaining a mempool view of each worker, and conditions can change rapidly between the time a decision is made and the time the next worker chain block is authored. Likewise, there is a strong probabilistic argument that if the total DA load is normally distributed and worker chains are chosen at random, the likelihood that any particular worker chain is congested is quite high, due to the pigeonhole principle, even when the total amount of pending blobs can be met by the throughput of the system as a whole. TBH, this also opens up a new question on pricing, because congested chains will cost more.
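
A toy simulation of that random-assignment effect (illustrative numbers only; uses the rand 0.8 crate, nothing from the repo): even when aggregate demand is below aggregate capacity, some workers routinely end up over their per-block limit.

```rust
use rand::Rng;

fn main() {
    let workers = 10;
    let capacity_per_worker = 100u32; // blobs one worker can fit per block
    let total_blobs = 950;            // 95% of aggregate capacity
    let mut rng = rand::thread_rng();

    // Assign each blob to a worker chosen uniformly at random.
    let mut load = vec![0u32; workers];
    for _ in 0..total_blobs {
        load[rng.gen_range(0..workers)] += 1;
    }

    // Count workers that received more blobs than they can include.
    let congested = load.iter().filter(|&&l| l > capacity_per_worker).count();
    println!("loads = {:?}, congested workers = {}", load, congested);
}
```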

There is a possible set of approaches where each blob can actually land on any worker chain, but only a single one per relay-parent, and only if it has not been included on any other worker chain already...but I'm not sure exactly how this would look. Punting...we can decouple the logic of how blobs land on worker chains and are paid for from the multi-worker and aggregation process itself.

It seems to me that the end-users are supposed to store the blobs long term if they are interested in it. At least rollup full nodes will have to store their data to provide other nodes with ability to sync from genesis.

Yes, absolutely. The nice-to-have point (8) is more about the initial fetching of the blobs by rollup full nodes, by giving them fast paths which don't involve hitting the Polkadot validator set unless necessary.

noah-foltz commented 11 months ago

Furthermore, we should be comfortable using a centralized architecture in an initial rollout, as long as there is a clear path to a decentralized architecture.

What would an ideal timeline look like? If speed is of the essence, approach 2 with a centralized load balancer between the user and the cores might be a useful path, only forgoing goal (4) in the short term. Then, in a longer-term implementation, build out approach 1's pre-consensus. However, the load balancer might no longer be useful once the pre-consensus process is built out.
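
For what it's worth, the core of such a load balancer could be as simple as routing each blob to the least-loaded worker, given a view of each worker's pending queue. A hypothetical sketch, with names not taken from any codebase:

```rust
/// Pending load per worker, e.g. bytes currently queued in its mempool.
struct WorkerLoad {
    para_id: u32,
    pending_bytes: u64,
}

/// Route a blob to the worker with the smallest pending queue.
/// Returns None if no workers are registered.
fn pick_worker(loads: &[WorkerLoad]) -> Option<u32> {
    loads
        .iter()
        .min_by_key(|w| w.pending_bytes)
        .map(|w| w.para_id)
}
```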