availability-recovery: systemic chunks recovery hotpath

sandreim commented 1 year ago

Currently we randomize the chunks that we fetch and we don't reap any benefits from our erasure coding being systemic (chunks 0..f+1 are the exact copy of the PoV).

We should add another hotpath for doing so. This would probably go to the decoder and make the reconstruct very cheap or otherwise not call it at all, and have the subsystem assemble the PoV.

One concern here is that the we only change the validator indices on every session and these indices map directly to the systemic chunks indices right now. We need to spread the load across all validators so we don't overload the same validators for an entire session. We'd have to change the way the validator indices who have the systemic chunks are mapped. We could make this mapping to depend on something other than just validator index which must be feasible to obtain in all contexts - approval checking and disputes.

CC @burdges for more suggestions on the new mapping

burdges commented 1 year ago

we only change the validator indices on every session and these indices map directly to the systemic chunks indices right now.

We screwed this up then. We should've permuted the map between validator indices and chunk indices based upon the epoch randomness and slot. I'd asked for this originally, and proposed code, but I guess it got omitted.

ordian commented 1 year ago

One concern here is that the we only change the validator indices on every session and these indices map directly to the systemic chunks indices right now

are you sure that's the case? I remember hearing different from @drahnr also you can test that here https://github.com/paritytech/polkadot/blob/f900dede87efcfd91eb5656ed78077ed3877f323/erasure-coding/src/lib.rs#L362

sandreim commented 1 year ago

Looked at the code and ValidatorIndex is used as chunk index.

I am not sure how that test would help regarding the mapping I mentioned.

ordian commented 1 year ago

Yes, but this is not true:

chunks 0..f+1 are the exact copy of the PoV

(or that there's a subset of f+1 chunks that is the exact PoV clone)

sandreim commented 1 year ago

Ok, I see what you mean, but as per discussion with @burdges this was the design and should be true. My assumption was the implementation does that (didn't check)

sandreim commented 1 year ago

@ordian you are correct, systemic chunks are not implemented. Introducing this is a breaking change which might not be worth the effort to do given that it only wins us the reconstruct time. However, we'll still keep it for later if needed.

rphmeier commented 1 year ago

The two breaking changes that can go in at the same time:

Systemic chunks implementation
Permuting validator indices in erasure-coding for each available datum (probably not the relay-parent, as with asynchronous backing + paritytech/polkadot-sdk#607 we may be bundling many different relay parents together on a single availability core)

FWIW when it comes to building fast-paths it'd also be good to have approval checkers help others get the systemic chunks. i.e. the availability recovery protocol should allow fetching multiple chunks from the same peer, and approval checkers / dispute participants should serve systemic chunks for things they've recently validated. Then approval-checkers can just see which other validators have approved the candidate and add them to their set of peers to ask from.

burdges commented 1 year ago

(or that there's a subset of f+1 chunks that is the exact PoV clone)

This should be true.

I wanted the relay chain slot and epoch randomness in this mapping anyways. If we use the block height instead, then we can easily code it to activate at a specific block height, and likely we can remove the dependency upon the epoch randomness.

use fpe::ff1::{FF1, FlexibleNumeralString};

pub enum AvailabilityFF1(Option<FF1>);

impl AvailabilityFF1 {
    fn new(blockheight: u64, num_validators: u16) -> AvailabilityFF1 {
        if blockheight < WHATEVER { return AvailabilityFF1(None); }
        use rand::{SeedableRng, seq::SliceRandom};
        let mut seed = [0u8; 32];
        seed[0..4].copy_from_slice(& blockheight.to_by_bytes());
        let mut rng = rand_chacha::ChaChaRng::from_seed(seed);
        // We could create a map table like this
        // let mut map: Vec<u16> = (0..n).collect();
        // map.as_mut_slice().shuffle(&mut rng);
        // but in theory format preserving encryption might save allocations and cache,
        // although not with its current Vec usage in the fpe crate.
        rng.fill_bytes(&mut seed);
        let ff1 = FF1::<aes::Aes256>::new(&seed, num_validators).unwrap();
        AvailabilityFF1(Some(ff1))
    }

    // FlexibleNumeralString is ridiculously inefficient here, but we could
    // send the fpe crate a PR that impls their NumeralString and
    // Operations for arrayvec::ArrayVec;

    fn map_validator_to_chunk(&self, validator_index: u16) -> u16 {
        let ff1 = match self { Some(ff1) => ff1, None => return validator_index, }
        let pt: FlexibleNumeralString = vec![validator_index; 1];
        ff1.encrypt(&[], &::from_bytes_le(&pt)).unwrap()[0]
    }

    fn map_chunk_to_validator(&self, chunk_index: u16) -> u16 {
        let ff1 = match self { Some(ff1) => ff1, None => return chunk_index, }
        let ct: FlexibleNumeralString = vec![chunk_index; 1];
        ff1.decrypt(&[], &::from_bytes_le(&ct)).unwrap()[0]
    }
}

We do not afaik care that adversaries can place their own validators in systemic positions for specific blocks.

Edit: Asked https://crypto.stackexchange.com/questions/107291/very-small-domains-in-ff1

Update: I think https://crypto.stackexchange.com/a/107432/764 provides a nice shuffle which suffices for us.

sandreim commented 1 year ago

FWIW, we can still use the hot-path of fetching from backers at least to some higher upper limit than now to achieve the same cost reduction as systemic chunks. Once this and SIMD improvements land we can reduce the pressure on the backers, but otherwise based on the Glutton testing done so far, the backers seem to be able to sustain the load.

burdges commented 1 year ago

Yeah, you can fetch from them even if they're not randomized, but it risks overloading them of course, hence the randomization.

sandreim commented 1 year ago

Currently, the backing group is shuffled before we attempt fetching the available data.

burdges commented 1 year ago

oops, sorry I miss-read your comment above. Yeah I think you can fetch the systemic chunks by index too, if you try harder to get them from backers.

We can also just put a shuffle into the indices, like was originally supposed to be there.

alindima commented 1 year ago

An interesting topic that @sandreim brought up is how can we make sure that all validators switch to the new "validator shuffle" all at once. If they don't, honest validators looking to retrieve systematic chunks could get valid chunks that are not systematic (and I don't know a way to cheaply check that a chunk is systematic without having all of them and reconstructing, is there?).

We can code it in such a way that getting a valid non-systematic chunk is not fatal, because we'd fall back to reconstructing from regular chunks (I'm working on a POC that does this), but this would also defeat the purpose of the hot-path if it's happening often enough.

What would be the best way to handle this upgrade? Andrei suggested adding a field in the HostConfiguration that enables this and an associated runtime API for querying it. If the field is true, perform availability distribution and recovery according to the new shuffling algorithm. This would allow us to enable the feature only once sufficient validator nodes have been upgraded to the new client version and would enable the feature all at once, via a runtime upgrade.

We would still keep the fallback to regular chunk recovery, in case some validator with old client code comes up or an alternative client implementation is used (like gossamer).

alindima commented 1 year ago

Another topic I'm curious about is the algorithm used for shuffling the validator indices.

@burdges you mentioned using the one from https://crypto.stackexchange.com/a/107432/764. I'm curious why not stick to a rand::shuffle. Is it too slow? My guess is that it's also because the underlying implementation of rand::shuffle could change between different versions of the code, which would introduce discrepancies between validators. Is that why?

burdges commented 1 year ago

I think rand::shuffle is fine.

It'll need 2 num_validators bytes of output, which costs 10-15 cycles per byte, based upon page 2 of https://cr.yp.to/chacha/chacha-20080128.pdf so then 20-30 k cycles per schuffle. It's not ideal if you recompute the shuffle every time you do a lookup, but it's fewer cycles than decrypting the chunk, so not too bad even there.

You can cache the shuffle, but I wanted to see if there was something completely trivial that avoided even caching the shuffle.

burdges commented 1 year ago

Actually ChaCha8 should be fine by https://github.com/rust-random/rand/issues/932 so roughly..

pub struct ValidatorShuffle(Vec<u16>);

impl ValidatorShuffle {
    fn new(num_validators: u16, seed: [u8; 32]) -> ValidatorShuffle {
        use rand::{SeedableRng, Rng};

        let mut indices: Vec<i32> = (1..num_validators).collect();
        let mut rng = rand_chacha::ChaCha8Rng::from_seed(seed);

        // Reimplement rand::seq::SliceRandom::shuffle for 2x the speed
        for i in (1..(self.len() as u16)).rev() {
            // invariant: elements with index > i have been locked in place.
            let j = rng.gen_range(0u16..(i+1))
            self.swap(i as usize, j as usize);
        }
        ValidatorShuffle(indices)
    }
}

ordian commented 1 year ago

What would be the best way to handle this upgrade? Andrei suggested adding a field in the HostConfiguration that enables this and an associated runtime API for querying it. If the field is true, perform availability distribution and recovery according to the new shuffling algorithm.

This field would waste space in the config and require a migration for no reason. An alternative would be to use runtime API versioning mechanism and add a new method which returns nothing, but only enabled in the new version https://github.com/paritytech/polkadot-sdk/blob/c5f7abe27f24e67e33c516b9ddd5d91812451c37/polkadot/primitives/src/runtime_api.rs#L129

When it comes to shuffling, we already do this in the runtime: https://github.com/paritytech/polkadot-sdk/blob/c5f7abe27f24e67e33c516b9ddd5d91812451c37/polkadot/runtime/parachains/src/shared.rs#L192-L199

Although this happens once per session. But even when doing this every block, I doubt the performance hit is of any relevance.

sandreim commented 1 year ago

What would be the best way to handle this upgrade? Andrei suggested adding a field in the HostConfiguration that enables this and an associated runtime API for querying it. If the field is true, perform availability distribution and recovery according to the new shuffling algorithm.

This field would waste space in the config and require a migration for no reason. An alternative would be to use runtime API versioning mechanism and add a new method which returns nothing, but only enabled in the new version

Yes there would be a migration, but not really useless if we adding a configuration parameter for future features. It should be a bitfield which bits represent an enabled feature/improvement. It should work for tranche0 v2 assignments and approval coalescing as well as system chunks.

But yeah, we could do the same with a runtime_api that returns this bitfield.

alindima commented 1 year ago

I think rand::shuffle is fine.

Thanks for the clarification.

What about this? My guess is that it's also because the underlying implementation of rand::shuffle could change between different versions of the code, which would introduce discrepancies between validators Do you think it's a valid concern? I guess it'll be solved if we use ChaCha8 directly and do the shuffling ourselves

alindima commented 1 year ago

It should be a bitfield which bits represent an enabled feature/improvement. It should work for tranche0 v2 assignments and approval coalescing as well as system chunks.

I like this idea.

But yeah, we could do the same with a runtime_api that returns this bitfield.

The small caveat of not persisting the bitfield in the state is that we won't be able to use sudo or governance to set them without a runtime code upgrade. I think it's helpful to have this option, even if it's only for easier debugging

ordian commented 1 year ago

Since systematic chunks happy path is easily broken by 1 offline validator/missing chunk, what if we require every validator to store 1 additional systematic chunk along with their assigned chunk? In total there would be 3x systematic chunks stored at the cost of 6x blowup of the data instead of 3x, but the benefit is we can tolerate more offline validators for happy path.

Whether it's worth will depend on how much can the reed-solomon recovery impl be optimized.

burdges commented 1 year ago

it'll be solved if we use ChaCha8 directly and do the shuffling ourselves

It's just the for loop I wrote above to get a 2x speed up. sure :)

Since systematic chunks happy path is easily broken by 1 offline validator/missing chunk

Only if backers are malicious. I'd assume systematic chunks often come from backers, almost like super-seeding in bittorrent.

what if we require every validator to store 1 additional systematic chunk along with their assigned chunk?

At present our bandwidth comes from:

collators sending candidates to the k-of-n backers,
backers sending out the chunks, and
approval checkers fetching the blocks.

We therefore have the cost n + 3 + target_checkers + sigma < 40, multiplied by the candidate size. so you'd add 1 to this coefficient, maybe like 2.5% overhead. It's not bad, but also we already have k+1 to n+1 nodes who have the systemic chunks, to which you'll only add 1.

It imho more important to fetch systematic chunks from other approval checkers, because this adds more than 1, with zero extra cost, and works better in adversarial scenarios where some approval checkers do reconstruction. It's higher latency than what you propose, but that's fine maybe?

At the extreme approval checkers could share chunks even before doing their approval, but then we've basically reinvented bittorrent.

alindima commented 1 year ago

I posted a PR with the implementation: https://github.com/paritytech/polkadot-sdk/pull/1644. I have tested it locally with zombienet and it works.

I haven't yet measured the performance in versi and didn't add extensive tests, but I'm looking for some feedback.

alindima commented 1 year ago

The PR I linked above implements systematic chunk recovery where we spawn parallel requests to get systematic chunks from the validators that should be holding those chunks. This will alleviate the CPU cost of decoding the available data (which is quite high for large POVs).

The availability-recovery process would pretty much look like this after the PR is merged:

if we estimate that the POV size is smaller than our SMALL_POV_LIMIT (currently hardcoded to 128Kib, wondering how we reached this value), we try to retrieve the full POV from the backers we're connected to, one by one.
otherwise, attempt recovering the systematic chunks in parallel (with a limit on how many parallel requests to spawn, currently 50), from the n/3 validators that should hold them.
if we didn't get all the systematic chunks we needed, request the remaining number of regular chunks.

If at any point in the process, one step fails, we fall back to the next one (the recovery is not just bailed because all backers are unresponsive or because we got a wrong systematic chunks, we fall back to the next strategy).

Idea for a next step

My intuition (and please weigh in with yours) is that another area that needs improvement is amortizing the cost of issuing network requests by batching systematic chunk requests somehow (meaning that we don't want to request all the systematic chunks from different validators, unless we really have no other option). From a brief look over the grafana dashboard of kusama validators, the average chunk request duration is of about 300ms. Considering that we are not requesting all chunks in parallel (we only do 50 at a time, presumably to not starve other subsystems), the total waiting time would be around 300ms * (n_validators / 3 / 50), which amounts to about 2 seconds for 1000 validators.

I propose adding a new fast-path strategy that, for large POVs does the following:

The strategy will operate on the set of validators that should hold the entire data (backers and approval checkers, for when the recovery is being done for later tranches or disputes). The respective approval checker set can be passed to the availability recovery subsystem as a field on the AvailabilityRecoveryMessage::RecoverAvailableData message.
It splits the estimated pov size into equally-sized sequences of systematic chunks. The sequence size can be chosen so that the requested data is roughly equal to SMALL_POV_LIMIT, which we can tweak based on measurements to get the best performance.
Launches parallel requests to all validators in the set to get the chunks in their assigned sequence (the desired chunks will be encoded as a bitfield in the request message). The validator will respond with the chunks that were requested (if it holds them). Additionally, it will send back a bitfield of the systematic chunks it holds (which can be used in a later step, if some other validator did not respond with the data it was supposed to).
Record the missing chunks and request them from the other validators that attested holding those chunks. Do this until the recovery is complete, by jumping back to step 3. If at any point a validator attested that they own a chunk but did not send it, remove them from the validator set. Also remove any validators that were unreachable.

This strategy assumes that most validators are honest, but will fall back to requesting the missing systematic/regular chunks one by one from all validators if it's not able to recover or detects dishonest behaviour. I think it builds nicely on the systematic chunks recovery and will enable us to leverage backers/approval-checkers that hold the data (for large POV sizes as well; we're now only leveraging backers and only for small PoVs).

WDYT?

eskimor commented 1 year ago

Whether 50 is a good limit is indeed not clear. To me it was clear though, that there should be some limit. I would expect performance to degrade with too much parallelism: TCP streams will be in competition with each other. In other words: If you request from 50 validators, they have 50 times the bandwidth you have, therefore I would expect congestion, causing lost packages, retransmissions, ... The other concern is of course latency, which justifies higher parallelism. We could arrive at better values with extensive testing/benchmarking. It should be worthwhile to monitor how the chunk fetch duration varies with different numbers for parallel requests. I doubt it is a constant, but seeing how it behaves should give good indications on what might be better strategies.

The other concern I would like to deposit here: With all optimizations we do here, we should try our best to even out load among validators. In fact, these optimizations might lead us to be able to actually distribute load even better than we do now, as randomized fetching does not result in equally distributed load at all. Ideally we would have deterministic distribution of systemic chunks, so that for each candidate in a block a different set of validators will be having systemic chunks.

Haven't had a closer look at the systemic chunk PR yet, but superficially it looks good - great work. We should talk with @burdges whether randomized shuffling is necessary or we could do load distribution optimized deterministic shuffling.

alindima commented 1 year ago

Whether 50 is a good limit is indeed not clear. To me it was clear though, that there should be some limit. I would expect performance to degrade with too much parallelism: TCP streams will be in competition with each other. In other words: If you request from 50 validators, they have 50 times the bandwidth you have, therefore I would expect congestion, causing lost packages, retransmissions, ... The other concern is of course latency, which justifies higher parallelism. We could arrive at better values with extensive testing/benchmarking. It should be worthwhile to monitor how the chunk fetch duration varies with different numbers for parallel requests. I doubt it is a constant, but seeing how it behaves should give good indications on what might be better strategies.

I agree that there should be some limit 👍🏻 I also agree that benchmarking this value as well as the appropriate SMALL_POV_LIMIT is worthwhile. Do you know how the latter was determined/guesstimated?

The other concern I would like to deposit here: With all optimizations we do here, we should try our best to even out load among validators. In fact, these optimizations might lead us to be able to actually distribute load even better than we do now, as randomized fetching does not result in equally distributed load at all. Ideally we would have deterministic distribution of systemic chunks, so that for each candidate in a block a different set of validators will be having systemic chunks.

With my PR, we're not only randomizing the order in which we fetch the 1..n/3 systematic chunks. Obviously, this yields no load distribution within a session if we're always querying the first n/3 validators. Instead, I deterministically shuffle the ChunkIndex -> ValidatorIndex mapping once every relay block number (using the block number as the seed for the shuffle). With this, the validators that hold the systematic chunks will be chosen at random for every relay parent (but they'll be the same for different candidates/paras within the same relay parent block or slot).

burdges commented 1 year ago

We should talk with @burdges whether randomized shuffling is necessary or we could do load distribution optimized deterministic shuffling.

We do not know which validators are lazy so there should be some deterministic random shuffle of validators, but yes the shuffle need not depend upon the parachains.

We've more questions:

Should the deterministic random shuffle depend upon the relay chain slot number? We should shift around more directly within some fixed shuffle. We should not just have three start positions, 0, n/3, and 2n/3 in a shuffle not dependent upon slots, because then one bad validator in each segment slows everything down.
If no, then should the relay chain block producer influence the systematic chunk positions? Aka should the mapping be purely a function of core, relay chain block height, and epoch? Or should we spread out the cores evenly based upon what gets included in that block?
If we'd information about validator laziness then weighted shuffles become possible here. We do not want the weights to be too strong of course.

I'd honestly punt all these questions: Our primary problem is basically the bittorrent problem. We know both the chunk holder and the backers have our desired systematic chunk. How do we optimally fetch exactly the same data from one of 2-6 sources?

I'd propose:

We seed a ChaCha8Rng with epoch randomness and relay chain slot number. We shuffle the validator indices based upon this using a full shuffle, like rand::seq::SliceRandom::shuffle but u16s. We next sample core number start positions randomly without replacement using a partial shuffle, again like rand::seq::SliceRandom::partial_shuffle but using u16s. This is not much code, just two loops, plus all the plumbing to access these maps. Also this assumes we've more validators than cores, but that's probably okay for quite a while.

We then spend most of the development time on the bittorrent problem of quickly choosing when we fetch the systematic chunks from backers or chunk holders. I'd expect bittorrent does multiple fetches but then tells them to stop sending.

We can afford some extra bandwidth here of course, so if you've fetched a few non-systematic chunks already then you also need to choose when you abandon the systematic ones.

ordian commented 1 year ago

I deterministically shuffle the ChunkIndex -> ValidatorIndex mapping once every relay block number (using the block number as the seed for the shuffle).

Regarding using only block numbers for the randomness: we could have two blocks at the same height but in different sessions (e.g. after a reversion). So we should probably take SessionIndex into account to ensure that ValidatorIndex corresponds to the proper session, no?

I like the idea of batching requests. In the end we end with a design similar to Tiramisu DA (section 1.4.2). They add a CDN layer as a fast-path cache for requesting data, whereas we have that naturally with backers/approval-voters.

We could already add backers for fetching missing systematic chunks as a fallback already since they should have all.

eskimor commented 1 year ago

I agree that there should be some limit 👍🏻 I also agree that benchmarking this value as well as the appropriate SMALL_POV_LIMIT is worthwhile. Do you know how the latter was determined/guesstimated?

Guestimate by @sandreim .. had good results with that value. Not sure if different values have been tried.

alindima commented 1 year ago

We seed a ChaCha8Rng with epoch randomness and relay chain slot number. We shuffle the validator indices based upon this using a full shuffle, like rand::seq::SliceRandom::shuffle but u16s. We next sample core number start positions randomly without replacement using a partial shuffle, again like rand::seq::SliceRandom::partial_shuffle but using u16s. This is not much code, just two loops, plus all the plumbing to access these maps. Also this assumes we've more validators than cores, but that's probably okay for quite a while.

This makes sense. I'm working on implementing it 👍🏻 @ordian proposed the same algorithm 😄

I like the idea of batching requests. In the end we end with a design similar to Tiramisu DA (section 1.4.2). They add a CDN layer as a fast-path cache for requesting data, whereas we have that naturally with backers/approval-voters.

This sounds incredibly similar to what we have/plan, minus all the dessert names

We could already add backers for fetching missing systematic chunks as a fallback already since they should have all.

Yes, that's true and a good improvement idea

Regarding using only block numbers for the randomness: we could have two blocks at the same height but in different sessions (e.g. after a reversion). So we should probably take SessionIndex into account to ensure that ValidatorIndex corresponds to the proper session, no?

hmm, as far as I understand, this would only mean reusing the same validators for the systematic chunks recovery in two different sessions at the same block height, in adversarial (and currently very rare) scenarios.

However, I'll modify my PR to derive the seed from the babe epoch randomness combined with the block height, as purely monotonical seeds (such as block numbers) are not good a good idea for PRNGs anyway (from what I read)

sandreim commented 1 year ago

I agree that there should be some limit 👍🏻 I also agree that benchmarking this value as well as the appropriate SMALL_POV_LIMIT is worthwhile. Do you know how the latter was determined/guesstimated?

Guestimate by @sandreim .. had good results with that value. Not sure if different values have been tried.

I did this some time ago and I consider the value is large enough such that on Polkadot or Kusama, most PoVs are fetched from the backing group.

Based on experiments with gluttons I propose we should bump it to 2.5MiB which is half the max PoV size. Actually It's very simple to reason about the value, it is just used to determine how much pressure we want to put on the backers vs the strategy of choice for retrieving the PoVs. In the context of this PR, it makes even more sense to have a higher value especially for tranches > 0.

FWIW, I like the idea of being able to batch and request multiple chunks at once, in terms of latency this would be far better than current approach of getting everything from a single guy happy path.

Regarding shuffling, I don't really think we need a new shuffle. Just for the purpose of load balancing we can derive chunk holders from validator shuffling we already do. Assuming it is not a secret who the systemic chunk holders are at any given point in time, why doesn't a simple solution work? Something like the first K validator indices starting from (N+i) % n_validators hold system chunks at a given relay block, where i is candidate inclusion index, N is the relay chain block number modulo.

alindima commented 1 year ago

Based on experiments with gluttons I propose we should bump it to 2.5MiB which is half the max PoV size. Actually It's very simple to reason about the value, it is just used to determine how much pressure we want to put on the backers vs the strategy of choice for retrieving the PoVs. In the context of this PR, it makes even more sense to have a higher value especially for tranches > 0.

FWIW, I like the idea of being able to batch and request multiple chunks at once, in terms of latency this would be far better than current approach of getting everything from a single guy happy path.

We could increase this limit until we implement batching of requests. After that is implemented, I expect that it'll be worth querying a single validator only for the smallest of POVs.

The way I view it, this limit would be chosen by finding the sweet spot where it's worth trading off latency for throughput. If we're requesting the full POV from one validator, we're relying only on a single connection throughput. If we're requesting disjoint pieces of the POV from K validators, we trade off the added latency of sending requests to multiple peers, for a factor of K throughput multiplication. I think that after we implement said batching, we can measure this limit in practice and see the exact point where it becomes more beneficial to request from multiple validators at the same time.

Regarding shuffling, I don't really think we need a new shuffle. Just for the purpose of load balancing we can derive chunk holders from validator shuffling we already do. Assuming it is not a secret who the systemic chunk holders are at any given point in time, why doesn't a simple solution work? Something like the first K validator indices starting from (N+i) % n_validators hold system chunks at a given relay block, where i is candidate inclusion index, N is the relay chain block number modulo.

availability-distribution and bitfield-signing don't have access to the candidate inclusion index. That could be replaced by the core index. I think the formula you propose would have very large overlaps and wouldn't distribute load that well.

sandreim commented 1 year ago

I think the formula you propose would have very large overlaps and wouldn't distribute load that well.

Yeah, it doesn't spread it well per block, but over multiple blocks the load basically is cycled. To make it spread better we could do instead (N+i*c) % n_validators where c is n_validators / n_occupied_cores

alindima commented 1 year ago

It needs to be easily computable in approval-voting, dispute-coordinator and cumulus pov-recovery (the three actors that can trigger availability-recovery). Getting the occupied core index of the candidate is quite complicated in dispute-coordinator (we'd either have to change the network protocol to include this index in the message, do some chain scraping or record this data in availability-recovery; all options seem quite overly complicated to me).

I think a better option would be to use the paraid, which is readily available in the candidate receipt. To avoid even querying the runtime for the paraid index, we could seed another Chacha8Rng with the paraid value and just generate one random number to get the para-specific offset into the larger shuffle

alindima commented 12 months ago

I've measured performance on Versi with my current implementation, as in: https://github.com/paritytech/polkadot-sdk/pull/1644, and here are the numbers:

Network params:

50 validators
5 max validators per core
10 glutton parachains, creating POVs of 2.5 mb
Async backing enabled

95 percentile for duration of full POV recovery (including network IO and erasure decoding/coding)

Full recovery from backers (only one backer is used to retrieve the full, unencoded POV): 0.507 seconds
Regular chunks recovery (as is currently being done in polkadot for large POVs): 1.18 seconds
Recovery from systematic chunks: 0.651 seconds

Here's a heatmap that visualizes the difference (first portion is systematic recovery, second portion is regular chunk recovery):

Conclusion: regular chunk recovery takes 80% more time than systematic recovery. The significant difference here is that systematic recovery doesn't do the decoding step, but still does the reencoding for verification purposes. This means that RS decoding step accounts for 80% of the total POV recovery. Another interesting thing to note is that recovery from just one backer is still 22% faster than systematic recovery, even for large POVs. While doing recovery just from the backing group would not distribute load well, we can think of more sophisticated strategies that try requesting batched systematic chunks from backers and switch to non-batched recovery from all validators if the backing group is overloaded. However, it's a pretty small difference IMO (150 ms) that may not justify the added complexity. I propose we take a pragmatic approach and first see if av-recovery is still a bottleneck after systematic recovery.

Average CPU utilisation for `erasure-task`

Erasure-tasks are responsible for doing the erasure coding/decoding while recovering availability data.

CPU consumption for erasure task is halved (from 20% to 10% when doing systematic recovery).

The first bump corresponds to systematic recovery. The break is network downtime while upgrading the nodes. The second bump is regular chunk recovery.

IMO this is as good as we can get with CPU consumption, in terms of recovery strategies. The only improvements here can come by switching the erasure coding algorithm or implementation. This is because we still encode the full data after we receive it, to reconstruct the erasure trie and check the erasure root.

paritytech / polkadot-sdk