sandreim opened 1 year ago
related to https://github.com/paritytech/polkadot-sdk/issues/968 https://github.com/paritytech/polkadot-sdk/issues/598
We should be able to write fairly robust netcode which lets approval checkers / dispute participants fetch the data while working together, i.e. participate in a torrent-style distribution with anyone else who has or is currently fetching the PoV.
A prerequisite for the implementation is that approval checkers must store the AvailableData in the av-store.
Having approval checkers keep an in-memory cache of the systematic chunks of every unfinalized PoV they've been responsible for seems pretty reasonable. Except they'd have to spill it to disk, or start throwing away chunks randomly, once the cache starts to get overfull. Seems OK to reserve 200-500MiB for this purpose.
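e.g. a rough sketch of such a bounded cache (the Chunk/CandidateHash aliases and the insertion-order eviction are placeholders, not the real av-store types or a final policy):

use std::collections::{HashMap, VecDeque};

// Hypothetical stand-ins for the real node-side types.
type CandidateHash = [u8; 32];
type Chunk = Vec<u8>;

/// Keeps systematic chunks of unfinalized PoVs in memory, bounded by a byte budget.
/// When the budget is exceeded, the oldest candidate's chunks are evicted
/// (in reality they would be spilled to disk or dropped randomly, as discussed above).
struct SystematicChunkCache {
    budget_bytes: usize,
    used_bytes: usize,
    // Insertion order doubles as a very simple eviction order.
    order: VecDeque<CandidateHash>,
    chunks: HashMap<CandidateHash, Vec<Chunk>>,
}

impl SystematicChunkCache {
    fn new(budget_bytes: usize) -> Self {
        Self { budget_bytes, used_bytes: 0, order: VecDeque::new(), chunks: HashMap::new() }
    }

    fn insert(&mut self, candidate: CandidateHash, chunks: Vec<Chunk>) {
        self.used_bytes += chunks.iter().map(|c| c.len()).sum::<usize>();
        self.order.push_back(candidate);
        self.chunks.insert(candidate, chunks);
        // Evict the oldest candidates until we are back under budget.
        while self.used_bytes > self.budget_bytes {
            let Some(old) = self.order.pop_front() else { break };
            if let Some(evicted) = self.chunks.remove(&old) {
                self.used_bytes -= evicted.iter().map(|c| c.len()).sum::<usize>();
            }
        }
    }
}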
This logic should live in the availability recovery subsystem, where we can request information about other peers which have the PoV or are likely recovering it simultaneously from other subsystems.
Another approach that might be interesting is for nodes to pick a couple of peers which have the full PoV or parts of it, and ask those nodes to stream fountain-coded chunks generated using the soliton distribution until it asks them to stop (i.e. it has received the full PoV).
e.g.
enum DataStreamingMessage {
    // A request, probably should be req/res with a negotiation.
    RequestChunks {
        candidate_hash: CandidateHash,
        bitrate: u32, // requested bits/s
    },
    // Inform a peer to reduce the rate of streaming or stop entirely.
    ReduceRate { candidate_hash: CandidateHash, new_rate: Option<u32> },
    // Chunks XORed together.
    Chunk {
        // Which chunks are XORed together here, plus merkle proofs the first time they are sent.
        chunk_indices: Vec<(u16, Option<MerkleProof>)>,
        lt_coded: Vec<u8>,
    },
    // Other messages for negotiating higher bitrates, etc.
}
Nodes would need logic to eventually stop providing chunks even if not asked, and to sanity-check the messages they receive. They could tend to begin by sending the systematic chunks.
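e.g. a rough sketch of that provider-side cutoff (the limits and the StreamState shape are placeholders, not proposed values):

use std::time::{Duration, Instant};

/// Provider-side state for one peer streaming one candidate.
/// The provider stops on its own even if the peer never sends ReduceRate.
struct StreamState {
    started: Instant,
    bytes_sent: usize,
    systematic_remaining: usize,
}

// Placeholder limits; real values would be derived from PoV size and the negotiated bitrate.
const MAX_STREAM_DURATION: Duration = Duration::from_secs(30);
const MAX_OVERSHOOT_FACTOR: usize = 3;

impl StreamState {
    /// Begin with systematic chunks, then switch to coded ones.
    fn next_is_systematic(&self) -> bool {
        self.systematic_remaining > 0
    }

    /// Stop streaming once we have sent several times the PoV size or run too long,
    /// whichever comes first, even if the peer never asked us to stop.
    fn should_stop(&self, pov_size: usize) -> bool {
        self.started.elapsed() > MAX_STREAM_DURATION
            || self.bytes_sent > pov_size * MAX_OVERSHOOT_FACTOR
    }
}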
This same code should be reusable within the backing phase for requesting from the collator as well as other backers.
cc @burdges
We do not require all chunks here, only the systematic ones, aka only the block itself plus a tiny bit of cruft.
We should recover systematic chunks when available since those avoid the decoding step. Although systematic chunks never avoid the reencoding step for approval checkers, there are other network participants like collators who might reconstruct, and who do not require reencoding, so they benefit massively if our code favors the systematic chunks optimization.
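As a rough sketch of why the decoding step disappears (with decode standing in for the real Reed-Solomon reconstruction, and ignoring trailing-padding details):

/// With a systematic code the first `k` chunks are the data itself, so if all of
/// them arrive we can skip decoding entirely and just concatenate.
fn recover(
    k: usize,
    chunks: &[Option<Vec<u8>>], // indexed by chunk index, None = not received
    decode: impl Fn(&[Option<Vec<u8>>]) -> Vec<u8>,
) -> Vec<u8> {
    if chunks.iter().take(k).all(|c| c.is_some()) {
        // Fast path: all systematic chunks present, no decoding step.
        chunks.iter().take(k).flat_map(|c| c.clone().unwrap()).collect()
    } else {
        // Slow path: full reconstruction from any k chunks.
        decode(chunks)
    }
}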
We'd some idea that approval checkers should not trash parablocks until after finalization anyways, since they're vulnerable in disputes. It's true, they could recover the block from the network, but saving it gives peace of mind or something. If this were true, then we've no additional storage requirements here.
We do not use fountain-codes anywhere outside Vault QR codes right now, just Reed-Solomon.
I do love this optimization of using approval checkers since it improves performance during disputes, assuming disputing nodes serve chunks too, but..
We should probably do this after the larger systematic chunks optimization, in part because it'll be dependent upon the systematic chunks, but also because systematic chunks might give a larger optimization: This cannot cover very low tranche numbers, with how low depending upon load.
@burdges For context, we're mostly discussing fast paths for getting data to approval checkers quickly. Fast paths are definitely useful, but we do need good mechanisms for backpressuring when the fast paths degrade and the network's throughput declines. The simplest and most effective backpressure would be for the relay chain to limit the incoming workload of new candidates based on finality lag, using finality proofs posted to the relay chain.
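e.g. a rough sketch of what such a limit could look like (the thresholds are purely illustrative, not proposed values):

/// Sketch of finality-lag based backpressure: the relay chain scales down how many
/// new candidates it accepts per block as the gap between the best block and the
/// last finalized block grows.
fn max_new_candidates(best_number: u32, finalized_number: u32, nominal: u32) -> u32 {
    let lag = best_number.saturating_sub(finalized_number);
    match lag {
        0..=10 => nominal,      // healthy: accept the full workload
        11..=50 => nominal / 2, // degraded: halve the intake
        _ => 0,                 // severe lag: stop accepting new candidates
    }
}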
Even though systematic chunks are a step forward, we still would rather not fetch each chunk from a separate peer in the optimistic case.
Fountain codes for requesting the full data from multiple peers which have it (backers, for now, but possibly dedicated availability guarantors later, as well as other peers fetching the data or who have already checked in approvals/disputes) are basically what I'm proposing above.
And streaming of chunks in this way is definitely better done over UDP than TCP, though that'd require quite a lot of digging in the network code.
The simplest and most effective backpressure would be for the relay chain to limit the incoming workload of new candidates based on finality lag, using finality proofs posted to the relay chain.
Yes
Even though systematic chunks are a step forward, we still would rather not fetch each chunk from a separate peer in the optimistic case.
We'll do this just fine using systematic chunks: Alice can fetch chunks 1 through k from Bob and chunks k through n from Carol. BitTorrent improves latency by fetching disjoint chunks from multiple peers, over both TCP and UDP, but never uses erasure coding.
In principle, we'll improve upon BitTorrent, in that when you finally give up on someone giving you the last few systematic chunks, then you only need the missing number of non-systematic chunks.
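A rough sketch of that range-splitting (0-based indices, peer names purely illustrative):

/// Split the `k` systematic chunk indices into disjoint ranges, one per peer,
/// in the spirit of "chunks 1 through k from Bob and chunks k through n from Carol".
/// If a peer fails to deliver its range, we only need that many more chunks
/// (systematic or not) from elsewhere to reconstruct.
fn split_systematic_ranges(k: u16, peers: &[&str]) -> Vec<(String, std::ops::Range<u16>)> {
    let n_peers = peers.len().max(1) as u16;
    let per_peer = (k + n_peers - 1) / n_peers; // ceiling division
    peers
        .iter()
        .enumerate()
        .map(|(i, p)| {
            let start = i as u16 * per_peer;
            let end = (start + per_peer).min(k);
            (p.to_string(), start..end)
        })
        .collect()
}

// e.g. split_systematic_ranges(10, &["bob", "carol"]) yields bob: 0..5, carol: 5..10.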
Fountain codes for requesting the full data from multiple peers which have it (backers, for now, but possibly dedicated availability guarantors later, as well as other peers fetching the data or who have already checked in approvals/disputes) are basically what I'm proposing above.
I'd avoid fountain codes here. Afaik, they always have a poor undecodable ratio, which creates a DoS vector. Alice waits for ages while Eve streams her specially selected garbage. And almost everything Alice downloaded from Eve winds up being useless.
I originally proposed the fountain codes for Vault QR codes because you could avoid acknowledgements by using fountain codes, which matters for Vault QR codes. Yes, fountain codes are fast enough for interesting protocols over radio, and likely UDP, but..
I doubt fountain codes help us the way you describe..
And streaming of chunks in this way is definitely better done over UDP than TCP, though that'd require quite a lot of digging in the network code.
We do not use UDP ourselves, only TCP. We've failed so far to make QUIC as fast as TCP, because doing so is a major engineering effort, for which we do not hire enough networking talent.
We've acknowledgements in the TCP layer so I'm dubious that fountain codes could improve upon simply transmitting the data.
At present, there are no fountain codes in either QUIC or BitTorrent, so maybe UDP already winds up being reliable enough so that fountain codes never help? Or more precisely, there are fountain codes already being used, but in the physical internet hardware, so another layer does not really help much.
Anyways..
We should imho focus upon doing systematic chunks well.
After systematic chunks work, we should've some C person create a polkadot fork based upon libtorrent, which employs its prioritization features for the systematic chunks optimization. This is likely crazy fast. If so, we can choose how we proceed from there.
As some asides..
We could've an infrastructure prize for making a Polkadot testnet run well on QUIC, provided we could explain what we demand of its decentralization. It's hard to have a decentralized testnet, so not really sure how to write this prize.
I do think fountain codes might improve BitTorrent, but only within a mildly adversarial corner of BitTorrent's usually non-adversarial threat model. We could've a treasury RFP for outside academics to research using fountain codes within two threat models: BitTorrent with fake seeders, and roughly what you describe for Polkadot. This requires carefully analyzing the undecodable ratio attacks. In general, the adversary solves an NP hard problem here, so there is hope, but the adversary could employ approximation algorithms, maybe influence randomness, etc, so this is not easy.
I'd avoid fountain codes here. Afaik, they always have a poor undecodable ratio, which creates a DoS vector. Alice waits for ages while Eve streams her specially selected garbage. And almost everything Alice downloaded from Eve winds up being useless.
In my mind that's the purpose of this being a 'fast path'. Based on the probability of recovery after receiving K chunks we should be able to tell if someone is sending garbage and either switch to another peer to provide the data or fetch it single-chunk-wise (slow path). We'd additionally avoid requesting from that peer in the future. We could also easily determine if any single-chunk packets are invalid and cut off the peer as a provider immediately. If the request/response involves a negotiation over a seed for a PRNG used to sample the soliton distribution, we should be able to determine relatively quickly whether a peer is sending us garbage.
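e.g. a minimal sketch of that abort heuristic (the 1.5 overhead factor is a placeholder, not a measured bound):

/// With an ideal soliton distribution we expect to decode after roughly
/// k * (1 + epsilon) received symbols. If a peer has streamed far more than that
/// and we still cannot decode, treat it as garbage, drop the peer and fall back
/// to the chunk-by-chunk slow path.
fn peer_looks_malicious(received_symbols: usize, k: usize, decoded: bool) -> bool {
    !decoded && received_symbols > (k as f64 * 1.5) as usize
}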
Awful lot of complexity there, especially since peer misbehavior could take several forms at the network level. We probably do not correctly detect peer misbehavior now.
It's also unclear this buys us anything, especially since we're already in niche territory here.
We know after receiving one incomplete chunk with the RS systematic optimization. We typically build fountain codes to be systematic too, so initially they behave the same. We then ask: How does the system handle transmission errors?
TCP - No transmission errors, just ACKs, lower level reliability measures (codes), and rarely a dropped connection.
UDP - Some transmission failures, so either you rerequest like bittorrent, or ACK like QUIC, or else you stream extra via fountain codes. I bet bittorrent wins when downloading from many peers, but likely always since they simply have better code than libp2p. We don't have so many approval checkers, but enough that streaming could reach 10% wasted bandwidth in the worst case.
It's likely bittorrent never carefully explored fountain codes, but I'd expect Google explored & rejected that option for QUIC. I'll ask Adam Langley.
Anyways..
All tranches benefit from the RS systematic optimization since each systematic chunk should be held by each backer, one availability provider, and then later approval checkers, so imho we should hammer it down well first.
I generally agree that we should be able to go pretty far just on requesting systematic chunks, as long as we are smart about it.
We can also institute additional availability providers as you mention, where nodes may be selected during the availability phase to hold more than a single systematic chunk for a given candidate, and this mapping should be given deterministically.
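A rough sketch of such a deterministic mapping (DefaultHasher and the rotation scheme are placeholders; a real scheme would need something consensus-safe):

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Deterministically assign extra systematic chunk indices to a validator, so that
/// every node can compute the same mapping locally from the candidate hash without
/// extra coordination. Indices wrap around, so chunks get redundant providers.
fn extra_systematic_chunks(
    candidate_hash: &[u8; 32],
    validator_index: u32,
    n_validators: u32,
    n_systematic: u16,
    chunks_per_provider: u16,
) -> Vec<u16> {
    // Per-candidate rotation so the same validators do not always hold the same indices.
    let mut hasher = DefaultHasher::new();
    candidate_hash.hash(&mut hasher);
    let n = n_validators.max(1);
    let offset = (hasher.finish() % n as u64) as u32;
    let slot = (validator_index + offset) % n;
    (0..chunks_per_provider)
        .map(|i| {
            ((slot * chunks_per_provider as u32 + i as u32) % n_systematic.max(1) as u32) as u16
        })
        .collect()
}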
If you envision data transport as being bittorrent, then polkadot has no trackers right now. All tracker information about who knows what is inferred from the gossip messages.
We should probably expose some real tracker-like interface for nodes who do not follow the chain, like multiple relay chains or between parachains. This creates a new threat model and solution, in which we'd want nodes to unify information from the trackers provided by a few random validators. I've not worked out the details.
Although systematic chunks never avoid the reencoding step for approval checkers, there are other network participants like collators who might reconstruct, and who do not require reencoding, so they benefit massively if our code favors the systematic chunks optimization.
Do collators not require re-encoding after they reconstruct? With the current code, they also reconstruct to check the erasure root, just like all the validators.
I guess it would suffice if they just checked the PoV hash after recovering the data, since they don't have a say in the validators' approval-checking/disputes anyway. Is that right?
We've no reason for collators to do the reencoding check of the chunks merkle root. Approval checkers have already "proved" this correct, like they proved the block correct, from the perspective of later blocks, and to bridges by finality. Yes, collators should check the block hash of course, but not the chunks merkle root.
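A minimal sketch of the difference (hash_of and the types are stand-ins for the real primitives):

/// Validators re-encode the recovered data and verify the chunks merkle root, while a
/// collator that only needs the block can stop after checking the PoV hash: approval
/// checkers already vouched for the erasure root.
fn collator_accepts(
    recovered_pov: &[u8],
    expected_pov_hash: [u8; 32],
    hash_of: impl Fn(&[u8]) -> [u8; 32],
) -> bool {
    // No re-encoding check here, just the block hash.
    hash_of(recovered_pov) == expected_pov_hash
}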
This is yet another hot path for availability recovery that aims to reduce the pressure on the backing group. Instead of always trying the backing group first (or only if below a certain PoV size), we can fetch the whole AvailableData from an approval checker from which we've seen an approval vote on the candidate we intend to check. Doing so gives any of the non-tranche0 checkers more options for where to fetch the full PoV from, instead of just the backing group.
The implementation should append a new variant to the RecoveryStrategy enum and change RequestFromBackers to mix and shuffle both backers and approval checkers.
A prerequisite for the implementation is that approval checkers must store the AvailableData in the av-store. Doing this would inflate the storage of the validators, as the new chunks will persist for 1 day. As this extra availability acts only as a cache and costs some storage, we should have a separate expiration mechanism for the chunks we keep after validation. Something simple like an expiration time of 15 minutes past finality should work fine.
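e.g. a minimal sketch of such an expiry (types and the wall-clock timing are placeholders; a real implementation would more likely key off finalized block numbers):

use std::collections::HashMap;
use std::time::{Duration, Instant};

type CandidateHash = [u8; 32];

/// AvailableData kept only as a recovery cache, stamped when the including block
/// is finalized and pruned 15 minutes later, independently of the normal
/// 1-day availability window.
struct CachedAvailability {
    finalized_at: Option<Instant>,
}

const CACHE_TTL_PAST_FINALITY: Duration = Duration::from_secs(15 * 60);

fn prune(cache: &mut HashMap<CandidateHash, CachedAvailability>, now: Instant) {
    cache.retain(|_, entry| match entry.finalized_at {
        // Not yet finalized: keep, it may still be needed for disputes.
        None => true,
        Some(t) => now.duration_since(t) < CACHE_TTL_PAST_FINALITY,
    });
}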