`approval-voting`: merge tranche 0 assignments #729

sandreim commented 1 year ago

As pointed out in https://github.com/paritytech/polkadot-sdk/issues/731, we can save a lot of CPU and reduce gossip by producing only one RelayVrfModulo signature per authority and deriving multiple assignments from its output, which stacks with batch verification.
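To make the proposal concrete, a hypothetical before/after sketch of the certificate kinds (illustrative types only, not the actual polkadot-sdk definitions):

/// Today, roughly: one certificate, hence one VRF signature to gossip
/// and verify, per RelayVrfModulo sample.
enum AssignmentCertKind {
    RelayVRFModulo { sample: u32 },
    RelayVRFDelay { core_index: u32 },
}

/// Merged: one certificate per (validator, relay parent) whose single
/// VRF output is expanded into all modulo samples, so one signature
/// covers several tranche-zero assignments.
enum MergedAssignmentCertKind {
    RelayVRFModuloCompact { num_samples: u32 },
    RelayVRFDelay { core_index: u32 },
}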

What is not clear, at least to me, is whether we can also merge Delay assignments per tranche for the same signer by signing once for multiple candidates. Depending on the distribution, this could also save CPU on import and reduce network load.

Another question: why would we need authorities to refuse some assignments? Is this meant to control the amount of work they take on?

CC @burdges @eskimor

burdges commented 1 year ago

Importantly, we'll only consolidate or merge the RelayVrfModulo checks this way. RelayVrfDelay checks should each remain independent, but RelayVrfDelay should be only maybe 15% of the checks under normal conditions.

> IIUC in our case we have a single signer, and this would mean we could batch its own RelayVrfDelay assignments for the same tranche (different candidates). Is my understanding correct?

A batch verification does not require the same signer, but it only saves like 40%.
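For illustration, a batched check over many assignments might look roughly like the sketch below. It assumes schnorrkel exposes a batch DLEQ verifier (`dleq_verify_batch`) with roughly this signature; treat the exact API as an assumption, as it varies across versions:

use schnorrkel::{
    vrf::{dleq_verify_batch, VRFInOut, VRFProofBatchable},
    PublicKey, SignatureResult,
};

/// Verify many VRF assignment proofs in one batch. The signers may all
/// differ; batching still amortizes the curve arithmetic, which is where
/// the ~40% saving comes from.
fn verify_assignments_batched(
    inouts: &[VRFInOut],
    proofs: &[VRFProofBatchable],
    signers: &[PublicKey],
) -> SignatureResult<()> {
    // The final flag selects the Kusama-compatible DLEQ transcript
    // (assumed parameter; check the schnorrkel version in use).
    dleq_verify_batch(inouts, proofs, signers, true)
}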

> Can you detail a bit the pros and cons of having RelayVrfModulo represent 85% of assignments in tranche 0?

No cons.

RelayVrfModulo has a nicer distribution than RelayVrfDelay because individual validators have exactly k assignments from RelayVrfModulo, which avoids some validators having too many tranche zero assignments.

We could've easily made RelayVrfModulo be 100% of the tranche zero checks, but never did so. RelayVrfModulo should already be like 95% of the tranche zero checks, assuming we picked parameters correctly, so maybe like 85% of the total checks under normal conditions.


We should probably have each validator send one RelayVrfModulo assignment per relay parent, which includes a bitfield of which assignments they'll take vs reject. Ain't clear if batch verifying these is worthwhile, likely batch verifying grandpa yields more.
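A sketch of what that one-per-relay-parent message could carry; the struct, the field sizes, and the 64-core bound are all hypothetical:

/// One message per (validator, relay parent): a single RelayVrfModulo
/// signature plus a bitfield of which derived assignments the validator
/// actually takes. Hypothetical shape, assuming at most 64 cores.
struct ModuloAssignmentsMessage {
    vrf_signature: [u8; 96], // placeholder size for VRF output + proof
    claimed_cores: u64,      // bit i set => taking the assignment on core i
}

/// Build the bitfield from the derived samples, dropping any cores the
/// validator chooses to reject rather than no-show on.
fn claimed_bitfield(samples: impl Iterator<Item = u16>, rejected: &[u16]) -> u64 {
    let mut bits = 0u64;
    for core in samples.filter(|c| !rejected.contains(c)) {
        debug_assert!(core < 64, "sketch assumes fewer than 64 cores");
        bits |= 1u64 << core;
    }
    bits
}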

As RelayVrfDelay becomes prevalent if many validators go offline, we could optionally batch verify RelayVrfDelay as a premature optimization against our next chain stall, but not sure if this 40% really matters or if we're better off doing other things.

burdges commented 1 year ago

After verifying the VRFInOut in the signature, the output routine maybe looks vaguely like:

use schnorrkel::vrf::VRFInOut;

/// A hard upper bound on num_cores * target_checkers / num_validators
const MAX_MODULO_SAMPLES: usize = 32;

/// Iterates over all cores (parachains) which we check in tranche zero.
fn relay_vrf_modulo(vrf_io: &VRFInOut, num_samples: usize, max_cores: usize) -> impl Iterator<Item = u16> {
    // Expand the one VRF output into two bytes per sample; a single
    // signature thus covers every tranche-zero sample.
    let bytes: [u8; 2 * MAX_MODULO_SAMPLES] = vrf_io.make_bytes(b"RelayVrfModuloIO");
    (0..num_samples.min(MAX_MODULO_SAMPLES))
        .map(move |i| u16::from_le_bytes([bytes[2 * i], bytes[2 * i + 1]]) % (max_cores as u16))
}
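For completeness, producing the VRFInOut and consuming the samples might look like the following; the signing-context label and the parameter values here are illustrative assumptions:

use schnorrkel::{signing_context, Keypair};

fn tranche_zero_assignments(keypair: &Keypair, relay_vrf_story: &[u8]) {
    // One VRF signature per (validator, relay parent).
    let ctx = signing_context(b"RelayVrfModulo"); // illustrative label
    let (vrf_io, _proof, _batchable) = keypair.vrf_sign(ctx.bytes(relay_vrf_story));

    // Every tranche-zero sample comes from that single output.
    for core in relay_vrf_modulo(&vrf_io, 6, 100) {
        println!("tranche-0 assignment for core {core}");
    }
}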

rphmeier commented 1 year ago

Isn't it a problem to derive all RelayVrfDelay assignments from a single VRF output, because publishing one assignment then leaks all the tranches for the other candidates you're assigned to? RelayVrfModulo doesn't have this issue because all of its assignments are tranche zero.

I also don't think that nodes should be allowed to silently "not take" assignments, at least at the moment. If they were, we'd need a way to tell the network which assignments are being taken.

burdges commented 1 year ago

Yes exactly, the RelayVrfModulo assignments could all be derived from one output, but the RelayVrfDelay outputs in different tranches must stay independent.
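To make the independence concrete, the transcripts might be domain-separated like this: each RelayVrfDelay VRF binds the candidate's core, so revealing one output leaks nothing about the rest, while RelayVrfModulo signs one transcript per relay parent. The labels and field names here are illustrative:

use merlin::Transcript;

/// One transcript per (relay parent, core): the VRF outputs are mutually
/// independent, so announcing one delay assignment reveals nothing about
/// your tranches for other candidates.
fn relay_vrf_delay_transcript(relay_vrf_story: &[u8], core_index: u32) -> Transcript {
    let mut t = Transcript::new(b"RelayVrfDelay"); // illustrative label
    t.append_message(b"relay_vrf_story", relay_vrf_story);
    t.append_u64(b"core_index", core_index as u64);
    t
}

/// One transcript per relay parent only: the single output that
/// relay_vrf_modulo expands into all tranche-zero samples.
fn relay_vrf_modulo_transcript(relay_vrf_story: &[u8]) -> Transcript {
    let mut t = Transcript::new(b"RelayVrfModulo"); // illustrative label
    t.append_message(b"relay_vrf_story", relay_vrf_story);
    t
}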

> I also don't think that nodes should be allowed to silently "not take" assignments, at least at the moment.

It's better to reject, or not announce, an assignment than to announce it and then no-show. I've no idea if we'll ever need this, but I made the RelayVrfModulo samples independent purely to give us this option, which now looks like a mistake.