paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.network/

Poor PV performance (missed votes) with v1.12.0 and v1.13.0 #4800

Closed: aperture-sandi closed this issue 2 months ago

aperture-sandi commented 4 months ago


Description of bug

EDITED to add... Further research seems to indicate an overall increase in missed votes for a significant percentage of validators.
See comment below: https://github.com/paritytech/polkadot-sdk/issues/4800#issuecomment-2178910892

Poor PV performance (missed votes) with v1.12.0 and v1.13.0

We have been seeing poor PV performance in the form of missed votes on v1.12.0 and v1.13.0. For our DOT node on v1.11.0 and earlier, our missed vote ratio (MVR) was ~0.3%; on v1.12.0 and newer, our MVR is ~1.2%.

To confirm v1.12.0+ was the issue, we reverted to v1.11.0 for 8+ sessions and saw our MVR return to its previous low values.

We have tested this on Xeon(R) E-2136 CPU @ 3.30GHz and Xeon(R) E-2386G CPU @ 3.50GHz bare-metal servers running NVMe drives. The NVMe drives are connected via a PCIe add-on card for direct PCIe access. Hyperthreading is turned off (see attached benchmarks).

We compile with RUSTFLAGS="-C target-cpu=native" and --profile production: RUSTFLAGS="-C target-cpu=native" ~/.cargo/bin/cargo install --force --locked --path . --profile production

Rust Version:

~$ rustc --version
rustc 1.77.0 (aedd173a2 2024-03-17)

Ubuntu 22.04.3 LTS

See attached image of MVR via TurboFlakes ONE-T report. Live Report here: https://apps.turboflakes.io/?chain=polkadot#/validator/5F99nAGjKA9cU8JTpuTDWDhqpMxwxmwc8M1ojkKXZAR7Mn2c?mode=history

How can we help to narrow down the cause of this performance issue?

(attached image: PV-MVR-061424, MVR chart from the ONE-T report)

Steps to reproduce

To confirm v1.12.0+ was the issue, we switched back to v1.11.0 for 8+ sessions and saw our MVR return to its previous low values.

We have tested this on Xeon(R) E-2136 CPU @ 3.30GHz and Xeon(R) E-2386G CPU @ 3.50GHz bare-metal servers running NVMe drives.

EDITED to add... Further research seems to indicate an overall increase in missed votes for a significant percentage of validators.

See comment below: https://github.com/paritytech/polkadot-sdk/issues/4800#issuecomment-2178910892

bkchr commented 4 months ago

CC @sandreim @alexggh

alexggh commented 4 months ago

First of all, the drop in performance is small: you are still getting at least A grades, just not A+ as often. With such small margins, I would double-check a few times that it is not just restarts that improve things rather than the version change.

Getting into potential reasons for this happening: I don't think it is a problem with those releases, since ~90% of the validators are already running them and getting A+ grades, so I think it might be a combination of the network or your node being a bit slow and something that changed between those versions. Two pull requests touched that area (statement distribution) between v1.11 and v1.12:

aperture-sandi commented 3 months ago

Thanks for helping out with this. I will address the possible issues with our servers and network below.

that it is not just restarts that improve things

For versions 1.11.0 and previous, our nodes ran for weeks at a time without restart and missed very few votes.

During the "v1.12.0 section of our graph above (snip below), we restarted no less than four times as we moved the node back and forth between our E-2136 and E-2386G severs. Restarts showed no improvement.

(screenshot: v1.12.0 section of the MVR graph)

a combination of network (or your node) being a bit slow

To keep network latency to a minimum we use 16-core MikroTik routers capable of handling 100x the traffic we throw at them. Our routers' CPU usage is almost idle.

(screenshot: router CPU usage)

your node being a bit slow

We thought our E-2136 server might be a bit slow (despite it meeting benchmarks and running v1.11.0 fine), so we switched the DOT node to the E-2386G server, which achieves a score of over 200% on all benchmark tests except memory. As discussed above, the faster server was no help.

(screenshot: benchmark results for the E-2386G server)

aperture-sandi commented 3 months ago

Adding more data to show that this appears to be affecting other validators...

This is a chart showing network-wide DOT performance for v1.11.0. Note the "Distribution by grade" section: the percentage of A+ validators was 89.5% (eras 1445-1448).

Now, this is a chart showing network-wide DOT performance for v1.12.0+. Note the "Distribution by grade" section: the percentage of A+ validators is 71.5% (eras 1449-1478).

ONE-T data only goes back 30 days, so data for v1.11.0 is now limited, but it seems to indicate an overall increase in missed votes.

eskimor commented 3 months ago

Can someone check whether parachain block times also take a hit? Maybe some parachains are forking?

sandreim commented 3 months ago

Two pull requests touched that area (statement distribution) between v1.11 and v1.12:

Could be it. Currently there is a hard limit of 1 concurrent request per peer: https://github.com/paritytech/polkadot-sdk/pull/3444#discussion_r1502521881

I would expect that some validators are not yet upgraded and send more than one request at a time to peers upgraded to v1.12+.
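
To illustrate the mechanism being discussed, here is a minimal sketch of such a per-peer limit (illustration only, not the actual statement-distribution code; `PeerId` and the bookkeeping struct are placeholders):

```rust
use std::collections::HashMap;

/// Placeholder peer identifier; the real node uses a libp2p `PeerId`.
type PeerId = u64;

/// v1.12+ allows only one statement-distribution request in flight per peer.
const MAX_PARALLEL_REQUESTS_PER_PEER: usize = 1;

#[derive(Default)]
struct PerPeerLimit {
    in_flight: HashMap<PeerId, usize>,
}

impl PerPeerLimit {
    /// Returns `false` when the request has to be dropped, which is what the
    /// `..._peer_rate_limit_request_drop_total` metric mentioned later in
    /// this thread counts.
    fn try_start(&mut self, peer: PeerId) -> bool {
        let count = self.in_flight.entry(peer).or_insert(0);
        if *count >= MAX_PARALLEL_REQUESTS_PER_PEER {
            return false;
        }
        *count += 1;
        true
    }

    /// Call once the response to `peer` has been sent.
    fn finish(&mut self, peer: PeerId) {
        if let Some(count) = self.in_flight.get_mut(&peer) {
            *count = count.saturating_sub(1);
        }
    }
}
```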

sandreim commented 3 months ago

Can someone check whether parachain block times also take a hit? Maybe some parachains are forking?

Block times for the past 90 days:

(screenshot: parachain block times over the past 90 days)
sandreim commented 3 months ago

I don't have any metrics for Polkadot, as we don't have any validators running there, but the metrics for Kusama show a clear trend of dropped statement-distribution requests. It seems to be better than it was before, as more validators have upgraded.

However I don't see this issue affecting liveness at all.

(screenshot: Kusama statement-distribution request-drop metrics)
alexggh commented 3 months ago

Adding more data to show that this appears to be affecting other validators...

Digging a bit more through the data, I concur with your finding: it seems that around session 8736 (21st-22nd of May) the number of validators with an A grade instead of A+ increased from 6% to 22%, which also coincides with the day release v1.12 was announced.

Can someone check whether parachain block times also take a hit? Maybe some parachains are forking?

It doesn't look like it from the metrics, and that makes sense because https://apps.turboflakes.io/?chain=polkadot#/insights gives you the grade based on the points you gathered relative to the peers in your group, so in this case if other peers got slightly more points it means they backed the candidate and it ended up on chain.

Just from the data, it looks like you sometimes lose the race to get your Valid/Backed statements included on chain. As of now the most obvious candidate for causing this is https://github.com/paritytech/polkadot-sdk/issues/4800#issuecomment-2179020389, but this needs more investigation to understand the full conditions.

alexggh commented 3 months ago

@aperture-sandi any chance you could provide us with the Grafana values for the following metrics when running v1.12 or v1.13?

avg(rate(polkadot_parachain_statement_distribution_peer_rate_limit_request_drop_total{}[$__rate_interval]))
avg(rate(polkadot_parachain_statement_distribution_max_parallel_requests_reached_total{}[$__rate_interval]))
sandreim commented 3 months ago

Quick update on next steps here:

alexggh commented 3 months ago

Looking at the logs from one of our Kusama validators, I see this: https://grafana.teleport.parity.io/goto/7QCJDgwIR?orgId=1


2024-06-21 16:27:55.766 2024-06-21 13:27:55.757 DEBUG tokio-runtime-worker parachain::gossip-support: New session detected session_index=40004
2024-06-21 15:27:55.947 2024-06-21 12:27:55.942 DEBUG tokio-runtime-worker parachain::gossip-support: New session detected session_index=40003
2024-06-21 14:29:24.594 2024-06-21 11:29:24.594 DEBUG tokio-runtime-worker parachain::availability-store: Candidate included candidate_hash=0xc6d10156031899e715e3a8afa669d970bfb166290921b761cbed5857a3edf4e3 traceID=264272360305836897721701964816212154736
2024-06-21 14:29:24.541 2024-06-21 11:29:24.534 DEBUG tokio-runtime-worker parachain::availability-store: Candidate included candidate_hash=0xc6d10156031899e715e3a8afa669d970bfb166290921b761cbed5857a3edf4e3 traceID=264272360305836897721701964816212154736
2024-06-21 14:29:12.965 2024-06-21 11:29:12.964 DEBUG tokio-runtime-worker parachain::candidate-backing: Refusing to second candidate at leaf. Is not a potential member. candidate_hash=0xc6d10156031899e715e3a8afa669d970bfb166290921b761cbed5857a3edf4e3 leaf_hash=0xa830b4842c1db6ed4c5df3ecebc4854f20e9835662257498eed0cbb07b4ce2d8 traceID=264272360305836897721701964816212154736
2024-06-21 14:27:56.147 2024-06-21 11:27:56.139 DEBUG tokio-runtime-worker parachain::gossip-support: New session detected session_index=40002
2024-06-21 13:34:24.413 2024-06-21 10:34:24.412 DEBUG tokio-runtime-worker parachain::availability-store: Candidate included candidate_hash=0x23132e882af4e3a044deb39fa876e1d469aa653406e34ce2dc74214fc8d2f085 traceID=46622577271950510439480467781380923860
2024-06-21 13:34:13.174 2024-06-21 10:34:13.161 DEBUG tokio-runtime-worker parachain::candidate-backing: Refusing to second candidate at leaf. Is not a potential member. candidate_hash=0x23132e882af4e3a044deb39fa876e1d469aa653406e34ce2dc74214fc8d2f085 leaf_hash=0xd362726d4c4f4e1d5f4fbd8518508df453c26e314465a336837a050c73ce9d5c traceID=46622577271950510439480467781380923860
2024-06-21 13:32:42.752 2024-06-21 10:32:42.751 DEBUG tokio-runtime-worker parachain::availability-store: Candidate included candidate_hash=0xc1da83cf1eb21ca84ad230063e99fc6942892a9904e2488d00883330eca4cf10 traceID=257675597307036949136235535082798578793
2024-06-21 13:32:36.126 2024-06-21 10:32:36.076 DEBUG tokio-runtime-worker parachain::candidate-backing: Refusing to second candidate at leaf. Is not a potential member. candidate_hash=0xc1da83cf1eb21ca84ad230063e99fc6942892a9904e2488d00883330eca4cf10 leaf_hash=0xf8985bf5e6dcf2afa8630b6f4341f65a568392b2d1ef66daff3e0c7be7ba8182 traceID=257675597307036949136235535082798578793
2024-06-21 13:27:55.473 2024-06-21 10:27:55.470 DEBUG tokio-runtime-worker parachain::gossip-support: New session detected session_index=40001
2024-06-21 12:27:55.649 2024-06-21 09:27:55.643 DEBUG tokio-runtime-worker parachain::gossip-support: New session detected session_index=40000
2024-06-21 11:27:55.700 2024-06-21 08:27:55.692 DEBUG tokio-runtime-worker parachain::gossip-support: New session detected session_index=39999

And this correlates exactly with the number of missed statements by the validator: https://apps.turboflakes.io/?chain=kusama#/validator/FSkbnjrAgD26m44uJCyvaBvJhGgPbCAifQJYb8yBYz8PZNA?mode=history. It misses 2 statements at session 40001 and 1 statement at 40002.

The changes have been introduced with https://github.com/paritytech/polkadot-sdk/pull/4035, so most likely that's what is causing the behaviour on your validators. I'll dig deeper into why the condition "Refusing to second candidate at leaf. Is not a potential member" gets triggered.

The good news is that this is happening very rarely, and even when it happens the candidate gets backed anyway, so the parachain does not miss any slot; it's just that the validator will not get the backing points for that single parachain block.

alindima commented 3 months ago

ok I was a bit confused by the logs being in reverse chronological order.

This log can appear if the parachain is producing forks. But the resulting number of on-chain backing votes should not be different from what it was prior to #4035.

Prior to #4035, different parts of the backing group (let's say half and half) could back different forks. The block author would in the end only pick one of the two (so only half of the votes would end up on chain). But the block author will have both candidates to pick from.

After #4035, each half of the validator group would back different forks, but the block author will only ever know about one candidate. In the end, the result should be the same (only half of the votes would end up on chain).

Based on this reasoning this should not affect backing rewards.

However, I think what may be happening is that prior to #4035, each validator in the group would have backed both candidates. That is, even if a validator voted for a candidate, this wouldn't stop the validator from voting for a competing candidate.

aperture-sandi commented 3 months ago

@aperture-sandi any chance you could provide us with the Grafana values for the following metrics when running v1.12 or v1.13?

avg(rate(polkadot_parachain_statement_distribution_peer_rate_limit_request_drop_total{}[$__rate_interval]))
avg(rate(polkadot_parachain_statement_distribution_max_parallel_requests_reached_total{}[$__rate_interval]))

Hi all, I'm out of the office until Tuesday, so I won't be able to get to this until Tuesday or Wednesday, if it is still needed.

sandreim commented 3 months ago

The changes have been introduced with #4035, so most likely that's what is causing the behaviour on your validators. I'll dig deeper into why the condition "Refusing to second candidate at leaf. Is not a potential member" gets triggered.

The good news is that this is happening very rarely, and even when it happens the candidate gets backed anyway, so the parachain does not miss any slot; it's just that the validator will not get the backing points for that single parachain block.

I discussed this further with @alindima offline and we think you are right.

Prior to #4035, different parts of the backing group (let's say half and half) could back different forks. The block author would in the end only pick one of the two (so only half of the votes would end up on chain). But the block author will have both candidates to pick from.

This is almost correct. You would have more than half of the votes, as validators would accept to second the fork as well, which is what changed with this PR.

eskimor commented 3 months ago

A simple fix would be to ramp up the backing threshold to 3 out of 5 on Polkadot. Better, because it always works: deterministic fork selection. Instead of dropping the new import, we compare by CandidateHash and drop the smaller (or greater) one. If we do this, then please don't forget to add it to the implementers' guide.
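
A minimal sketch of the comparison rule being suggested (illustrative only; `CandidateHash` here is just a byte array standing in for the real candidate-receipt hash):

```rust
/// Stand-in for the real candidate hash type (a 32-byte hash).
type CandidateHash = [u8; 32];

/// Deterministic fork selection: when two seconded candidates compete for
/// the same position, every validator keeps the one with the lower hash.
/// Whether "lower" or "greater" wins is arbitrary; what matters is that all
/// validators (and the block author) apply the same rule and so converge on
/// the same fork.
fn select_fork(a: CandidateHash, b: CandidateHash) -> CandidateHash {
    if a <= b { a } else { b }
}
```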

alexggh commented 3 months ago

Double checked: 0xc1da83cf1eb21ca84ad230063e99fc6942892a9904e2488d00883330eca4cf10 and 0x23132e882af4e3a044deb39fa876e1d469aa653406e34ce2dc74214fc8d2f085, and indeed there seems to be a fork in both situations: I see two different candidates with the same relay parent, and our validator validates/seconds the ones that are on the losing side of the fork. These are Asset Hub and Bridge Hub blocks, so it is not clear to me whether a fork is expected there; since they run Aura, I would have expected no forks at all.

A simple fix would be to ramp up the backing threshold to 3 out of 5 on Polkadot

How would that fix this situation? If you are on the losing side of the fork, your vote would still not be counted and rewarded.

burdges commented 3 months ago

Yeah, Aura should not fork except under significant gossip delays, wrong clocks, or other consensus-ish issues.

Are there equivocations on the parachain? Aura equivocations should look like two blocks with the same slot, which incidentally have the same block producer authority.
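
For illustration, the check being described amounts to something like this (the struct and field names are made up for the example; these are not the actual cumulus/Aura types):

```rust
/// Simplified block view for this example.
struct BlockInfo {
    slot: u64,
    authority: [u8; 32], // Aura authority public key
    hash: [u8; 32],      // block hash
}

/// An Aura equivocation in the sense described above: the same authority
/// produced two distinct blocks for the same slot.
fn is_aura_equivocation(a: &BlockInfo, b: &BlockInfo) -> bool {
    a.slot == b.slot && a.authority == b.authority && a.hash != b.hash
}
```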

A simple fix would be to ramp up the backing threshold to 3 out of 5 on Polkadot

At minimum, we should glance at those backing group size charts first, because 2-of-5 chooses 2 < 5/2 for liveness. I'd think 2-of-4 retains liveness, and maybe it reduces this too, but likely the real fix lies elsewhere.

alexggh commented 3 months ago

Are there equivocations on the parachain? Aura equivocations should look like two blocks with the same slot, which incidentally have the same block producer authority.

These are the two blocks I'm talking about: [Good] https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Frpc-bridge-hub-kusama.luckyfriday.io#/explorer/query/0x39e375a33b5b3e345b32a71c6597f8c956aa2eec1ea857489012ae00a1e1295c

[Abandoned but our validator backs it first] https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Frpc-bridge-hub-kusama.luckyfriday.io#/explorer/query/0x0ee1366da44b3513b8e987edb00727c224fa12f0c842871e38285d4bf4c60c58

Looking at the logs they don't seem to be produced by the same validator:

The good one:

relay_parent: 0xf8985bf5e6dcf2afa8630b6f4341f65a568392b2d1ef66daff3e0c7be7ba8182, collator: Public(cc5c5ceaf02e204217eb1d9907318fd9827772923b72e2ca4b583d13dec74460 (HCGiw6p2...))

The abandoned:
relay_parent: 0xf8985bf5e6dcf2afa8630b6f4341f65a568392b2d1ef66daff3e0c7be7ba8182, collator: Public(783689d3e497cbca5397e0444a071e7e2f82f6faf6f6ac0f3ea19caaf7650b66 (FHwTwJjS...))
alexggh commented 3 months ago

Found the root cause of the forks: https://matrix.to/#/!gTDxPnCbETlSLiqWfc:matrix.org/$fO-SigvBAVe1uRY3Kp99yGeUDiHPvfeaL_fCR_tKyoc?via=web3.foundation&via=matrix.org&via=parity.io. It seems that people who run system parachains are running two different instances of a collator with the same authority keys, so when it is their authority's turn they end up proposing two blocks, which then race on the relay chain to get the necessary backing votes.

So, if a validator is backing the losing fork it will not get the rewards, for the reasons explained by @alindima here: https://github.com/paritytech/polkadot-sdk/issues/4800#issuecomment-2182882971.

My suggestion for the path forward here would be:

  1. Short-term, ask system-parachain collators not to run two instances of collators with the same authority keys. I do think this is a wrong setup and Aura wasn't designed to work like this. FYI @eskimor @sandreim @bkchr @skunert, maybe you've got some context on why this is a good/bad idea.

  2. Medium-term, the relay-chain backing protocol should be fixed so that, even in the presence of parachain forks, validators get even rewards and do not depend on the timing of the candidates reaching them.

bkchr commented 3 months ago
  1. Short-term, ask system-parachain collators not to run two instances of collators with the same authority keys. I do think this is a wrong setup and Aura wasn't designed to work like this. FYI @eskimor @sandreim @bkchr @skunert, maybe you've got some context on why this is a good/bad idea.

We should start slashing them. Initially I didn't add slashing for equivocations, because there was no way to determine the relay-chain block associated with a parachain block. However, now we can do this and we should enable it.

Generally the idea is that there is one block per slot from each author.

bkchr commented 3 months ago

Medium-term, the relay-chain backing protocol should be fixed so that, even in the presence of parachain forks, validators get even rewards and do not depend on the timing of the candidates reaching them.

Sounds complicated to me, as a validator could start signing any kind of blob and pretend they voted for something.

alexggh commented 3 months ago
  1. Short-term, ask system-parachain collators not to run two instances of collators with the same authority keys. I do think this is a wrong setup and Aura wasn't designed to work like this. FYI @eskimor @sandreim @bkchr @skunert, maybe you've got some context on why this is a good/bad idea.

We should start slashing them. Initially I didn't add slashing for equivocations, because there was no way to determine the relay-chain block associated with a parachain block. However, now we can do this and we should enable it.

Generally the idea is that there is one block per slot from each author.

They are literally running two instances with the same keys because they think it is good practice. I will tell them to stop this practice for system parachains, since it does more harm than good.

sandreim commented 3 months ago
  1. Short-term, ask system-parachain collators not to run two instances of collators with the same authority keys. I do think this is a wrong setup and Aura wasn't designed to work like this. FYI @eskimor @sandreim @bkchr @skunert, maybe you've got some context on why this is a good/bad idea.

We should start slashing them. Initially I didn't add slashing for equivocations, because there was no way to determine the relay-chain block associated with a parachain block. However, now we can do this and we should enable it.

Generally the idea is that there is one block per slot from each author.

+1. Creating forks will impact parachain block time, so there is a good incentive for parachains to slash their collators if they do it.

On the other hand, theoretically you could even have proof-of-work parachains or some custom block-production consensus that creates forks. In such a scenario, if your validator is assigned to such a parachain it will get fewer rewards than before, so the problem is not solved in the general case.

With the current behaviour of never backing a fork we actually save some CPU that could be used for more useful work. Of course, we trade off some liveness when a fork gets backed with only 2 backers that have bandwidth issues or go offline.

IMO the solution seems to be changing prospective parachains to accept forks. This would be limited to just ensuring we have as many backers as possible; the forks would never be taken into consideration to form a chain, so we keep the current behaviour and make validators happy again.

bkchr commented 3 months ago

With the current behaviour of never backing a fork we actually save some CPU that could be used for more useful work. Of course, we trade off some liveness when a fork gets backed with only 2 backers that have bandwidth issues or go offline.

We are backing forks? The point being that you may only back a fork that doesn't get on chain, and thus you don't get any rewards, as explained by @alexggh above.

Generally I would also say that this is expected. Over time the validator will get its normalized rewards. Also don't we already give more era points for approval voting than backing? So, backing a fork should have even less impact.

burdges commented 3 months ago

@alexggh You've got different collator: Public(.. fields above, so those were not the authority keys for creating the block?

It's harmless for Polkadot if the fork gets backed, just some lost rewards for the backer. We do not want collators equivocating, of course, but we should first ask them to stop. If we then need non-slashing technical measures, we could break out my CRDT trick for authority key registration, aka only one key registered on-chain at a time. We'd then only have real dishonest equivocations from collators, for which parachains would likely eject them.

Also, we're not even paying the approval checkers yet, so our backing rewards look really exaggerated right now. Just naively, validator rewards should break down roughly like 70% approval checks, 8% availability, 7% backing, and 15% BABE & GRANDPA. We're missing like 78% of that allocation, so all current rewards shall shrink by a factor of ~5 once we account for all this critical work, and then people should care less if they back the wrong thing occasionally.
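
Spelling out the arithmetic behind that factor: what is rewarded today is roughly backing plus BABE & GRANDPA, i.e. 7% + 15% = 22% of the target allocation, while the missing approval (70%) and availability (8%) work makes up the other 78%. Once everything is paid, today's rewards would scale to about 22/100 of the total, a shrink factor of roughly 100 / 22 ≈ 4.5, hence "a factor of 5".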

eskimor commented 3 months ago

Great work @alexggh! Indeed, the fork choice already happens before a candidate is fully backed (actually not good), so 3 out of 5 does not help here. Picking by comparing candidate hashes should do the trick, though.

alexggh commented 3 months ago

@alexggh You've got different collator: Public(.. fields above, so those were not the authority keys for creating the block?

Yeah, that was a red herring: those weren't the authority IDs, they are keys that are auto-generated at startup for each instance, which is why I thought these were different collators, but they aren't.

Generally I would also say that this is expected. Over time the validator will get its normalized rewards.

Yes, I think that is what would happen if the forks were more or less random, but in this case it looks like the same set of validators find themselves on the losing side more often. I would assume that is because the fork is deterministic and the same validators keep losing because of their latency to the collator.

Also don't we already give more era points for approval voting than backing

We don't have approval-voting rewards implemented yet, but yeah, once we reduce the backing proportion in the rewards this should actually be a non-issue; even currently the difference in points is just around 1%.

alexggh commented 3 months ago

Asked the system parachain collators to stop running two instances with the same authority keys here: https://matrix.to/#/!gTDxPnCbETlSLiqWfc:matrix.org/$nP852NAScRgGukKtdNHoDEISH_XaqnfgXX5MoSPZy1Q?via=web3.foundation&via=matrix.org&via=parity.io. I'll keep an eye on the dashboards to confirm whether there is any impact.

alindima commented 3 months ago

I've discussed this issue more with @eskimor and @sandreim and we came up with the following solution that we think would work.

In #4035, I removed support for parachain forks in prospective parachains, but added support for unconnected candidates (candidates for which we don't yet know the parent, so that chained candidates can be validated in parallel by different backing groups, needed for elastic scaling). A candidate is currently inserted into prospective parachains when we see a Seconded statement.

The proposal is to have prospective-parachains only build the chain with candidates that have been backed. This way, a single validator that seconds an invalid candidate would not prevent a legitimate fork from being backed. If multiple candidates are backed at the same height, use a deterministic, random and simple way of choosing the best fork (choosing the lower candidate hash for example). This way, we'd ensure that all validators will back the same fork (therefore they'll likely all get the backing reward).

The unconnected candidate storage could hold forks, but the actual chain prospective-parachains builds would not allow for forks (which was the source of great complexity before #4035)

TL;DR: prospective-parachains will again allow for forks between candidates that are not backed, but will not track them once the candidates are backed. Validators will use a deterministic fork selection rule during backing.

This is definitely not a trivial fix that can be done in a couple of days though.
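
For concreteness, a rough sketch of the shape of this proposal (types and names are made up for the example; the real prospective-parachains subsystem is considerably more involved):

```rust
/// Stand-in for the real candidate hash type.
type CandidateHash = [u8; 32];

struct CandidateEntry {
    hash: CandidateHash,
    parent: CandidateHash,
    backed: bool,
}

/// Per-parachain fragment storage as proposed above.
struct FragmentStorage {
    /// May hold forks and candidates whose parent is not yet known.
    unconnected: Vec<CandidateEntry>,
    /// The chain itself only ever contains backed candidates; it starts at
    /// an anchor (the latest included candidate) and never forks.
    best_chain: Vec<CandidateHash>,
}

impl FragmentStorage {
    /// Try to extend the chain: among the backed children of the current
    /// tip, deterministically prefer the lowest candidate hash, so every
    /// validator converges on the same fork.
    fn extend_best_chain(&mut self) {
        let Some(&tip) = self.best_chain.last() else { return };
        if let Some(next) = self
            .unconnected
            .iter()
            .filter(|c| c.backed && c.parent == tip)
            .map(|c| c.hash)
            .min()
        {
            self.unconnected.retain(|c| c.hash != next);
            self.best_chain.push(next);
        }
    }
}
```

The key point is that forks can coexist in the unconnected storage, while the chain that is actually built is a single, deterministically chosen sequence of backed candidates.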

bkchr commented 3 months ago

validators will use a deterministic fork selection rule during backing

Can you explain what you mean by this?

alindima commented 3 months ago

validators will use a deterministic fork selection rule during backing

Can you explain what you mean by this?

I mean that if there is a fork (two different candidates are seconded, having the same parent), the other validators in the backing group will all choose to validate the same candidate, so that they all have the highest chance of getting their backing reward. The block author will also pick the same candidate to back on-chain between the two. This rule could be that we favour the candidate with the lower candidate hash. This is something random that can't be easily manipulated.

alindima commented 3 months ago

Actually, now that I've started prototyping this I'm not sure it's needed to do this comparison during backing, as long as we allow both forks to be backed. It's up to the block author to pick one and both of them could have the same number of votes.

It'd only be more efficient to give up validation of a candidate that we could reasonably know will not be picked by the block author. But it adds a whole lot of complexity for little benefit, the more I think about it.

bkchr commented 3 months ago

It's up to the block author to pick one and both of them could have the same number of votes.

Yeah, this was actually the case before the changes you made in #4035. The general idea of choosing the next block to back based on the lower hash sounds reasonable to me: there could be N forks (N being the number of validators in the backing group), and letting everyone decide in the same way ensures that one of them gets the required number of votes. But then we would maybe need to stop fetching a PoV when we see a smaller hash, or whatever, so it adds complexity again. Maybe something for when we are bored :P

sandreim commented 3 months ago

Actually, now that I've started prototyping this I'm not sure it's needed to do this comparison during backing, as long as we allow both forks to be backed. It's up to the block author to pick one and both of them could have the same number of votes.

It'd only be more efficient to give up validation of a candidate that we could reasonably know will not be picked by the block author. But it adds a whole lot of complexity for little benefit, the more I think about it.

IIUC this means we need to go back to tracking trees rather than chains + unconnected candidates. The idea with the comparison is to use it so that only the chosen fork makes it to the block author (unless the block author is itself in that group). All other forks remain local to the backing group (cluster topology).

burdges commented 3 months ago

I'm not sure it's needed to do this comparison during backing, as long as we allow both forks to be backed. It's up to the block author to pick one and both of them could have the same number of votes.

This was always the idea. We've no control over the order in which backers & block authors receive the candidates & statements, and thus cannot reward every backed candidate, because establishing that control costs too much CPU time & bandwidth. That's fine.