solana-labs / solana

Web-Scale Blockchain for fast, secure, scalable, decentralized apps and marketplaces.
https://solanalabs.com
Apache License 2.0

Unintended Consequences from Leader Race #18308

Closed: brianlong closed this issue 9 months ago

brianlong commented 3 years ago

Problem

The race between the current leader "L0" and the next leader in line "L1", described in https://github.com/solana-labs/solana/blob/22a18a68e3ee68ae013d647e62e12128433d7230/docs/src/implemented-proposals/leader-leader-transition.md, is causing skipped slots. See the "Supporting Information" section below for a view of four skipped slots from the perspective of three different validators.

I believe that the race is intended to shorten the gaps in transaction processing when L0 is offline. Instead, with slow network propagation, we see situations where L1 creates a new fork for their 4 slots and then loses out when the voters vote for L0. In cases like this, the ONLY WAY to resolve the conflict is to skip 4 slots. Ironically, the policy intended to minimize the impact of skipped slots is actually causing them.

I see an inconsistency between the relative patience of L1 and the voters. L1 is acting in an impatient manner by jumping ahead to create a new fork. In contrast, the voters will patiently wait for L0's work to arrive before voting. We are asking/begging for skipped slots/forks when L1 and the voters are working from a different set of rules.
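For illustration, here is a minimal Rust sketch of the two rules described above; the names and the grace-period value are assumptions for discussion, not the validator's actual code.

use std::time::Duration;

/// What the next leader (L1) effectively does today: start its own fork once a
/// fixed grace period after the previous leader's last observed progress expires.
fn l1_should_start_own_fork(since_last_shred_from_l0: Duration, grace: Duration) -> bool {
    since_last_shred_from_l0 >= grace
}

/// What a voter does: it has no deadline at all; it votes on L0's block
/// whenever that block is complete (replayed and frozen).
fn voter_should_vote(l0_bank_is_frozen: bool) -> bool {
    l0_bank_is_frozen
}

fn main() {
    // With slow propagation, L1's timer can expire while the voters are still
    // patiently waiting for L0's bank to freeze, producing a doomed fork.
    let waited = Duration::from_millis(1470);
    let grace = Duration::from_millis(800); // assumed value for illustration
    println!(
        "L1 forks: {}, voters vote on L0: {}",
        l1_should_start_own_fork(waited, grace),
        voter_should_vote(false)
    );
}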

Supporting Information

Thanks to @nyet | Blockdaemon & @Zantetsu | Shinobi Systems for sharing logs. I see some inconsistent behavior related to event timing between the leader and the validators. Here are some details for discussion. Relevant slots & leaders:

84414591        93P6 located in Ogden, UT
84414592        FzPo located in Tokyo
...
84414596        DDnA (me) located in Frankfurt, Germany

For all three of us (Block Daemon, BlockLogic, & Shinobi), the transition from 591 => 592 looked normal. Here are some events from Zantetsu's log (timestamps are seconds):

Screen Shot 2021-06-26 at 11 01 52 AM

You'll notice that DDnA started to send shreds for 596 shortly after FzPo sent shreds for 592.

After both banks were frozen, Shinobi voted on 592 about 2.26 seconds after the first shreds arrived and 2.46 seconds after voting for 591. Keep those times in mind as we go along. I think they matter.

For Block Daemon, we also see a smooth handoff from 591 => 592. They vote for 592 about 1.30 seconds after receiving the first shred and 1.74 seconds after voting for 591.

Screen Shot 2021-06-26 at 11 06 54 AM

On my server, the hand-off from 591-592 took a little longer (1.14 seconds). However, it looks like DDnA didn't wait for a reasonable grace period before starting a new fork. The fork for 596 was created 0.32 seconds after the first shred from 592 and 1.47 seconds after voting for 591. The frozen bank for 592 was delivered 1.02 seconds after the first shred and 2.17 seconds after voting for 591.

Screen Shot 2021-06-26 at 11 09 41 AM

So, here is where I see the inconsistent behavior: Shinobi was comfortable waiting over 2 seconds before voting on 592. Yet, DDnA created a new fork less than 1.5 seconds after voting on 591.

If DDnA had waited a similar period, it would have seen the frozen bank for 592 and NOT created the fork at 596.

Proposed Solution Ideas

To resolve the conflict, we need L1 and the voters to share the same behavior. Either the cluster is patient or impatient, not both at the same time. Here are a couple of solution ideas. One solution for a patient cluster and another for the impatient fast-twitch crowd.

1) Patient Cluster: L1 should only jump ahead when L0 is known to be delinquent. Otherwise, wait for the shreds to arrive and proceed normally. Please note that L0 in the example above is located in Tokyo and it will have a longer propagation time to most of the nodes in the cluster. The cluster will be much more tolerant of network propagation delays. The cluster will also have higher transactional throughput because we are not skipping as many slots. (There will still be a maximum wait period to avoid stalls. Both L1 and the voters will use the same wait period.)

2) The Impatient Cluster: The behavior of the voters will be consistent with L1, not L0. The voters will recognize that L0 was slow to send data, and will prefer to vote for L1 in these cases. This solution may have a further unintended consequence of forcing nodes out of locations like Tokyo. We can't be too impatient if we want to see validator nodes all over the world. This solution also means that we are allowing more skipped slots which lowers the transactional throughput of the cluster. Sometimes faster isn't actually faster.
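As a discussion aid, here is a minimal Rust sketch of the two options behind a single shared max-wait parameter; all names and durations are invented for illustration and do not exist in the validator.

use std::time::Duration;

enum ClusterPolicy {
    Patient,   // option 1: L1 only jumps ahead when L0 is known delinquent
    Impatient, // option 2: voters give up on L0 as quickly as L1 does
}

struct TransitionState {
    l0_known_delinquent: bool,
    since_l0_last_progress: Duration,
}

// Whether the next leader may start its own fork now.
fn l1_may_start_fork(policy: &ClusterPolicy, s: &TransitionState, max_wait: Duration) -> bool {
    match policy {
        ClusterPolicy::Patient => s.l0_known_delinquent || s.since_l0_last_progress >= max_wait,
        ClusterPolicy::Impatient => s.since_l0_last_progress >= max_wait,
    }
}

// Whether a voter should prefer L1's fork over waiting for L0.
fn voter_prefers_l1(policy: &ClusterPolicy, s: &TransitionState, max_wait: Duration) -> bool {
    match policy {
        ClusterPolicy::Patient => false, // keep waiting for L0's frozen bank
        ClusterPolicy::Impatient => s.since_l0_last_progress >= max_wait,
    }
}

fn main() {
    let s = TransitionState {
        l0_known_delinquent: false,
        since_l0_last_progress: Duration::from_millis(1500),
    };
    let max_wait = Duration::from_millis(2000); // one shared cap for everyone
    for policy in [ClusterPolicy::Patient, ClusterPolicy::Impatient] {
        println!(
            "L1 may fork: {}, voters prefer L1: {}",
            l1_may_start_fork(&policy, &s, max_wait),
            voter_prefers_l1(&policy, &s, max_wait)
        );
    }
}

Either way, the key property is that L1 and the voters consult the same rule, so a fork L1 creates is one the voters will actually vote for.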

We'll need to debate the Patient v. Impatient cluster before arriving at a final solution. Once decided, the path forward will be clear.

brianlong commented 3 years ago

I didn't discuss network topology above, but we need to take into account that the network delay for any leader's first slot will always be higher than slots 2-4. In the example above, the data for slot 591 traveled straight-ish from Ogden, UT to my node in Frankfurt, DE because that was 93P6's 4th slot.

For slot 592, I need to wait for the final shred of 591 to be sent from Ogden, UT to Tokyo, JP before FzPo can start building 592. Then the first shred of 592 is sent from Tokyo to Frankfurt. For slots 593-595, we don't have the extra hop because FzPo is building on its own blocks.

For every leader transition, we need to insert NN milliseconds of 'sync time' when waiting for the first shred of the new leader's first slot.
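For example, a back-of-the-envelope sketch of that budget in Rust (the slot time is the nominal 400ms; the sync value is a hypothetical stand-in for NN):

fn main() {
    let slot_ms = 400u64;         // nominal slot duration
    let slots_per_leader = 4u64;
    let sync_ms = 100u64;         // hypothetical stand-in for the NN ms above

    // Today the schedule implicitly assumes the first shred of a new leader's
    // first slot arrives as fast as shreds within a leader's own turn.
    let plain_turn_ms = slots_per_leader * slot_ms;
    // With sync time, the extra hop (previous leader -> new leader -> cluster)
    // is explicitly budgeted at every 4-slot boundary.
    let synced_turn_ms = plain_turn_ms + sync_ms;

    println!("turn without sync: {plain_turn_ms} ms, with sync: {synced_turn_ms} ms");
    // The cost is small compared to skipping a whole 4-slot turn
    // whenever the handoff loses the race.
    println!("one skipped turn costs {plain_turn_ms} ms");
}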

For anyone who has rowed in an 8-person rowing shell, you will know intuitively how this works. There is a slight pause near 'the catch' for all of the rowers to get in sync before the next power stroke.

The rhythm of the cluster looks like 1-2-3-4-sync-1-2-3-4-sync-1-2-3-4 and so on.

jon-chuang commented 3 years ago

This is a related issue: https://github.com/solana-labs/solana/issues/18044 I suggested that the next leader should be more patient if it has a high ping to the previous leader. Just one idea.

ryoqun commented 3 years ago

wow. what a cool thing that we got a well-reasoned bug report, complete with experiments, from our community. I'm proud of it.

Recently, I was bothered by skip rates on testnet as well. I was observing similar behavior there and thinking about patient leaders, too.

@behzadnouri @carllin summoning consensus expert and gossip/turbine expert :)

sakridge commented 3 years ago

@brianlong Do we have the full logs from this?

So, here is where I see the inconsistent behavior: Shinobi was comfortable waiting over 2 seconds before voting on 592. Yet, DDnA created a new fork less than 1.5 seconds after voting on 591.

If DDnA had waited a similar period, it would have seen the frozen bank for 592 and NOT created the fork at 596.

I wouldn't describe Shinobi waiting 2 seconds before voting as comfortable. It waited until it had actually completed the block and frozen it. So it didn't wait any longer than it had to assuming it wants to know the block is actually complete before voting on it.

This is different than waiting to start your leader block which happens when the node thinks it waited the appropriate amount of time to receive blocks from previous leaders.

brianlong commented 3 years ago

logs_84414596_shinobi.txt.zip

brianlong commented 3 years ago

logs_84414596_block_daemon.txt.zip

brianlong commented 3 years ago

logs_84414596_bl-fra.txt.zip

brianlong commented 3 years ago

Hi @sakridge, thanks for looking into this. See logs posted above. I can also upload a bunch of logs from non-voting RPC servers if that will be helpful.

"This is different than waiting to start your leader block which happens when the node thinks it waited the appropriate amount of time to receive blocks from previous leaders."

This is the point that I'm making. There are different concepts for the "appropriate amount of time." L1 seemed to give up waiting on L0 much earlier than Shinobi or Block Daemon did when voting. When L1 creates the fork, we are guaranteed to skip 4 slots from either L0 or L1.

When I look at solana block-production --verbose, I see that most of the skipped slots are in groups of four like the pattern above. This looks like a common problem and we can greatly improve cluster throughput by fixing it.

sakridge commented 3 years ago

There are different concepts for the "appropriate amount of time."

There's not. That 2 seconds to vote on 592 wasn't baked into the validator code anywhere. It wasn't "waiting" any more than usual: all the validator ever does is wait for blocks and vote on them as fast as it can. So that time was purely determined by when the other validators managed to send the Shinobi node the block.

If it had had access to better blocks earlier, then it would have voted on those instead.

brianlong commented 3 years ago

@sakridge Why didn't DDnA also follow the same logic?

sakridge commented 3 years ago

@sakridge Why didn't DDnA also follow the same logic?

In which operation?

brianlong commented 3 years ago

When it created the fork for 596. Shreds for 592 were streaming in at that time, and we were still within the window of time for L0's 4 slots. L1 (DDnA) jumped the gun

sakridge commented 3 years ago

When it created the fork for 596. Shreds for 592 were streaming in at that time, and we were still within the window of time for L0's 4 slots. L1 (DDnA) jumped the gun

592 window is the first 400ms after 591. DDnA waited 4 slots (1.4s) just to receive 592 and it doesn't know if it would even complete the block or not.

brianlong commented 3 years ago

And this is where the "Unintended Consequences from Leader Race" come into the conversation. What is the benefit of the leader race? We are currently skipping 23.724% of slots and most of those are coming in groups of 4. The race is clearly causing skipped slots. Does the benefit outweigh the cost?

brianlong commented 3 years ago

Delinquency on mainnet is usually quite low -- currently 0.16%. The current skip rate is not from delinquent nodes.

brianlong commented 3 years ago

There is another data point that stands out to me. The duration of 596 on each of the machines.

Block Logic (Frankfurt, DE): 0.574s from new fork to vote
Block Daemon (London, GB): 0.889s from the first shred to frozen bank
Shinobi (Vancouver, CA): 1.794s from the first shred to frozen bank

I am assuming that the network latency from Block Logic to the other two nodes is neutralized because I'm only looking at the local duration on each node.

That's a huge gap between Shinobi & the EU validators. I wonder how much of that time was lost in Turbine/Repair? The leader for 592 was also a great distance away in Tokyo. I realize that is only a single data point, but it's worth looking into. I'll review the logs again.

The unintended consequences I described above might be exacerbated by messy shred transmissions. Improvements there might make this less of a problem...

sakridge commented 3 years ago

There is another data point that stands out to me. The duration of 596 on each of the machines.

Block Logic (Frankfurt, DE): 0.574s from new fork to vote
Block Daemon (London, GB): 0.889s from the first shred to frozen bank
Shinobi (Vancouver, CA): 1.794s from the first shred to frozen bank

I am assuming that the network latency from Block Logic to the other two nodes is neutralized because I'm only looking at the local duration on each node.

That's a huge gap between Shinobi & the EU validators. I wonder how much of that time was lost in Turbine/Repair? The leader for 592 was also a great distance away in Tokyo. I realize that is only a single data point, but it's worth looking into. I'll review the logs again.

The unintended consequences I described above might be exacerbated by messy shred transmissions. Improvements there might make this less of a problem...

The transmission time for Shinobi is this:

[2021-06-25T20:18:56.183733786Z INFO solana_metrics::metrics] datapoint: retransmit-first-shred slot=84414596i
[2021-06-25T20:18:56.705423772Z INFO solana_metrics::metrics] datapoint: shred_insert_is_full total_time_ms=746i slot=84414596i
[2021-06-25T20:18:57.977454866Z INFO solana_core::replay_stage] bank frozen: 84414596

From this I think it had to repair the slot, which is probably why it took longer: repair_stats: [(84414593, 83), (84414591, 49), (84414596, 28)

bank frozen came quite late, 1 second after

From the compute stats:

[2021-06-25T20:18:57.977461038Z INFO  solana_metrics::metrics] datapoint:
replay-slot-stats slot=84414596i fetch_entries_time=1764i
fetch_entries_fail_time=0i
entry_poh_verification_time=84630i
entry_transaction_verification_time=3886i
replay_time=109265i
replay_total_elapsed=1322518i
total_entries=687i total_shreds=464i
check_us=1241i load_us=16248i execute_us=153461i store_us=33352i serialize_us=43234i create_vm_us=10011i execute_inner_us=49037i deserialize_us=21030i changed_account_count=92i total_account_count=4816i total_data_size=425437527i data_size_changed=5139787i

Actual compute time was only 110ms. This seems to have taken a lot of time: voting_elapsed=1002470

This seems related to an issue we are currently working on to make this faster.
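For readability, here are those figures converted from microseconds (assuming the _us and _elapsed fields are microseconds, as their names suggest):

fn main() {
    // Values copied from the replay-slot-stats line and the voting figure above.
    let samples = [
        ("replay_time", 109_265u64),
        ("replay_total_elapsed", 1_322_518u64),
        ("voting_elapsed", 1_002_470u64),
    ];
    for (name, us) in samples {
        println!("{name}: {:.0} ms", us as f64 / 1_000.0);
    }
    // Roughly 109 ms of actual replay versus about 1,002 ms around voting,
    // which matches the point that voting, not compute, dominated here.
}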

brianlong commented 3 years ago

@sakridge Thanks again for digging into this issue. That data is super-helpful.

brianlong commented 3 years ago

Please let me know if I have visualized this correctly:

Legend for the diagram below.
  LEADER = A Leader's turn in the schedule. Each number represents a different
           Leader.
  SLOTS  = The four Slots allocated to each Leader. The scheduled duration of
           each Slot is determined by PoH (400ms).
  WORK   = The actual duration of the blocks processed in each Leader's
           turn. The Work includes network time, compute time, repair time,
           voting time, etc. The repeated number indicates the Leader.

LEADER: |   1   |   2   |   3   |   4   |   5   |   6   |
SLOTS:  |1-2-3-4|1-2-3-4|1-2-3-4|1-2-3-4|S-K-I-P|1-2-3-4|
WORK:   |111111111|222222222|333333333|444444444|666666666|

When the actual duration of Work diverges from the fixed schedule, the only way to maintain the schedule is to skip Slots.

In the example above, the actual Work time for each Leader exceeds the allocated time for their four Slots. By the time we get to the 4th Leader, the actual duration of Work has consumed 5 Slots. To keep the Work in sync with the schedule, we skip all four of the 5th Leader's Slots. The schedule is back in sync starting with Leader 6.

In the impatient scenario, the "Leader Race" is the mechanism that provides for skipping Slots to maintain the schedule. In the patient scenario, we accept that block times are longer in order to skip fewer slots.
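Here is a toy Rust simulation of that diagram, assuming a 400ms slot and that each producing leader's actual work takes 2,000ms per 1,600ms turn; the numbers are illustrative only.

fn main() {
    let slot_ms = 400.0;
    let turn_ms = 4.0 * slot_ms; // scheduled time for a leader's 4 Slots
    let work_ms = 2_000.0;       // assumed actual Work per producing leader

    let mut drift_ms = 0.0; // how far real Work lags the PoH schedule
    for leader in 1..=6 {
        if drift_ms >= turn_ms {
            // This leader's whole window was consumed by earlier overruns;
            // the only way back onto the schedule is to skip its 4 Slots.
            drift_ms -= turn_ms;
            println!("leader {leader}: SKIPPED (drift now {drift_ms} ms)");
        } else {
            drift_ms += work_ms - turn_ms;
            println!("leader {leader}: produces (drift now {drift_ms} ms)");
        }
    }
}

With these numbers the drift grows by 400ms per producing leader, so every 5th leader's turn gets skipped, which is the 1-2-3-4-SKIP pattern drawn above.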

brianlong commented 3 years ago

Those skipped slots are not benign -- they are responsible for forks and failed transactions.

Take a moment to look at this situation from a different point of view. For example, when a price oracle is sending transactions during turn 5 in the schedule, the oracle expects to receive price results in a timely manner. If the skipped slots cause the RPC server to go off on a minority fork, the oracle will fail to receive results within its SLA window. Missing data points are very bad for an oracle.

Solution Ideas:

A) Adjust the schedule to be closer to the actual work. For example, add 100ms of sync time in the leader transition to accommodate the extra data hop between the 4th slot of one leader and the 1st slot of the next leader.
B) Reduce the actual work time. For example, improvements to Turbine propagation or voting time. Or, reduce the number of transactions in a block so that all of the required work can be done on schedule. The transaction limit can then be raised with productivity improvements while staying within the schedule.
C) Both A & B

brianlong commented 3 years ago

BTW, that price oracle problem above is real. We are over-provisioning RPC servers N:1 in the hope that at least 1 of the N servers is on the majority fork (highest root).
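For what it's worth, the selection side of that over-provisioning can be as simple as routing reads to whichever server reports the highest root; a minimal Rust sketch (server names and roots are placeholders, and the actual polling via getSlot or similar is elided):

fn pick_best_rpc<'a>(roots: &[(&'a str, u64)]) -> Option<(&'a str, u64)> {
    // Prefer the server whose root is furthest along.
    roots.iter().copied().max_by_key(|&(_, root)| root)
}

fn main() {
    let observed = [
        ("rpc-a.example.com", 86_767_658u64),
        ("rpc-b.example.com", 86_767_726u64),
        ("rpc-c.example.com", 86_767_730u64),
    ];
    if let Some((server, root)) = pick_best_rpc(&observed) {
        println!("routing reads to {server} at root {root}");
    }
}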

ryoqun commented 3 years ago

@brianlong well, I think RPC's confirmed commitment should solve the oracle problem because there should be no rollbacks/minor forks under that commitment?

ryoqun commented 3 years ago

also, I'm starting to see various observations from my rainbow toy:

image

firstly, here's the legend (sorry for the scribbled letters, drawn by touchpad)... lol

image

ryoqun commented 3 years ago

so, there were 3 forks at the time:

image


# common ancestor
[2021-07-10T11:52:51.290611598Z INFO  solana_runtime::bank] bank frozen: 84384151 hash: 6pMQ19uSrWXvULNqPRpkq4tbgj5DpnVwSjaXPdwBNAmj accounts_delta: 7xjmD8YFc2UcAQJJYHBEW2Sb9d9y4k1AUKzW8TuonTFR signature_count: 1697 last_blockhash: t1wHoRjFmwtETHL1jCCzwsH8mr6aHr8S2hGJvCb4PdD capitalization: 1372107650748206043

# leader 1

[2021-07-10T11:52:55.432744619Z INFO  solana_runtime::bank] bank frozen: 84384152 hash: AmMxzydr7dwdQeNpRLYkMa4FWN4FZ7o5CJMAAT7vrTeR accounts_delta: 6kNF15etcG7zHDu47hYWhvZ8yiX6tQQz3T2TpHC37517 signature_count: 1737 last_blockhash: 6LCzwXeejEDrxkVz92atxBhtKSqErzR8shEUTnH8YD9p capitalization: 1372107650743863543
[2021-07-10T11:52:55.676501540Z INFO  solana_runtime::bank] bank frozen: 84384153 hash: VLtNtWMd9C1FzLm6BjuodVaNZSpaHSFon7G5Vjswoge accounts_delta: ADxtrherg21c7t7kri3YNkRAvUBALzXRFiv7u719Lho8 signature_count: 1400 last_blockhash: DZNdW1md5gfudksXMo387Rtu6q6U5xGJ3otH4tXWEsMV capitalization: 1372107650740363543
[2021-07-10T11:52:55.949529607Z INFO  solana_runtime::bank] bank frozen: 84384154 hash: Cov2RpLWywiBvhJA7c6VPmHdp1SXNTeRMbfx1E3MppSs accounts_delta: DXZm7oHHzLgEe6xhgWAARr5Bo5F18eogdDbCFS8CJovE signature_count: 1373 last_blockhash: 3FTkEUuP4r91yZurb8ZM1CqVCzaQNvybzX8d6zUZN8Kg capitalization: 1372107650736931043
[2021-07-10T11:52:56.333091669Z INFO  solana_runtime::bank] bank frozen: 84384155 hash: E2rWdVk14ZS9JC4asSYMPL7JrJvaJRYg7dN8RZedjRG2 accounts_delta: 7ruoqiL3uvMoKuHHZJxzKURnCXbjWDV1RP12etja9dM3 signature_count: 1060 last_blockhash: 4wH6NcuhGqHSzWqY23LSHVUKK2uJx2P47JM9m6c6VeL2 capitalization: 1372107650734281043

# leader 2

[2021-07-10T11:52:56.182545907Z INFO  solana_runtime::bank] bank frozen: 84384156 hash: ZwAdogBX6FdNngrqKjh93XF2jSVqKHgpDHLDXkgusKz accounts_delta: ASCd4ufG3d9eMMMNvdPvFDTkeZiDvNX6SxNfvgQWYJiQ signature_count: 1014 last_blockhash: 2qXzDNJdTsbNrkykxFg6pcwhDQCoA8HB8w6719GtfWk5 capitalization: 1372107650745669824
[2021-07-10T11:52:56.525585006Z INFO  solana_runtime::bank] bank frozen: 84384157 hash: 6E4m6jX5fqeeWFkxzUVBizmD9fvhoArDNXzLZd88pGTm accounts_delta: 5DFrWhst8H5tpu9ZJtguZFyzouw68L7bik3KPk774hkN signature_count: 307 last_blockhash: PV3WDAAc6NDqfFtHPAJ6gCxkGJFjCvLxpzKkqSp8kWY capitalization: 1372107650744902324
[2021-07-10T11:52:56.796636287Z INFO  solana_runtime::bank] bank frozen: 84384158 hash: 29hS8hJ9YT4W6toQ4UJQTZxMAYeSNYHR7uH6jhzoW8PH accounts_delta: D7rNeCGrkoExNE1NKq7PrJNmLpMtMBvePp99UuoDuGkt signature_count: 90 last_blockhash: AucekQkeYbLPWL4Qt8BDk9pW5v3W17yRtvcRAMZpj7GM capitalization: 1372107650744677324
[2021-07-10T11:52:56.954019045Z INFO  solana_runtime::bank] bank frozen: 84384159 hash: DWVciMcG1UeDzSyjwupW6Zcj5WKv7wixTqhoMAV2HmUf accounts_delta: Entt59HzzaUroHu6q25qFRkNLsqMG2PDtYpnbuCa5VkT signature_count: 718 last_blockhash: 5UNbfMxddrsUC2BBixB1MVvpxEAREUF6sE4fAzbcfzzC capitalization: 1372107650742881105

# leader 3

[2021-07-10T11:52:57.370618866Z INFO  solana_runtime::bank] bank frozen: 84384160 hash: 9zuqetuzpWT2hYj2Ve2Va2P4nNRRJg1conVpjzEgDpff accounts_delta: 8sb1WSPtkmozbd1DcBVadhSu6kSF5pRwvEhZRMohrqGf signature_count: 1214 last_blockhash: 3QS9ZD6PmVv2pm3m4TLZAQte3QZt7aeW3tg8EJi8MLDY capitalization: 1372107650745168604
[2021-07-10T11:52:57.836210211Z INFO  solana_runtime::bank] bank frozen: 84384161 hash: ALAB7qoSyLgZmYsoXrgXvHX1pjbG4d2vXsT9K81f1SU1 accounts_delta: 2T6FLPBZzaXFc661U5GVyBydTHEqMmRfrpmXdZr6iYAf signature_count: 745 last_blockhash: 8sw41LwYpBQi6feZHZgobCz8QFTRAZNvHoKgzdYNZ8hG capitalization: 1372107650743306104
[2021-07-10T11:52:58.039836032Z INFO  solana_runtime::bank] bank frozen: 84384162 hash: ELB6bkjN9d8v1hc68YfGf8oskrQMHxYX17F2DrrgMN7p accounts_delta: 65gAVxKNNfSrpAy2a7Ar7x4ACTE7A3uacpbsEn9memJW signature_count: 754 last_blockhash: 8bCCuf88zW4F5vDXAoucj3wsJXmGHtuRHFD7s1LvNBtV capitalization: 1372107650741421104
[2021-07-10T11:52:58.174655723Z INFO  solana_runtime::bank] bank frozen: 84384163 hash: 6KsoRmCpNS9TfvN7GZBwRanySiMgivyrXw3x4RCV1afp accounts_delta: Fa4YyaxVbSQmSS6zkiUw1BbezQTmtGucLYNSLg4VrRnA signature_count: 89 last_blockhash: 5KYyhnPSB9vhcSaXwu6TTSRmDH2pNszTYQWBYqHQLyn capitalization: 1372107650741198604
ryoqun commented 3 years ago

the root cause of the forks is that leader 1's blocks are really slow to propagate (it seems this is partly dependent on the transaction count in a block; the signature_count values in the above log messages give an approximation):

image

the odd part is that leader 1's 4th slot got full across the cluster before its 1st slot did:

image

ryoqun commented 3 years ago

then, leader 2 and leader 3 naively continue to create a minor fork even after they supposedly had received is_full signals for leader 1's fork:

image

Then their forks are abandoned by the whole cluster... I guess they would be better off waiting for leader 1 more patiently, or at least interrupting their fork and holding their remaining leader slots to connect to leader 1's fork as soon as the is_full signal is received?
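A rough Rust sketch of that idea, with invented names (this is not how replay_stage behaves today): on seeing is_full for the previous leader's fork, the current leader either resets onto it or goes quiet for its remaining slots instead of extending a doomed fork.

#[derive(Debug)]
enum LeaderAction {
    KeepBuilding,
    AbandonRemainingSlots,
    ResetOntoPreviousLeaderFork { new_parent: u64 },
}

fn on_is_full_signal(building_minor_fork: bool, competing_slot: u64, can_switch_vote: bool) -> LeaderAction {
    if !building_minor_fork {
        LeaderAction::KeepBuilding
    } else if can_switch_vote {
        // Connect the remaining leader slots to the now-complete competing fork.
        LeaderAction::ResetOntoPreviousLeaderFork { new_parent: competing_slot }
    } else {
        // Better to go quiet than to keep extending a fork the cluster will abandon.
        LeaderAction::AbandonRemainingSlots
    }
}

fn main() {
    // Leader 2 above: already building a minor fork, sees is_full for leader 1's
    // last slot, but cannot yet generate a switching proof.
    println!("{:?}", on_is_full_signal(true, 84_384_155, false));
}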

ryoqun commented 3 years ago

lastly, optimistic confirmation seems slow via gossip (maybe not turbine, because there were no blocks in that timeframe; it can't be helped because the 2 following leaders jumped the gun):

image

are these under a normal distribution with 2x variance (just eyeballing the statistics... lol)?

carllin commented 3 years ago

@ryoqun thanks for the great analysis, this tool is super useful for narrowing down cluster-wide behavior!

So it seems like there's some correlation between number of transactions and how long things take to propagate. Combined with this observation:

the odd part is that leader 1's 4th slot got full across the cluster before its 1st slot did:

I wonder if this is a failure in turbine to propagate some number of the blocks. You would think that as long as turbine propagated at least 50% of the blocks, we should get recovery (unless erasure recovery is also really slow?).

For that block where leader 1's 4th slot got full across the cluster before its 1st slot, we can look at the logs/metrics to see what percentage of those shreds was propagated via turbine successfully for each node, by summing across the retransmit-slot-num-packets datapoint for a specific slot S
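A rough log-crunching sketch in Rust of that summation (the example lines and the num_packets field name are placeholders; check the actual datapoint fields before relying on them):

fn field_u64(line: &str, key: &str) -> Option<u64> {
    // metrics fields look like `key=1234i`
    let start = line.find(key)? + key.len();
    let rest = &line[start..];
    let end = rest.find('i')?;
    rest[..end].parse().ok()
}

fn main() {
    // Made-up sample lines in the shape of the datapoint named above.
    let logs = [
        "datapoint: retransmit-slot-num-packets slot=84414592i num_packets=120i",
        "datapoint: retransmit-slot-num-packets slot=84414592i num_packets=95i",
        "datapoint: retransmit-slot-num-packets slot=84414593i num_packets=80i",
    ];
    let target_slot = 84_414_592u64;
    let total_shreds = 464u64; // e.g. total_shreds from a replay-slot-stats line

    let via_turbine: u64 = logs
        .iter()
        .filter(|l| l.contains("retransmit-slot-num-packets"))
        .filter(|l| field_u64(l, "slot=") == Some(target_slot))
        .filter_map(|l| field_u64(l, "num_packets="))
        .sum();

    println!(
        "slot {target_slot}: {via_turbine}/{total_shreds} shreds via turbine ({:.0}%)",
        100.0 * via_turbine as f64 / total_shreds as f64
    );
}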

sakridge commented 3 years ago

@jbiseda this might be a good one to look at: why is turbine taking so long here?

@carllin in this case there was no hope for leader 1 to start its first slot (84414596) from 84414595, but potentially it could have chosen 84414595 as the parent of 84414598 and/or 84414599. That might have hedged the bet that 84414595 would not be the chosen fork. I also wonder what kind of other propagation signals (epoch slots/votes) might be there that could have indicated that was a smart thing to do. What do you think?

carllin commented 3 years ago

@sakridge hmmm yeah, this is an interesting case.

there was no hope for leader 1 to start its first slot (84414596) from 84414595, but potentially it could have chosen 84414595 as the parent of 84414598 and/or 84414599.

Clarification: I think you mean 84414592 not 84414595, where 84414592 is where the other forks are building off of?

Analysis:

First, it's good to note some background (tagging @jstarry, @AshwinSekar since they were part of this review):

1) The fork choice rule currently does factor in votes from gossip that haven't landed in a block via latest_validator_votes_for_frozen_banks, which is a mapping from each validator to the greatest slot they've cast a vote for. That structure is updated here https://github.com/solana-labs/solana/blob/47dad2390a159403fcea549d5784d3773390e231/core/src/replay_stage.rs#L443-L447. This means that votes that have propagated for 84414592 should make it into the fork choice.

2) Switching proofs factor in gossip votes through the same latest_validator_votes_for_frozen_banks: https://github.com/solana-labs/solana/blob/47dad2390a159403fcea549d5784d3773390e231/core/src/consensus.rs#L791-L835. However, because we don't reason about lockouts for gossip votes (only votes in vote states have lockouts), then switching proofs only include gossip votes that are greater than our last vote: https://github.com/solana-labs/solana/blob/47dad2390a159403fcea549d5784d3773390e231/core/src/consensus.rs#L796.
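As a simplified restatement of point 2), here is a Rust sketch with invented types (not the consensus.rs implementation): gossip-only votes add stake toward a switching proof only when they are for a slot greater than our own last vote, because we cannot reason about their lockouts.

use std::collections::HashMap;

type Slot = u64;
type Stake = u64;

fn switch_proof_stake_from_gossip(
    latest_gossip_votes: &HashMap<&str, (Slot, Stake)>, // validator -> (latest voted slot, stake)
    our_last_vote: Slot,
    candidate_slot: Slot,
) -> Stake {
    latest_gossip_votes
        .values()
        .filter(|&&(slot, _)| slot == candidate_slot && slot > our_last_vote)
        .map(|&(_, stake)| stake)
        .sum()
}

fn main() {
    let mut votes = HashMap::new();
    votes.insert("validator-a", (84_414_592u64, 10u64));
    votes.insert("validator-b", (84_414_592u64, 15u64));

    // Our last vote is 84414596, so gossip votes for 84414592 cannot count,
    // and no switching proof can be assembled from them.
    let stake = switch_proof_stake_from_gossip(&votes, 84_414_596, 84_414_592);
    println!("stake usable toward a switch proof: {stake}");
}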

Behavior we're observing:

So essentially we have an issue that looks like

       |----------- 84414592 (major fork)
84414591
       | ------------84414596 (DDnAqxJVFo2GVTujibHt5cjevHMSE9bo8HJaydHoshdp forked off and created this fork)

Given that this node DDnAqxJVFo2GVTujibHt5cjevHMSE9bo8HJaydHoshdp had already voted on its own fork: [2021-06-25T20:18:56.662858613Z INFO solana_core::replay_stage] voting: 84414596 0

before the bank 84414592 was frozen: [2021-06-25T20:18:56.784362551Z INFO solana_core::replay_stage] bank frozen: 84414592

Then this node would pick 84414592 as the heaviest bank and reset to it, even if it's already build some of its leader slots on 84414591, only if it saw sufficient votes in gossip for 84414592 such that:

1) There were more votes for 84414592 than for its own fork at 84414596
2) Given 1), there were sufficient votes for 84414592 to generate a switching proof. In this scenario, however, it would be impossible to generate a switching proof based on gossip votes for 84414592, because votes for 84414592 are less than our last vote for 84414596 (recall the description of switching proofs and gossip votes in 2) above).

So indeed, based on the logs we kind of see a scenario where 1) happens but 2) fails briefly:

<1) Last common ancestor>
[2021-06-25T20:18:54.606335912Z INFO  solana_core::replay_stage] bank frozen: 84414591

<2) We create fork 84414591->84414596>
[2021-06-25T20:18:56.089158009Z INFO  solana_core::replay_stage] new fork:84414596 parent:84414591 (leader) root:84414541
[2021-06-25T20:18:56.662858613Z INFO  solana_core::replay_stage] voting: 84414596 0

<3) Major fork's ancestor 84414592 is received and frozen>
[2021-06-25T20:18:55.845568712Z INFO  solana_core::replay_stage] new fork:84414592 parent:84414591 root:84414541
[2021-06-25T20:18:56.784362551Z INFO  solana_core::replay_stage] bank frozen: 84414592

<4) We extend our fork 84414591->84414596->84414597>
[2021-06-25T20:18:56.733194658Z INFO  solana_core::replay_stage] new fork:84414597 parent:84414596 (leader) root:84414541

<5) Gossip finally sees 52% of the votes for 84414591, so the choice for the heaviest bank is between our fork at 84414596, and the main fork at 84414592>
[2021-06-25T20:18:56.803767830Z INFO  solana_core::cluster_slot_state_verifier] check_slot_agrees_with_cluster()
slot: 84414591,
root: 84414541,
frozen_hash: Some(DxSeioHLYRqsKXR6D8pHLFT4zdv2YR4LayMqq8fqWQbR),
update: DuplicateConfirmed

<6) Nice! We recognized the major fork 84414592 as heavier than our fork, but we couldn't generate a switching proof, so we don't reset our PoH to the major fork. This means we will continue to build on our own leader fork. The reason we did this in the past (when we didn't include gossip votes in fork choice and switch thresholds as described above) was to avoid a liveness problem described here: https://github.com/solana-labs/solana/blob/47dad2390a159403fcea549d5784d3773390e231/core/src/replay_stage.rs#L2178-L2194>
[2021-06-25T20:18:56.808214077Z INFO  solana_core::replay_stage] Waiting to switch vote to 84414592, resetting to slot None for now

<7) Now we saw enough votes for our fork at 84414596, to decide that our own fork is the heaviest again, and we vote on it>
[2021-06-25T20:18:57.087714858Z INFO  solana_runtime::bank] bank frozen: 84414597 hash: Gx8fjuEPkCn2MABx1ywm4zwYfxVug5JKeJ4xc3qbVFBZ
[2021-06-25T20:18:57.143173306Z INFO  solana_core::replay_stage] voting: 84414597 0

In summary:

1) > but potentially it could have chosen 84414595 as the parent of 84414598 and/or 84414599

To @sakridge 's point here, the switching threshold prevented us from building 84414598 and 84414599 on the major fork 84414592. It could be possible that it's safe now to change the reset bank behavior in 6) above to reset to the heaviest fork, even if we fail to generate a switching threshold to the heaviest fork (still have to reason this through). However, say we did reset to 84414592 and built 84414597 on top of 84414592, are we still behind the target? This is because we got 84414592 2 seconds late, and maybe the tip of the major fork is actually something like 84414593 at this point?

2) Step 7) above makes me think that we didn't see a lot of votes for 84414592 because we ultimately decided to vote on our own fork again. I also don't see a duplicate confirmed log similar to 5) for slot 84414592, so I know we received less than 52% of the cluster's votes exactly on 84414592 (we only count votes where the latest slot == 84414592, so it's possible they skipped this vote due to the threshold check, but I don't see any later descendants getting duplicate confirmed in the logs either). Even if we modified the switching threshold reset behavior in 1), we still wouldn't reset to 84414592 if we don't see it as the heaviest fork.

sakridge commented 3 years ago

@sakridge hmmm yeah, this is an interesting case.

there was no hope for leader 1 to start its first slot (84414596) from 84414595, but potentially it could have chosen 84414595 as the parent of 84414598 and/or 84414599.

Clarification: I think you mean 84414592 not 84414595, where 84414592 is where the other forks are building off of?

Yes sorry, slot 84414592.

brianlong commented 3 years ago

@brianlong well, I think RPC's confirmed commitment should solve the oracle problem because there should be no rollbacks/minor forks under that commitment?

Thanks so much for your excellent work on this issue. Regarding servers being stuck on a lower root, here is an example from earlier today on mainnet RPC:

#  Root            Server
0  86767658  sg3.rpcpool.com
1  86767658  sg1.rpcpool.com
2  86767658  fra6.rpcpool.com
3  86767658  nyc4.rpcpool.com
4  86767726  nyc1.rpcpool.com
5  86767729  hk1.rpcpool.com
6  86767729  sfo1.rpcpool.com
7  86767730  ams1.rpcpool.com
8  86767730  dus3.rpcpool.com
9  86767730  dus1.rpcpool.com

For server #s 0-3, the root is 68 to 72 slots lower than the other six servers. For defi apps, I'm assuming that servers 0-3 will return inaccurate results until their root catches up with the rest of the cluster. I hope that a smooth transition from one leader to the next can eliminate most of the forks that we see. The problems being solved here will make a huge difference for the apps.

ryoqun commented 3 years ago

@brianlong well, I think RPC's confirmed commitment should solve the oracle problem because there should be no rollbacks/minor forks under that commitment?

Thanks so much for your excellent work on this issue. Regarding servers being stuck on a lower root, here is an example from earlier today on mainnet RPC:

#  Root            Server
0  86767658  sg3.rpcpool.com
1  86767658  sg1.rpcpool.com
2  86767658  fra6.rpcpool.com
3  86767658  nyc4.rpcpool.com
4  86767726  nyc1.rpcpool.com
5  86767729  hk1.rpcpool.com
6  86767729  sfo1.rpcpool.com
7  86767730  ams1.rpcpool.com
8  86767730  dus3.rpcpool.com
9  86767730  dus1.rpcpool.com

For server #s 0-3, the root is 68 to 72 slots lower than the other six servers. For defi apps, I'm assuming that servers 0-3 will return inaccurate results until their root catches up with the rest of the cluster. I hope that a smooth transition from one leader to the next can eliminate most of the forks that we see. The problems being solved here will make a huge difference for the apps.

yeah, definitely the current forky cluster is contributing to the differing root confirmation timings across nodes. And improving the skip rate directly improves UX in general.

However, if your defi apps are concerned enough about timing, it would be better for the defi apps to do their read-consistency management on the client side, or for the rpcpools to implement session affinity. After all, minor forks are inevitable no matter what, due to delinquent leaders or small across-the-globe networking glitches.

Personally, I care about these consistency issues (I'm a nit-picky guy... lol), so I'm intending to fix this at the web3 level, but I haven't had time so far: https://github.com/solana-labs/solana/issues/15705
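For completeness, a minimal Rust sketch of what the client-side read-consistency management mentioned above can look like (invented names; real apps would track this per commitment level): remember the highest slot already observed and refuse to go backwards when a load balancer hands out a lagging node.

struct ConsistencyGuard {
    highest_seen_slot: u64,
}

impl ConsistencyGuard {
    fn new() -> Self {
        Self { highest_seen_slot: 0 }
    }

    /// Accept a response tagged with `context_slot` only if using it cannot
    /// produce a backwards (stale) read relative to what we already served.
    fn accept(&mut self, context_slot: u64) -> bool {
        if context_slot < self.highest_seen_slot {
            return false; // this node is behind what the app has already seen
        }
        self.highest_seen_slot = context_slot;
        true
    }
}

fn main() {
    let mut guard = ConsistencyGuard::new();
    assert!(guard.accept(86_767_726));  // a caught-up node
    assert!(!guard.accept(86_767_658)); // a node ~68 slots behind: reject and retry elsewhere
}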

sakridge commented 3 years ago

@brianlong well, I think RPC's confirmed commitment should solve the oracle problem because there should be no rollbacks/minor forks under that commitment?

Thanks so much for your excellent work on this issue. Regarding servers being stuck on a lower root, here is an example from earlier today on mainnet RPC:

For server #s 0-3, the root is 68 to 72 slots lower than the other six servers. For defi apps, I'm assuming that servers 0-3 will return inaccurate results until their root catches up with the rest of the cluster. I hope that a smooth transition from one leader to the next can eliminate most of the forks that we see. The problems being solved here will make a huge difference for the apps.

I wouldn't say for 100% certain, but I don't think those differences are really due to forking. All the nodes are essentially processing the forks in the same way. There are probably timing differences between nodes though in terms of receiving network packets or encountering processing delays.

I think providing logs or metrics to compare 2 of the nodes with a large difference might tell us something about it.

edit... I noticed I disagree a bit with Ryo here. A new root for all nodes could be delayed because of forking, but if one node has rooted a fork and another hasn't, it should just mean that the slower node has not yet processed blocks which are already available in the network.

brianlong commented 3 years ago

I spent some time looking at another sequence of skipped slots from the point of view of three RPC nodes. This is the slot sequence from epoch 202 (annotated with location):

87286584        DWvDTSh3qfn88UoQTEKRV2JnLt5jtJAVoiCo3ivtMwXP         199610-DE-Frankfurt am Main
87286585        DWvDTSh3qfn88UoQTEKRV2JnLt5jtJAVoiCo3ivtMwXP         199610-DE-Frankfurt am Main
87286586        DWvDTSh3qfn88UoQTEKRV2JnLt5jtJAVoiCo3ivtMwXP         199610-DE-Frankfurt am Main
87286587        DWvDTSh3qfn88UoQTEKRV2JnLt5jtJAVoiCo3ivtMwXP         199610-DE-Frankfurt am Main

87286588        FzPoMxzQJRH87NVy1YJF8BqjrygzLLHz8MYjEZPhQHnL SKIPPED 16509-JP-Tokyo
87286589        FzPoMxzQJRH87NVy1YJF8BqjrygzLLHz8MYjEZPhQHnL SKIPPED 16509-JP-Tokyo
87286590        FzPoMxzQJRH87NVy1YJF8BqjrygzLLHz8MYjEZPhQHnL SKIPPED 16509-JP-Tokyo
87286591        FzPoMxzQJRH87NVy1YJF8BqjrygzLLHz8MYjEZPhQHnL SKIPPED 16509-JP-Tokyo

87286592        2Ce1dmtdnvgDwD1MiZjP9wwGup6j8H7uXGuGMa9uZo5v         16276-FR-Roubaix
87286593        2Ce1dmtdnvgDwD1MiZjP9wwGup6j8H7uXGuGMa9uZo5v         16276-FR-Roubaix
87286594        2Ce1dmtdnvgDwD1MiZjP9wwGup6j8H7uXGuGMa9uZo5v         16276-FR-Roubaix
87286595        2Ce1dmtdnvgDwD1MiZjP9wwGup6j8H7uXGuGMa9uZo5v         16276-FR-Roubaix

87286596        AZhAEf79gCXexHTasaByZQS8fks73aCUkDW8rRVLiiXT SKIPPED 16509-JP-Tokyo
87286597        AZhAEf79gCXexHTasaByZQS8fks73aCUkDW8rRVLiiXT SKIPPED 16509-JP-Tokyo
87286598        AZhAEf79gCXexHTasaByZQS8fks73aCUkDW8rRVLiiXT SKIPPED 16509-JP-Tokyo
87286599        AZhAEf79gCXexHTasaByZQS8fks73aCUkDW8rRVLiiXT SKIPPED 16509-JP-Tokyo

87286600        tDw82DpNQwC4sUqRt5zCihm1N9ktta51ZyXbWewMsWp  SKIPPED 6509-US-America/Chicago
87286601        tDw82DpNQwC4sUqRt5zCihm1N9ktta51ZyXbWewMsWp  SKIPPED 6509-US-America/Chicago
87286602        tDw82DpNQwC4sUqRt5zCihm1N9ktta51ZyXbWewMsWp  SKIPPED 6509-US-America/Chicago
87286603        tDw82DpNQwC4sUqRt5zCihm1N9ktta51ZyXbWewMsWp  SKIPPED 6509-US-America/Chicago

87286604        8UQ45XfTAHb8n6kMhix3MAsinKggHTGja4LE6nnQsJPf SKIPPED 16509-IE-Dublin
87286605        8UQ45XfTAHb8n6kMhix3MAsinKggHTGja4LE6nnQsJPf SKIPPED 16509-IE-Dublin
87286606        8UQ45XfTAHb8n6kMhix3MAsinKggHTGja4LE6nnQsJPf SKIPPED 16509-IE-Dublin
87286607        8UQ45XfTAHb8n6kMhix3MAsinKggHTGja4LE6nnQsJPf SKIPPED 16509-IE-Dublin

87286608        DWvDTSh3qfn88UoQTEKRV2JnLt5jtJAVoiCo3ivtMwXP SKIPPED 199610-DE-Frankfurt am Main
87286609        DWvDTSh3qfn88UoQTEKRV2JnLt5jtJAVoiCo3ivtMwXP SKIPPED 199610-DE-Frankfurt am Main
87286610        DWvDTSh3qfn88UoQTEKRV2JnLt5jtJAVoiCo3ivtMwXP SKIPPED 199610-DE-Frankfurt am Main
87286611        DWvDTSh3qfn88UoQTEKRV2JnLt5jtJAVoiCo3ivtMwXP SKIPPED 199610-DE-Frankfurt am Main

87286612        XkCriyrNwS3G4rzAXtG5B1nnvb5Ka1JtCku93VqeKAr  SKIPPED 15169-US-North Charleston
87286613        XkCriyrNwS3G4rzAXtG5B1nnvb5Ka1JtCku93VqeKAr  SKIPPED 15169-US-North Charleston
87286614        XkCriyrNwS3G4rzAXtG5B1nnvb5Ka1JtCku93VqeKAr  SKIPPED 15169-US-North Charleston
87286615        XkCriyrNwS3G4rzAXtG5B1nnvb5Ka1JtCku93VqeKAr  SKIPPED 15169-US-North Charleston

87286616        3RXKQBRv7xKTQeNdLSPhCiD4QcUfxEQ12rtgUkMf5LnS SKIPPED 51167-DE-Düsseldorf
87286617        3RXKQBRv7xKTQeNdLSPhCiD4QcUfxEQ12rtgUkMf5LnS SKIPPED 51167-DE-Düsseldorf
87286618        3RXKQBRv7xKTQeNdLSPhCiD4QcUfxEQ12rtgUkMf5LnS SKIPPED 51167-DE-Düsseldorf
87286619        3RXKQBRv7xKTQeNdLSPhCiD4QcUfxEQ12rtgUkMf5LnS SKIPPED 51167-DE-Düsseldorf

87286620        91yWMWM2wtsftTtgE1HA1bseRhLAg7yCuHg5GKFHn9r3         16509-CN-Asia/Shanghai
87286621        91yWMWM2wtsftTtgE1HA1bseRhLAg7yCuHg5GKFHn9r3         16509-CN-Asia/Shanghai
87286622        91yWMWM2wtsftTtgE1HA1bseRhLAg7yCuHg5GKFHn9r3         16509-CN-Asia/Shanghai
87286623        91yWMWM2wtsftTtgE1HA1bseRhLAg7yCuHg5GKFHn9r3         16509-CN-Asia/Shanghai

I have also attached a zip file with logs & a spreadsheet showing a side-by-side comparison. There are a few points that stand out to me.

  1. The sequence of locations from Europe => Japan => Europe => Japan may have contributed to the problem due to network latency.
  2. After a leader delivered the first shred of its first slot, the average time to deliver the first shred of slots 2-4 was over 0.9 seconds per slot. We should expect to see a high skipped slot rate until this elapsed time is reduced.
  3. At two points, the leader created a fork from an out-of-sequence parent slot during their 4-slot sequence (e.g. switched forks). When that happened, the time to deliver the first shred was over 3.3 seconds. Delivering data after that much time doesn't make sense because the cluster has already moved on. In a case like this, it might make more sense for the leader to abandon its remaining slots instead of building on the new fork. The cluster can use the remaining slot times to sync up and be ready for the next leader.

skips-20210716.zip

brianlong commented 3 years ago

I'm not sure what "abandon its remaining slots" above will actually look like. Deliver no blocks? Just ticks? Say 'pass' like when playing poker?

ryoqun commented 3 years ago

(CodePen - consensus2.pdf: this is the rainbow chart for the same time window)

I'll dig in later...

ryoqun commented 3 years ago

@brianlong sorry, I couldn't reply in a timely manner at all. I was too busy fixing the skip problem... still working on this. And as part of this, I've finally made the rainbow tool accessible (hopefully) for all, though it's still buggy and slow: https://codepen.io/ryoqun/full/JjNbeyE You can input your validator identity and see how it produced blocks (and possibly how you won the main fork if there are competing forks).

ryoqun commented 3 years ago

As an illustration, this is how to read the chart. Hopefully labor-intensive log comparison is now a thing of the past:

image

brianlong commented 3 years ago

@ryoqun Woah! That's a cool trick. I'll need a little time to dig into this. Thanks!