status-im / nimbus-eth2

Nim implementation of the Ethereum Beacon Chain
https://nimbus.guide
Other
517 stars 222 forks source link

Low sync committee performance with Vouch #5324

Open yorickdowne opened 1 year ago

yorickdowne commented 1 year ago

Describe the bug

Nimbus 23.4.0 with Vouch 1.7.5, sync committee performance on mainnet was consistently in the 90% range. Replacing Nimbus with Teku resolved it.

Testing this on mainnet is a no-go, but: I have 1k validators on Goerli w/ Vouch. While Goerli has its own issues, maybe that's good enough to find the issue? This one is Nimbus 23.6.1.

If so, we can use this ticket to sleuth a bit.

To Reproduce

Run Vouch with multiple CL:EL, one of them being Nimbus:something.

eth-goerli-consensus-1  | INF 2023-08-19 16:45:48.883+00:00 Sync committee message sent                message="(slot: 6329629, beacon_block_root: \"fa7b0eca\", validator_index: 394114, signature: \"a400b7c2\")" delay=-3s117ms56us567ns
eth-goerli-consensus-1  | INF 2023-08-19 16:45:48.883+00:00 Sync committee message sent                message="(slot: 6329629, beacon_block_root: \"fa7b0eca\", validator_index: 473561, signature: \"b555d5cb\")" delay=-3s116ms861us507ns
eth-goerli-consensus-1  | INF 2023-08-19 16:46:00.946+00:00 Sync committee message sent                message="(slot: 6329630, beacon_block_root: \"407728aa\", validator_index: 394012, signature: \"b20739e2\")" delay=-3s53ms631us52ns
eth-goerli-consensus-1  | INF 2023-08-19 16:46:00.946+00:00 Sync committee message sent                message="(slot: 6329630, beacon_block_root: \"407728aa\", validator_index: 480741, signature: \"b0d1da70\")" delay=-3s53ms185us191ns
eth-goerli-consensus-1  | INF 2023-08-19 16:46:00.947+00:00 Sync committee message sent                message="(slot: 6329630, beacon_block_root: \"407728aa\", validator_index: 394114, signature: \"94059343\")" delay=-3s52ms976us593ns
eth-goerli-consensus-1  | INF 2023-08-19 16:46:00.947+00:00 Sync committee message sent                message="(slot: 6329630, beacon_block_root: \"407728aa\", validator_index: 473561, signature: \"b7239556\")" delay=-3s52ms748us968ns

https://goerli.beaconcha.in/validator/473561, 76% or so

yorickdowne commented 1 year ago

Will update Nimbus and turn on debug logs. Hopefully that sheds more light. If there's no clue, we can go to trace.

yorickdowne commented 1 year ago

Delay is better by a good second after restart

eth-goerli-consensus-1  | INF 2023-08-19 16:53:49.695+00:00 Sync committee message sent                message="(slot: 6329669, beacon_block_root: \"b8c7ba25\", validator_index: 394012, signature: \"8a5be233\")" delay=-2s304ms129us161ns
eth-goerli-consensus-1  | INF 2023-08-19 16:53:49.696+00:00 Sync committee message sent                message="(slot: 6329669, beacon_block_root: \"b8c7ba25\", validator_index: 480741, signature: \"a6c194f7\")" delay=-2s304ms54us679ns
eth-goerli-consensus-1  | INF 2023-08-19 16:53:49.696+00:00 Sync committee message sent                message="(slot: 6329669, beacon_block_root: \"b8c7ba25\", validator_index: 394114, signature: \"b1b8976f\")" delay=-2s303ms971us629ns
eth-goerli-consensus-1  | INF 2023-08-19 16:53:49.696+00:00 Sync committee message sent                message="(slot: 6329669, beacon_block_root: \"b8c7ba25\", validator_index: 473561, signature: \"ab2de91d\")" delay=-2s303ms921us808ns
yorickdowne commented 1 year ago

Initial debug logs after restart on 23.7.0

nimbus-debug.log.gz

yorickdowne commented 1 year ago

Checking participation on Goerli overall ... with this many blocks missing, we may not find a cause. 76% is in line with what you'd expect on a network with <80% participation.

Let me know whether you're keen to find root cause. If so, I can run something by Izzy, suggest to use 1,000 keys on mainnet as a sacrificial lamb for one sync committee period and no longer, and get you either debug or trace logs from it, whichever you prefer.

etan-status commented 1 year ago

What validator client are you using?

yorickdowne commented 1 year ago

Vouch 1.7.5. See first post :)

Fellow NOs opine that Vouch in particular causes Nimbus to struggle with sync committee, whereas Vouch does not cause sync committee issues with Teku, Lighthouse, Prysm, and Lodestar.

etan-status commented 1 year ago

Going through your validator history, this doesn't seem to be a Nimbus specific issue.

https://goerli.beaconcha.in/validator/473561#sync

Red are:

  1. 6329660 --> BN was not running, as you just restarted for collecting debug logs
  2. 6329661 --> You just restarted the BN, in your logs it is evident that peers are still being looked for
  3. 6329662 --> Same, "No peers for topic, skipping publish"
  4. 6329726 --> You were missed, but overall only ~200 participation in this block, compared to the usual ~400, suggesting this to be a block proposer problem, not yours.
  5. 6329739 --> Block proposer created a block but did not include any sync committee messages, not your fault
  6. 6329761 --> Similar as fourth one, but only ~150 participation in this block. Not specific to you.
  7. 6329799 --> Same as 5, block proposer did not include any sync messages

All the others were simply missed blocks (yellow "missed"), meaning it was not possible to get your signature in.

Could you please elaborate on "Replacing Nimbus with Teku resolved it."?

yorickdowne commented 1 year ago

You’re looking at the Goerli validator. Let’s get back to the original mainnet use case.

I have validators running for Lido. Vouch is the VC, and then there are three CL:EL. Lighthouse:Erigon, Teku:Besu and Nimbus:Nethermind. Sync participation was sub optimal, and I’d get alerts for it from monitoring. I asked other NOs about it and heard that while Nimbus VC to Nimbus CL works perfectly, Vouch to Nimbus CL caused issues for one other NO as well. They did however not gather debug logs, and instead chose to just solve the issue by replacing Nimbus.

I did likewise and replaced all instances of Nimbus with Teku, and all the alerts stopped, everything is working well.

This GitHub issue is me offering to work with you in figuring out root cause. You are right we can’t do it on Goerli. So, I’d need to reintroduce Nimbus on mainnet for a while, gather either debug or trace logs, see whether the issue is still around. If so, remove Nimbus again, and upload logs.

Messing with sync performance in production is a little iffy, but Lido is very interested in client diversity, so I think for a time limited test, we can do this.

etan-status commented 1 year ago

One difference in behaviour between Nimbus VC and Vouch is that Nimbus VC tends to wait until 4s into the slot before sending the sync committee message, which brings the delay close to 0s.

In case of vouch, it seems to immediately send sync committee messages as soon as a new block is received, which you can confirm in the logs with the negative delays. Delays are measured relative to the 4s mark, if it's early, it's negative, if it's late, it's a positive number.

As you are suggesting that the Nimbus VC works fine, maybe vouch is simply a bit too aggressive with sending its sync committee message, and Teku maybe delays the message for a bit internally before trying to send it on the network. It is difficult to reason about that without logs, though. From your Goerli logs, it doesn't seem that we are too early, but it's earlier than what happens with the Nimbus VC.

yorickdowne commented 1 year ago

OK. I'll give this whirl in the coming week and we can see how it behaves.

yorickdowne commented 1 year ago

Getting a temperature check from the Lido community before I kick this off: https://research.lido.fi/t/mainnet-sync-committee-performance-test/5253

yorickdowne commented 11 months ago

We ran into another issue instead, which may however be the same issue tbh. Once #5446 is fixed we will test again and update both issues.

catwith1hat commented 3 months ago

@yorickdowne Do you maybe have an update on this issue? Thank you.

yorickdowne commented 3 months ago

We haven’t seen issues recently. I believe this is resolved