solana-labs / solana

Web-Scale Blockchain for fast, secure, scalable, decentralized apps and marketplaces.
https://solanalabs.com
Apache License 2.0
13.19k stars 4.3k forks source link

Performance issue with repair add-votes #13138

Open sakridge opened 4 years ago

sakridge commented 4 years ago

Problem

I noticed repair is taking a long time sometimes on my node connected to mainnet:

(datetime.timedelta(seconds=4, microseconds=478894), '[2020-10-23T21:50:54.864529096Z INFO  solana_core::repair_service] repairs?:
(datetime.timedelta(seconds=6, microseconds=451140), '[2020-10-23T20:01:20.019819539Z INFO  solana_core::repair_service] repairs?:
(datetime.timedelta(seconds=6, microseconds=517963), '[2020-10-23T19:34:03.921828406Z INFO  solana_core::repair_service] repairs?:
(datetime.timedelta(seconds=10, microseconds=442312), '[2020-10-23T20:01:10.152698978Z INFO  solana_core::repair_service] repairs?
(datetime.timedelta(seconds=107, microseconds=432769), '[2020-10-23T21:50:50.385635730Z INFO  solana_core::repair_service] repairs
(datetime.timedelta(seconds=157, microseconds=389314), '[2020-10-23T19:52:35.284777026Z INFO  solana_core::repair_service] repairs
(datetime.timedelta(seconds=200, microseconds=488747), '[2020-10-23T19:49:28.355419060Z INFO  solana_core::repair_service] repairs
(datetime.timedelta(seconds=282, microseconds=304765), '[2020-10-23T19:27:53.721043396Z INFO  solana_core::repair_service] repairs
(datetime.timedelta(seconds=310, microseconds=451186), '[2020-10-23T19:33:04.172229374Z INFO  solana_core::repair_service] repairs
(datetime.timedelta(seconds=480, microseconds=148743), '[2020-10-23T20:00:59.710386692Z INFO  solana_core::repair_service] repairs
(datetime.timedelta(seconds=676, microseconds=32030), '[2020-10-23T21:46:06.378873244Z INFO  solana_core::repair_service] repairs?
(datetime.timedelta(seconds=1562, microseconds=996158), '[2020-10-23T21:34:35.724483580Z INFO  solana_core::repair_service] repair

Seems that add-votes-elapsed is taking up to hundreds of seconds:

[2020-10-23T20:01:10.159809234Z INFO  solana_metrics::metrics] datapoint: serve_repair-repair-timing
set-root-elapsed=0i
get-votes-elapsed=76845i
add-votes-elapsed=9711878i    <<<<---
get-best-orphans-elapsed=381i
get-best-shreds-elapsed=545767i
send-repairs-elapsed=6628i

Proposed Solution

Optimize

sakridge commented 4 years ago

cc @carllin

carllin commented 4 years ago

Oof! Was the validator really far behind/starting from older snapshot? This may be happening when the validator is really far behind and there's a lot of slots it needs to chain from the tip of the cluster back to its current root.

For cases where the validator is really far behind, should we have an alternate forward repair path that just asks trusted validators for roots (instead of doing the chaining back from the tip and repairing orphans)?