solana-labs / solana

Web-Scale Blockchain for fast, secure, scalable, decentralized apps and marketplaces.
https://solanalabs.com
Apache License 2.0

Stability regression on GCE CPU-only perf testnet #6707

Closed by danpaul000 5 years ago

danpaul000 commented 5 years ago

Problem

The nightly CPU-only 5-node GCE performance testnets have a stability regression. Starting with the run on the night of November 2, the cluster fell over shortly after the test began.

Buildkite job: https://buildkite.com/solana-labs/system-performance-tests/builds/461

Grafana: https://metrics.solana.com:3000/d/testnet-edge/testnet-monitor-edge?var-testnet=gce-edge-perf-cpu-only&from=1572760888997&to=1572761835113&refresh=60s&orgId=2

Commit: https://github.com/solana-labs/solana/commit/9ea398416e5b9388d625fd5f0c5dda5312084422

The GPU-enabled 5-node testnet on the same night and the same commit appeared to run fine:

Grafana: https://metrics.solana.com:3000/d/testnet-edge/testnet-monitor-edge?var-testnet=gce-edge-perf-gpu-enabled&from=1572757262197&to=1572758435533&refresh=60s&orgId=2

BK: https://buildkite.com/solana-labs/system-performance-tests/builds/460

The CPU-only tests had been running with consistent results (~32-35k TPS) since at least Oct 25, before 0.20.0 was released. The last successful nightly CPU-only run was on the night of Oct 31, against commit https://github.com/solana-labs/solana/commit/2d67962c2f70e6c4e70eece0053db96fb1d3a733

Grafana: https://metrics.solana.com:3000/d/testnet-edge/testnet-monitor-edge?var-testnet=gce-edge-perf-cpu-only&from=1572588077795&to=1572589027188&refresh=60s&orgId=2

BK: https://buildkite.com/solana-labs/system-performance-tests/builds/455

danpaul000 commented 5 years ago

cc: @mvines @sagar-solana

sakridge commented 5 years ago

Regression window (newest commit first):

9ea398416 Sign shreds on the GPU (#6595) - (1 hour bad, 1 hour good)
50a17fc00 Use Slot and Epoch type aliases instead of raw u64 (#6693) - (2 good, 1 bad)
f9a9b7f61 Better output layout for iftop logs (#6690) - (1 good, 1 good of 1 hour)
a57f6b70d Fix swapped repair and forwards addrs (#6691) - (1 good)
bae83ba2b Compare iftop logs using log-analyzer (#6684)
385b4ce95 Get rid of verified packets and use the Meta::discard flag (#6674)
7b6e3a23b Add new pubkey to auth keys (#6687)
1cc8956f7 Get Azure provider working again (#6659)
e6c8bfd00 Add --use-move flag to cargo-install-all.sh and net/net.sh (#6670) - (3 good)
2d67962 - (1 hour good)

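The annotations above can be read as a manual bisect. A minimal sketch of that narrowing follows, assuming that a commit with only good runs is actually good. That assumption is weak here, since 50a17fc00 shows both good and bad runs and the failure therefore looks intermittent. The `WINDOW` table and the `narrow` helper are hypothetical names for illustration. The counts are transcribed from the list, commits with no annotation are recorded as zero runs, and the first entry is read as one bad and one good run.

```python
# Regression window, newest commit first: (short hash, good runs, bad runs).
# Counts are transcribed from the annotated list; unannotated commits get (0, 0).
WINDOW = [
    ("9ea398416", 1, 1),
    ("50a17fc00", 2, 1),
    ("f9a9b7f61", 2, 0),
    ("a57f6b70d", 1, 0),
    ("bae83ba2b", 0, 0),
    ("385b4ce95", 0, 0),
    ("7b6e3a23b", 0, 0),
    ("1cc8956f7", 0, 0),
    ("e6c8bfd00", 3, 0),
    ("2d67962", 1, 0),
]


def narrow(window):
    """Return (suspects, oldest_bad, newest_clean) for a newest-first window.

    oldest_bad   : oldest commit with at least one bad run
    newest_clean : newest commit older than oldest_bad with good runs only
    suspects     : commits after newest_clean up to and including oldest_bad
    """
    oldest_first = list(reversed(window))

    # Oldest commit that has failed at least once.
    bad_idx = next(
        (i for i, (_, _, bad) in enumerate(oldest_first) if bad > 0), None
    )
    if bad_idx is None:
        return [], None, None

    # Walk back from there to the nearest commit that was tested and never failed.
    clean_idx = None
    for i in range(bad_idx - 1, -1, -1):
        _, good, bad = oldest_first[i]
        if good > 0 and bad == 0:
            clean_idx = i
            break

    start = 0 if clean_idx is None else clean_idx + 1
    suspects = [h for h, _, _ in oldest_first[start:bad_idx + 1]]
    newest_clean = None if clean_idx is None else oldest_first[clean_idx][0]
    return suspects, oldest_first[bad_idx][0], newest_clean


if __name__ == "__main__":
    suspects, oldest_bad, newest_clean = narrow(WINDOW)
    print("oldest commit with a bad run:", oldest_bad)
    print("newest commit with only good runs below it:", newest_clean)
    print("suspect commits:", suspects)
```

With an intermittent failure, a single good run does not clear a commit, so more runs per commit would be needed before trusting the narrowed range.
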
danpaul000 commented 5 years ago

Unable to reproduce.