Seems OK now; restarting the VM probably helped get past the GCP throttling (until it gets tagged again). Leaving this issue open for now to monitor.
Issue has been observed again
We could send some txs in that loop while polling get_last_id, to force the server to create a new one. Another fix could be to lower the tx_count; that might help.
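For illustration, a minimal sketch of that polling workaround. The Client trait, send_noop_transaction, and the 100 ms delay here are placeholders for illustration, not the real client API:

```rust
// Hypothetical sketch of the workaround described above: keep sending a
// transaction inside the polling loop so the leader has something to register,
// which should force a new last_id. The Client trait and its methods are
// placeholders, not the real client API.
use std::{thread::sleep, time::Duration};

trait Client {
    fn get_last_id(&self) -> [u8; 32];
    fn send_noop_transaction(&self);
}

fn poll_for_new_last_id(client: &impl Client, old: [u8; 32]) -> [u8; 32] {
    loop {
        // Nudge the leader with a transaction so it has a reason to roll last_id.
        client.send_noop_transaction();
        let id = client.get_last_id();
        if id != old {
            return id;
        }
        sleep(Duration::from_millis(100));
    }
}

// Dummy client so the sketch runs standalone.
struct DummyClient;
impl Client for DummyClient {
    fn get_last_id(&self) -> [u8; 32] {
        [1; 32]
    }
    fn send_noop_transaction(&self) {}
}

fn main() {
    let new_id = poll_for_new_last_id(&DummyClient, [0; 32]);
    assert_eq!(new_id, [1; 32]);
}
```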
I bet that once #718 lands and validators vote once a second regardless this'll get better
I've not seen this reproduce recently. Opportunistically closing until it reoccurs
This is reproducing on the testnet when CUDA is disabled.
I've had some success reproducing this issue locally, and have discovered I cannot reproduce it when I swap this .par_iter() for a normal .iter():
https://github.com/solana-labs/solana/blob/313fed375cd1ff631ac1511cf444cf1aeeda0e81/src/banking_stage.rs#L63
Essentially it looks like the leader's banking_stage is getting stuck while deserializing packets, which causes last_id to stall and eventually time out bench-tps.
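For reference, a minimal sketch of the shape of that change. Packet lengths stand in for the real deserialization work; this is not the actual banking_stage.rs code:

```rust
// Minimal sketch of the change under test (stand-in deserialization work, not
// the actual banking_stage.rs code): swapping rayon's parallel iterator for a
// plain serial one over the received packets.
use rayon::prelude::*;

fn process_parallel(packets: &[Vec<u8>]) -> Vec<usize> {
    // original shape: process each packet on the rayon thread pool
    packets.par_iter().map(|p| p.len()).collect()
}

fn process_serial(packets: &[Vec<u8>]) -> Vec<usize> {
    // workaround shape: same work, ordinary serial iterator, no hang observed
    packets.iter().map(|p| p.len()).collect()
}

fn main() {
    let packets = vec![vec![0u8; 64]; 8];
    assert_eq!(process_parallel(&packets), process_serial(&packets));
}
```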
Could this be a rayon bug?
We're on rayon 1.0.0 and 1.0.2 is available, but upgrading to 1.0.2 didn't help.
par_iter() was falling over for @pgarg66 when the rayon thread pool was exhausted by the size of the work queue and the work required IO... not sure if the last condition was required to make it fall over, though
Reproduces 100% with:
export RAYON_LOG=1
export RAYON_NUM_THREADS=1
Reproduces 0% (4/4 attempts) with:
export RAYON_LOG=1
export RAYON_NUM_THREADS=128
Using https://raw.githubusercontent.com/rayon-rs/rayon/master/scripts/analyze.sh on a run with RAYON_NUM_THREADS=1:
$ bash analyze.sh log-leader.txt
Injected jobs: 32
Joins: 312
Popped jobs: 310 = rhs 310 + other 0
Found work: 12
Stole work: 0
Uninjected: 12
Join balance: 2
Inj. balance: 20
Total balance: -22
I also see that FoundWork { worker: 0, yields: 0 } log messages halt after the stuck par_iter().
The other user of .par_iter() is https://github.com/solana-labs/solana/blob/313fed375cd1ff631ac1511cf444cf1aeeda0e81/src/sigverify.rs#L82-L92, which is interesting because it's a nested par_iter(). I bet that's not common...
Removing the outer into_par_iter() prevents the deadlock.
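A minimal sketch of that nested shape and the change that avoids the hang. The modulo check stands in for real signature verification; this is not the actual sigverify.rs code:

```rust
// Minimal sketch of the nested-parallelism shape in sigverify.rs (stand-in
// verification logic, not the actual code): an outer into_par_iter() whose
// closure itself runs an inner par_iter().
use rayon::prelude::*;

fn verify_nested(batches: Vec<Vec<u8>>) -> Vec<Vec<bool>> {
    batches
        .into_par_iter() // outer parallel iterator -- the one whose removal avoids the deadlock
        .map(|batch| {
            batch
                .par_iter() // inner parallel iterator over each packet batch
                .map(|byte| byte % 2 == 0) // stand-in for ed25519 verification
                .collect()
        })
        .collect()
}

fn verify_outer_serial(batches: Vec<Vec<u8>>) -> Vec<Vec<bool>> {
    batches
        .into_iter() // serial outer loop; inner batches are still verified in parallel
        .map(|batch| batch.par_iter().map(|byte| byte % 2 == 0).collect())
        .collect()
}

fn main() {
    let batches = vec![vec![1u8, 2, 3], vec![4, 5, 6]];
    assert_eq!(verify_nested(batches.clone()), verify_outer_serial(batches));
}
```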
> par_iter() was falling over for @pgarg66 when the rayon thread pool was exhausted by the size of the work queue and the work required IO... not sure if the last condition was required to make it fall over, though
@pgarg66, if you recall, were you using the CUDA build when you ran into the rayon deadlocks before? I hope the answer is "No, was not using CUDA"!
Ok so ^ was a wrong turn.
What actually appears to be happening is that bench-tps shovels a truckload of transactions at ed25519_verify_cpu(), and then expects the network to advance to a new last_id within 30 seconds. But in the meantime, ed25519_verify_cpu() is chugging along and taking 60 seconds or more to verify them all. I believe the right fix is to split up the pending unverified transactions better, so that the network can continue to advance in parallel with completing the full verification. Changes are needed in both sigverify_stage.rs and bench-tps for this.
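A hypothetical sketch of that direction. The chunk size, the verify() stand-in, and the channel plumbing are assumptions for illustration, not the actual sigverify_stage.rs change:

```rust
// Hypothetical sketch of the direction described above (not the actual
// sigverify_stage.rs change): verify the backlog in bounded chunks and forward
// each verified chunk immediately, instead of blocking until the entire batch
// is verified, so the rest of the pipeline can keep advancing last_id.
use std::sync::mpsc::channel;

const VERIFY_CHUNK_SIZE: usize = 1024; // assumed tuning knob, not a real constant

fn verify(packet: &[u8]) -> bool {
    // stand-in for ed25519 signature verification
    !packet.is_empty()
}

fn main() {
    let pending: Vec<Vec<u8>> = vec![vec![0u8; 64]; 10_000];
    let (verified_tx, verified_rx) = channel();

    for chunk in pending.chunks(VERIFY_CHUNK_SIZE) {
        // Do a bounded amount of verification work, then hand the results
        // downstream right away rather than waiting for the whole backlog.
        let results: Vec<bool> = chunk.iter().map(|p| verify(p)).collect();
        verified_tx.send(results).unwrap();
    }
    drop(verified_tx);

    let forwarded: usize = verified_rx.iter().map(|v| v.len()).sum();
    assert_eq!(forwarded, 10_000);
}
```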
NB: while this is currently only reproducible with a CPU verifier, the bug appears to be architectural, so if you shovel enough transactions at a GPU verifier the same issue will occur.
Doesn't seem to reproduce as often when I run bench-tps locally against the same net. GCP throttling the testnet client instance might have something to do with it.