solana-labs / solana

Web-Scale Blockchain for fast, secure, scalable, decentralized apps and marketplaces.
https://solanalabs.com
Apache License 2.0

bench-tps fails periodically on testnet with "Error: Couldn't get new last id!" #764

Closed · mvines closed this issue 6 years ago

mvines commented 6 years ago

Doesn't seem to reproduce as often when I run bench-tps locally against the same net. GCP throttling the testnet client instance might have something to do with it

mvines commented 6 years ago

#765 will hopefully surface information that might help debug this

mvines commented 6 years ago

Seems ok now, restarting the VM probably helped get past the GCP throttling (until it gets tagged again). Leaving this issue open for now to monitor.


mvines commented 6 years ago

Issue has been observed again

sakridge commented 6 years ago

We could send some transactions in that loop while polling get_last_id, to force the server to create a new one. Another fix could be to lower tx_count; that might also help.
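A minimal sketch of that polling shape, using std only. `get_last_id` and `send_keepalive_tx` here are hypothetical stand-ins for the real client calls, and the `u32` state is a toy substitute for actual network state:

```rust
use std::time::{Duration, Instant};

// Hypothetical stand-ins for the real client calls; the toy state just
// counts how many keepalive transactions have been sent.
fn get_last_id(state: &mut u32) -> u32 {
    *state
}
fn send_keepalive_tx(state: &mut u32) {
    *state += 1; // nudges the server into registering a new last_id
}

// Poll until last_id changes, sending a transaction each round so the
// leader has work to process and is forced to mint a new last_id.
fn poll_for_new_last_id(state: &mut u32, timeout: Duration) -> Option<u32> {
    let start_id = get_last_id(state);
    let deadline = Instant::now() + timeout;
    while Instant::now() < deadline {
        send_keepalive_tx(state);
        let id = get_last_id(state);
        if id != start_id {
            return Some(id);
        }
    }
    None // timed out: "Couldn't get new last id!"
}

fn main() {
    let mut state = 0;
    let id = poll_for_new_last_id(&mut state, Duration::from_secs(1));
    println!("new last_id: {:?}", id);
}
```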

mvines commented 6 years ago

I bet that once #718 lands and validators vote once a second regardless, this'll get better

mvines commented 6 years ago

I've not seen this reproduce recently. Opportunistically closing until it reoccurs

mvines commented 6 years ago

This is reproducing on the testnet when CUDA is disabled.

mvines commented 6 years ago

I've had some success reproducing this issue locally, and have discovered I cannot reproduce the issue when I swap this .par_iter() for a normal .iter(): https://github.com/solana-labs/solana/blob/313fed375cd1ff631ac1511cf444cf1aeeda0e81/src/banking_stage.rs#L63

Essentially it looks like the leader's banking_stage is getting stuck while deserializing packets, which causes last_id to stall and eventually time out bench-tps.
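For reference, the shape of the swap looks roughly like this. `Packet` and `deserialize_tx` are simplified stand-ins, not the actual banking_stage code:

```rust
// Simplified stand-in for the banking_stage deserialization loop; the real
// code lives in src/banking_stage.rs and uses rayon's par_iter().
#[derive(Clone)]
struct Packet(Vec<u8>);

// Toy deserializer: pretend the first byte is the transaction payload.
fn deserialize_tx(p: &Packet) -> Option<u64> {
    p.0.first().map(|b| *b as u64)
}

fn main() {
    let packets: Vec<Packet> = (0u8..8).map(|i| Packet(vec![i])).collect();

    // The original code parallelized this map; swapping the parallel
    // iterator for a plain .iter() made the stall stop reproducing:
    let txs: Vec<u64> = packets
        .iter() // was: packets.par_iter()
        .filter_map(deserialize_tx)
        .collect();

    println!("deserialized {} transactions", txs.len());
}
```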

Could this be a rayon bug?

mvines commented 6 years ago

We're on rayon 1.0.0 and 1.0.2 is available, but upgrading to 1.0.2 didn't help.

rob-solana commented 6 years ago

par_iter() was falling over for @pgarg66 when the rayon thread pool was exhausted by the size of the work queue and the work required IO... not sure if the last condition was required to make it fall over, though

mvines commented 6 years ago

Reproduces 100% with:

export RAYON_LOG=1
export RAYON_NUM_THREADS=1

Reproduces 0% (4/4 attempts) with:

export RAYON_LOG=1
export RAYON_NUM_THREADS=128

mvines commented 6 years ago

Using https://raw.githubusercontent.com/rayon-rs/rayon/master/scripts/analyze.sh on a run with RAYON_NUM_THREADS=1:

$ bash analyze.sh log-leader.txt 
Injected jobs: 32
Joins:         312
Popped jobs:   310 = rhs 310 + other 0
Found work:    12
Stole work:    0
Uninjected:    12
Join balance:  2
Inj. balance:  20
Total balance: -22

I also see that FoundWork { worker: 0, yields: 0 } log messages halt after the stuck par_iter().

mvines commented 6 years ago

The other user of .par_iter() is https://github.com/solana-labs/solana/blob/313fed375cd1ff631ac1511cf444cf1aeeda0e81/src/sigverify.rs#L82-L92, which is interesting because it's a nested par_iter(). I bet that's not common...

mvines commented 6 years ago

Removing the outer into_par_iter() prevents the deadlock
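The shape of that fix, sketched with simplified stand-ins (the real code is in src/sigverify.rs; `verify_packet` and the batch layout here are hypothetical, and `std::thread::scope` stands in for the surviving inner parallel iterator):

```rust
use std::thread;

// Hypothetical stand-in for a per-packet signature check.
fn verify_packet(p: &u8) -> u8 {
    p.wrapping_mul(2)
}

fn main() {
    let batches: Vec<Vec<u8>> = vec![vec![1, 2, 3], vec![4, 5]];

    // Before: batches.into_par_iter() wrapping an inner par_iter() -- the
    // nested rayon iterators are what deadlocked. After: the outer loop is
    // sequential and only the inner per-packet work runs in parallel
    // (spawning a thread per packet is wasteful, but keeps the sketch
    // dependency-free).
    let verified: Vec<Vec<u8>> = batches
        .iter() // was: batches.into_par_iter()
        .map(|batch| {
            thread::scope(|s| {
                let handles: Vec<_> = batch
                    .iter()
                    .map(|p| s.spawn(move || verify_packet(p)))
                    .collect();
                handles.into_iter().map(|h| h.join().unwrap()).collect()
            })
        })
        .collect();

    println!("verified: {:?}", verified);
}
```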

mvines commented 6 years ago

> par_iter() was falling over for @pgarg66 when the rayon thread pool was exhausted by the size of the work queue and the work required IO... not sure if the last condition was required to make it fall over, though

@pgarg66, if you recall, were you using the CUDA build when you ran into the rayon deadlocks before? I hope the answer is "No, was not using CUDA"!

mvines commented 6 years ago

Ok so ^ was a wrong turn.

What actually appears to be happening is that bench-tps shovels a truckload of transactions at ed25519_verify_cpu() and then expects the network to advance to a new last_id within 30 seconds. Meanwhile, ed25519_verify_cpu() is still chugging along, taking 60 seconds or more to verify them all. I believe the right fix is to split up the pending unverified transactions into smaller pieces so that the network can continue to advance in parallel with completing the full verification. This needs changes to both sigverify_stage.rs and bench-tps.
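A sketch of that split, with a hypothetical `verify_chunk` standing in for ed25519_verify_cpu(): rather than handing the whole backlog to the verifier in one call, the stage drains and verifies a bounded chunk per iteration, so the rest of the pipeline can make progress between chunks:

```rust
use std::collections::VecDeque;

// Hypothetical stand-in for ed25519_verify_cpu(): verifies one bounded slice.
fn verify_chunk(chunk: &[u8]) -> Vec<bool> {
    chunk.iter().map(|_| true).collect()
}

const MAX_CHUNK: usize = 4;

// Drain at most MAX_CHUNK pending transactions per call, so the caller can
// interleave verification with the rest of the pipeline (e.g. advancing
// last_id) instead of blocking on the entire backlog at once.
fn verify_some(pending: &mut VecDeque<u8>) -> Vec<bool> {
    let n = pending.len().min(MAX_CHUNK);
    let chunk: Vec<u8> = pending.drain(..n).collect();
    verify_chunk(&chunk)
}

fn main() {
    let mut pending: VecDeque<u8> = (0..10).collect();
    let mut verified = 0;
    while !pending.is_empty() {
        verified += verify_some(&mut pending).len();
        // ...between chunks, the network is free to advance to a new last_id.
    }
    println!("verified {} transactions in chunks of {}", verified, MAX_CHUNK);
}
```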

mvines commented 6 years ago

NB: while this is currently only reproducible with the CPU verifier, the bug appears to be architectural, so shoveling enough transactions at a GPU verifier will trigger the same issue.