Happened again, on 1.5.1 this time:
Can confirm it still happens on 1.5.2.
> Happened again, on 1.5.1 this time:
@leoluk Thanks for the detailed reporting. OK, I've spotted the root cause:
The problem is that rocksdb's ledger pruning (including compaction) is taking so much time and so many resources:
These are your validator's metrics with a slightly adjusted timescale (in UTC):
(My customized metrics view: https://metrics.solana.com:3000/d/V5LPmn_Zk/testnet-monitor-edge-ryoqun?orgId=2&var-datasource=Solana%20Metrics%20(read-only)&var-testnet=tds&var-hostid=Ft5fbkqNa76vnsjYNwjDZUXoTWpP7VYm3mtsaQckQADN&from=1610130868735&to=1610152010663)
The problem is four-fold, like a cancer:
STAGE-I: It first causes a system stall, as originally reported here.
STAGE-II: Then it stops the AccountBackgroundCleaning service, creating a huge accounts dir.
STAGE-III: Then a bloated snapshot may be created under such high load, including (uncleaned) unshrunken appendvecs.
STAGE-IV: Finally, this leaves some validators stuck unless they remove their ledger dir altogether, because just restarting recreates a 300G accounts dir even if they rm -rf accounts manually...
For STAGE-II (time is UTC):
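To make STAGE-II more concrete, here is a minimal, self-contained sketch of the shape of the problem; the names, channel layout, and timings below are purely illustrative and are not the actual AccountsBackgroundService code. The point is only that a background cleaning loop makes progress solely when the rest of the pipeline feeds it new roots, so a stalled pipeline also means no cleaning and no shrinking.

```rust
// Illustrative sketch (not solana-runtime code): a background cleaner that
// only reclaims storage when new roots arrive. If the pipeline feeding it
// stalls (STAGE-I), no cleaning or shrinking happens in the meantime.
use std::sync::mpsc::{channel, Receiver, RecvTimeoutError, Sender};
use std::thread;
use std::time::Duration;

fn accounts_background_service(roots: Receiver<u64>) {
    loop {
        match roots.recv_timeout(Duration::from_millis(100)) {
            Ok(root) => {
                // Clean and shrink account storage up to `root` (elided here).
                println!("cleaning up to root {}", root);
            }
            Err(RecvTimeoutError::Timeout) => {
                // No new roots arrived: the cleaner idles and nothing is reclaimed.
                println!("no new roots; nothing cleaned");
            }
            Err(RecvTimeoutError::Disconnected) => break,
        }
    }
}

fn main() {
    let (tx, rx): (Sender<u64>, Receiver<u64>) = channel();
    let abs = thread::spawn(move || accounts_background_service(rx));
    tx.send(1).unwrap();                       // one root gets cleaned
    thread::sleep(Duration::from_millis(500)); // simulate a stalled pipeline
    drop(tx);                                  // shut the sketch down
    abs.join().unwrap();
}
```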
@ryoqun What log level is required to send the "Ledger Pruning" data to the metrics server? I am currently using 'info', but I don't see that data on https://metrics.solana.com:3000/d/monitor-edge/cluster-telemetry-edge?refresh=60s&orgId=2&var-datasource=Solana%20Metrics%20(read-only)&var-testnet=mainnet-beta&var-hostid=DDnAqxJVFo2GVTujibHt5cjevHMSE9bo8HJaydHoshdp
it looks like our node also experienced this recently:
> @ryoqun What log level is required to send the "Ledger Pruning" data to the metrics server? I am currently using 'info', but I don't see that data on https://metrics.solana.com:3000/d/monitor-edge/cluster-telemetry-edge?refresh=60s&orgId=2&var-datasource=Solana%20Metrics%20(read-only)&var-testnet=mainnet-beta&var-hostid=DDnAqxJVFo2GVTujibHt5cjevHMSE9bo8HJaydHoshdp
@brianlong hehe, you have a very good antenna. :) Well, `INFO` should be enough. Or are you specifying `--no-rocksdb-compaction`? In that case, compaction won't happen.
@carllin @sakridge: ^^ I know the reason now. But do you have any idea for a quick fix? Disable compaction?
CC: @t-nelson (this is supposedly the reason we're sometimes observing a 300G accounts dir.)
What's bothering me is why mere rocksdb compaction or ledger purging is hurting the system this badly. Our ba1's accounts dir is on tmpfs, which is a different filesystem from the one rocksdb resides on...
But the time correlation is so strong.
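For reference, this is roughly what "disable compaction" would mean at the code level. The sketch below is not the actual solana-ledger purge path; the function, key encoding, and flag handling are made up for illustration. It only shows that skipping the manual `compact_range` call (what a flag like `--no-rocksdb-compaction` gates) avoids the expensive, I/O-heavy step, at the cost of leaving deletion tombstones to rocksdb's automatic background compaction.

```rust
// Hypothetical purge path using the `rocksdb` crate; not the real Blockstore code.
use rocksdb::{Options, DB};

fn purge_and_maybe_compact(db: &DB, from_slot: u64, to_slot: u64, no_compaction: bool) {
    // Illustrative big-endian slot keys; the real column-family layout differs.
    let start = from_slot.to_be_bytes();
    let end = to_slot.to_be_bytes();

    // ... delete the purged slot range here (elided) ...

    if no_compaction {
        // Equivalent in spirit to --no-rocksdb-compaction: rely on rocksdb's
        // automatic compaction instead of forcing one right after the purge.
        return;
    }

    // The manual compaction over the purged range: this is the expensive,
    // I/O-heavy step suspected of stalling everything else.
    db.compact_range(Some(&start[..]), Some(&end[..]));
}

fn main() {
    let mut opts = Options::default();
    opts.create_if_missing(true);
    let db = DB::open(&opts, "/tmp/example-rocksdb").expect("open rocksdb");
    purge_and_maybe_compact(&db, 100, 200, false);
}
```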
@ryoqun I don't have enough storage for `--no-rocksdb-compaction` ;-)
This may be unrelated, but I did slow down the snapshot frequency with `--snapshot-interval-slots 500` and that helped performance in general.
> @ryoqun I don't have enough storage for `--no-rocksdb-compaction` ;-)
I see. Hmm, then I don't have an immediate clue as to why there are no metrics for you. Maybe your rocksdb is well under `--limit-ledger-size`?
> This may be unrelated, but I did slow down the snapshot frequency with `--snapshot-interval-slots 500` and that helped performance in general.
Yeah, increasing the interval should improve performance. I don't think the increased interval relates to this bug's frequency, though. (I know, I have to work on your nice proposal #14216; so many fires...)
> I see. Hmm, then I don't have an immediate clue as to why there are no metrics for you. Maybe your rocksdb is well under `--limit-ledger-size`?
That could be the case -- I recently had to trash my old ledger after seeing the same behavior noted earlier in this issue. I'll keep an eye on it and report back.
@ryoqun This issue is originally from testnet; we are not seeing these huge accounts directories there.
Why does it stall the cleaning service? Is the system out of IOPS?
This is interesting though: https://github.com/facebook/rocksdb/wiki/RocksDB-Basics#avoiding-stalls
> Why does it stall the cleaning service? Is the system out of IOPS?

Yeah, this is the odd part; I'm guessing it's out of IOPS or something like that as well. Or, as your priority PR suggests, ABS could be stalled by the already-stalled rocksdb, possibly via the banking/replay stage? (Because this issue seems to stall everything.)
> .. we are not seeing these huge accounts directories there.

I think this is hardware-configuration dependent, or a transaction pattern difference?
@sakridge oh, note that tds's appendvecs are usually completely reclaimed, unlike mainnet's. That's because there is no kin migration there. If there is no alive account at all for a given recent slot, the validator doesn't need shrinking to prevent accounts dir bloat. Hard system interactions. xD
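To spell out the distinction being drawn (with made-up types; this is not the actual AccountsDB code): a storage file with zero live accounts can simply be dropped, which is the common case on tds, whereas a file with a handful of survivors needs the more expensive shrink step that copies the live accounts into a smaller file, which is what mainnet needs to avoid accounts dir bloat.

```rust
// Hypothetical drop-vs-shrink decision; `Storage` and the 10% threshold are
// illustrative, not solana-runtime types or tuning values.
#[derive(Debug)]
enum CleanAction {
    Drop,       // no live accounts: reclaim the whole file (typical on tds)
    Shrink,     // few live accounts: copy survivors into a smaller file
    LeaveAlone, // mostly live: not worth rewriting
}

struct Storage {
    live_accounts: usize,  // accounts still referenced by the index
    total_accounts: usize, // everything ever appended to this file
}

fn decide(storage: &Storage) -> CleanAction {
    if storage.live_accounts == 0 {
        CleanAction::Drop
    } else if storage.live_accounts * 10 < storage.total_accounts {
        CleanAction::Shrink
    } else {
        CleanAction::LeaveAlone
    }
}

fn main() {
    let tds_like = Storage { live_accounts: 0, total_accounts: 5_000 };
    let mainnet_like = Storage { live_accounts: 120, total_accounts: 5_000 };
    println!("tds-like: {:?}, mainnet-like: {:?}", decide(&tds_like), decide(&mainnet_like));
}
```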
> Why does it stall the cleaning service? Is the system out of IOPS?
>
> Yeah, this is the odd part; I'm guessing it's out of IOPS or something like that as well. Or, as your priority PR suggests, ABS could be stalled by the already-stalled rocksdb, possibly via the banking/replay stage? (Because this issue seems to stall everything.)
In our case, disk wasn't a bottleneck and iowait remained close to zero.
@leoluk thanks for the report; then I think this stall could first be caused internally by rocksdb itself, and then our various AccountsDB locks start acting up.
> Why does it stall the cleaning service? Is the system out of IOPS?
>
> Yeah, this is the odd part; I'm guessing it's out of IOPS or something like that as well. Or, as your priority PR suggests, ABS could be stalled by the already-stalled rocksdb, possibly via the banking/replay stage? (Because this issue seems to stall everything.)

Everything was stalled in this case because the broadcast `insert_shred` was holding the blockstore lock the whole time, so new shreds could not be inserted into the blockstore. Accounts cleaning had nothing to do if there were no new shreds to consume. So I don't think it was actually stalled; it just had nothing to clean.
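A minimal, self-contained sketch of the failure mode described above; `Blockstore` here is a stand-in struct and the timings are invented, so this is not the real solana-ledger API. One thread holds the shared lock for a very long insert, the other insert path blocks on the same lock, and everything downstream simply has nothing to consume.

```rust
// Illustrative lock-contention sketch: a long-held lock starves another insert path.
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::{Duration, Instant};

struct Blockstore {
    shreds: Vec<u64>, // stand-in for real shred storage
}

fn main() {
    let blockstore = Arc::new(Mutex::new(Blockstore { shreds: Vec::new() }));

    // "Broadcast" path: grabs the lock and holds it for a long time,
    // simulating a slow insert (e.g. one stuck behind rocksdb compaction).
    let b = Arc::clone(&blockstore);
    let broadcast = thread::spawn(move || {
        let mut guard = b.lock().unwrap();
        thread::sleep(Duration::from_secs(3)); // long insert while holding the lock
        guard.shreds.push(1);
    });

    thread::sleep(Duration::from_millis(100)); // let broadcast win the lock first

    // "Replay" path: wants to insert new shreds but must wait for the lock.
    let r = Arc::clone(&blockstore);
    let replay = thread::spawn(move || {
        let start = Instant::now();
        let mut guard = r.lock().unwrap(); // blocks until broadcast releases
        guard.shreds.push(2);
        println!("replay insert waited {:?}", start.elapsed()); // roughly 3s
    });

    broadcast.join().unwrap();
    replay.join().unwrap();
    // With no new shreds landing during that window, downstream consumers
    // (replay, accounts cleaning) are idle rather than themselves stalled.
}
```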
I looked at this:
```
[2021-01-08T21:50:14.438905685Z INFO solana_metrics::metrics] datapoint: replay-loop-timing-stats total_elapsed_us=2413089000i compute_bank_stats_elapsed=6052i select_vote_and_reset_forks_elapsed=12i start_leader_elapsed=10894i reset_bank_elapsed=3534i voting_elapsed=2412471272i select_forks_elapsed=1i compute_slot_stats_elapsed=12i generate_new_bank_forks_elapsed=1486i replay_active_banks_elapsed=9319i re>
[2021-01-08T21:50:14.591364336Z INFO solana_metrics::metrics] datapoint: shred_insert_is_full total_time_ms=2413407i slot=57486550i last_index=396i
```
Looks bad? @sakridge @carllin
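For a sense of scale, both datapoints work out to roughly 40 minutes spent in a single iteration:

```rust
// Quick arithmetic on the datapoint magnitudes quoted above.
fn main() {
    let total_elapsed_us: u64 = 2_413_089_000; // replay-loop-timing-stats
    let total_time_ms: u64 = 2_413_407;        // shred_insert_is_full
    println!("replay loop: ~{} min", total_elapsed_us / 1_000_000 / 60); // ~40 min
    println!("shred insert full: ~{} min", total_time_ms / 1_000 / 60);  // ~40 min
}
```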
> I looked at this:
>
> [2021-01-08T21:50:14.438905685Z INFO solana_metrics::metrics] datapoint: replay-loop-timing-stats total_elapsed_us=2413089000i compute_bank_stats_elapsed=6052i select_vote_and_reset_forks_elapsed=12i start_leader_elapsed=10894i reset_bank_elapsed=3534i voting_elapsed=2412471272i select_forks_elapsed=1i compute_slot_stats_elapsed=12i generate_new_bank_forks_elapsed=1486i replay_active_banks_elapsed=9319i re>
> [2021-01-08T21:50:14.591364336Z INFO solana_metrics::metrics] datapoint: shred_insert_is_full total_time_ms=2413407i slot=57486550i last_index=396i
>
> Looks bad? @sakridge @carllin
Yes. The issue seems to be that the broadcast shred insert was stuck for a long while, which then blocked the replay insert by way of the blockstore insert lock, preventing the node from processing any new blocks. I think we should continue the discussion on the open issue #14586.
Our tds validator Ft5fbkqNa76vnsjYNwjDZUXoTWpP7VYm3mtsaQckQADN got stuck for about 40 minutes, from slot 56925460 to 56925591:
Validator node log: http://logbin.certus.one:8080/Qui9iehu/solana-github-14434/Ft5fbkqNa76vnsjYNwjDZUXoTWpP7VYm3mtsaQckQADN.log
https://metrics.solana.com:3000/d/monitor/cluster-telemetry?orgId=2&from=1609841617832&to=1609846954674&var-datasource=Solana%20Metrics%20(read-only)&var-testnet=tds&var-hostid=Ft5fbkqNa76vnsjYNwjDZUXoTWpP7VYm3mtsaQckQADN
And from our on-chain monitoring:
Looks like something just locked up completely? Suspiciously similar to #14410, except it's not 1.5.0.