solana-labs / solana


Ledger purging causes severe stall, leaving uncleaned bloated snapshots even across reboots #14434

Closed · leoluk closed this issue 3 years ago

leoluk commented 3 years ago

Our tds validator Ft5fbkqNa76vnsjYNwjDZUXoTWpP7VYm3mtsaQckQADN got stuck for about 40 minutes, from slot 56925460 to 56925591:

I0105 10:37:19.471441       1 slots.go:116] confirmed slot 56925459 (offset 241203, +3), epoch 144 (from slot 56684256 to 57116256, 190797 remaining)
I0105 10:37:20.467172       1 slots.go:116] confirmed slot 56925460 (offset 241204, +1), epoch 144 (from slot 56684256 to 57116256, 190796 remaining)
I0105 10:37:21.470273       1 slots.go:112] slot has not advanced at 56925460, skipping
I0105 10:37:22.477470       1 slots.go:112] slot has not advanced at 56925460, skipping
I0105 10:37:23.470636       1 slots.go:112] slot has not advanced at 56925460, skipping
I0105 10:37:24.467110       1 slots.go:112] slot has not advanced at 56925460, skipping
I0105 10:37:25.466756       1 slots.go:112] slot has not advanced at 56925460, skipping
[rip]
I0105 11:14:05.472919       1 slots.go:112] slot has not advanced at 56925460, skipping
I0105 11:14:06.466652       1 slots.go:112] slot has not advanced at 56925460, skipping
I0105 11:14:07.466721       1 slots.go:116] confirmed slot 56925591 (offset 241335, +131), epoch 144 (from slot 56684256 to 57116256, 190665 remaining)
I0105 11:14:08.467120       1 slots.go:116] confirmed slot 56925597 (offset 241341, +6), epoch 144 (from slot 56684256 to 57116256, 190659 remaining)
I0105 11:14:09.466760       1 slots.go:116] confirmed slot 56925605 (offset 241349, +8), epoch 144 (from slot 56684256 to 57116256, 190651 remaining)

Validator node log: http://logbin.certus.one:8080/Qui9iehu/solana-github-14434/Ft5fbkqNa76vnsjYNwjDZUXoTWpP7VYm3mtsaQckQADN.log

https://metrics.solana.com:3000/d/monitor/cluster-telemetry?orgId=2&from=1609841617832&to=1609846954674&var-datasource=Solana%20Metrics%20(read-only)&var-testnet=tds&var-hostid=Ft5fbkqNa76vnsjYNwjDZUXoTWpP7VYm3mtsaQckQADN

[screenshot]

And from our on-chain monitoring:

[screenshot]

Looks like something just locked up completely. Suspiciously similar to #14410, except this node isn't on 1.5.0.

leoluk commented 3 years ago

Happened again, on 1.5.1 this time:

https://metrics.solana.com:3000/d/monitor/cluster-telemetry?orgId=2&var-datasource=Solana%20Metrics%20(read-only)&var-testnet=tds&var-hostid=Ft5fbkqNa76vnsjYNwjDZUXoTWpP7VYm3mtsaQckQADN&from=1610138796958&to=1610144082440&refresh=60s

Log: http://logbin.certus.one:8080/Qui9iehu/solana-github-14434/Ft5fbkqNa76vnsjYNwjDZUXoTWpP7VYm3mtsaQckQADN-2.log

leoluk commented 3 years ago

Can confirm it still happens on 1.5.2.

ryoqun commented 3 years ago

> Happened again, on 1.5.1 this time:
>
> https://metrics.solana.com:3000/d/monitor/cluster-telemetry?orgId=2&var-datasource=Solana%20Metrics%20(read-only)&var-testnet=tds&var-hostid=Ft5fbkqNa76vnsjYNwjDZUXoTWpP7VYm3mtsaQckQADN&from=1610138796958&to=1610144082440&refresh=60s
>
> Log: http://logbin.certus.one:8080/Qui9iehu/solana-github-14434/Ft5fbkqNa76vnsjYNwjDZUXoTWpP7VYm3mtsaQckQADN-2.log

@leoluk Thanks for the detailed reporting. OK, I've spotted the root cause:

The problem is that rocksdb's ledger pruning (including compaction) is taking up so much time and so many resources:

This is your validator's metrics with a slightly adjusted timescale (in UTC):

https://metrics.solana.com:3000/d/monitor/cluster-telemetry?orgId=2&var-datasource=Solana%20Metrics%20(read-only)&var-testnet=tds&var-hostid=Ft5fbkqNa76vnsjYNwjDZUXoTWpP7VYm3mtsaQckQADN&from=1610130868735&to=1610152010663&panelId=70

[screenshot]

(My customized metrics view: https://metrics.solana.com:3000/d/V5LPmn_Zk/testnet-monitor-edge-ryoqun?orgId=2&var-datasource=Solana%20Metrics%20(read-only)&var-testnet=tds&var-hostid=Ft5fbkqNa76vnsjYNwjDZUXoTWpP7VYm3mtsaQckQADN&from=1610130868735&to=1610152010663)
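At the RocksDB level, ledger pruning of this kind boils down to a range delete over the dead slots plus a compaction of that key range to actually reclaim the space, and the compaction is the expensive part that shows up in the panels above. Below is a minimal, purely illustrative sketch using the rust-rocksdb crate; the column-family name is a placeholder and the validator's real cleanup path does far more bookkeeping than this.

```rust
use rocksdb::DB;

// Illustrative sketch only: ledger purging expressed directly against the
// RocksDB API. "data_shred" is a placeholder column-family name.
fn purge_slots(db: &DB, from_slot: u64, to_slot: u64) -> Result<(), rocksdb::Error> {
    let cf = db.cf_handle("data_shred").expect("column family must exist");
    let from = from_slot.to_be_bytes();
    let to = to_slot.to_be_bytes();

    // Cheap: writes a range tombstone covering the dead slots.
    db.delete_range_cf(cf, &from, &to)?;

    // Expensive: rewrites SST files so the space is physically reclaimed.
    // This is the background work competing with the rest of the validator.
    db.compact_range_cf(cf, Some(&from), Some(&to));
    Ok(())
}
```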

ryoqun commented 3 years ago

The problem is four-fold, spreading like a cancer:

- STAGE-I: It first causes the system stall, as originally reported here.
- STAGE-II: The stall then stops the AccountsBackgroundService cleaning, which lets a huge accounts dir build up.
- STAGE-III: Under that heavy load, a bloated snapshot ends up being created, including the uncleaned, unshrunken AppendVecs.
- STAGE-IV: Finally, this leaves some validators stuck unless they remove their ledger dir altogether, because merely restarting recreates a 300 GB accounts dir from the bloated snapshot even if they rm -rf accounts manually.

ryoqun commented 3 years ago

For STAGE-II (times are UTC):

https://metrics.solana.com:3000/d/V5LPmn_Zk/testnet-monitor-edge-ryoqun?panelId=72&orgId=2&var-datasource=Solana%20Metrics%20(read-only)&var-testnet=mainnet-beta&var-hostid=7cVfgArCheMR6Cs4t6vz5rfnqd56vZq4ndaBrY5xkxXy&from=1609899893632&to=1609909676181

[screenshot]

https://metrics.solana.com:3000/d/monitor-edge/cluster-telemetry-edge?orgId=2&from=1609899893632&to=1609909676181&var-datasource=Solana%20Metrics%20(read-only)&var-testnet=mainnet-beta&var-hostid=7cVfgArCheMR6Cs4t6vz5rfnqd56vZq4ndaBrY5xkxXy

[screenshot]

ryoqun commented 3 years ago

For STAGE-III:

[screenshot]

https://metrics.solana.com:3000/d/V5LPmn_Zk/testnet-monitor-edge-ryoqun?panelId=72&orgId=2&var-datasource=Solana%20Metrics%20(read-only)&var-testnet=mainnet-beta&var-hostid=7cVfgArCheMR6Cs4t6vz5rfnqd56vZq4ndaBrY5xkxXy&from=1609899893632&to=1609909676181

brianlong commented 3 years ago

@ryoqun What log level is required to send the "Ledger Pruning" data to the metrics server? I am currently using 'info', but I don't see that data on https://metrics.solana.com:3000/d/monitor-edge/cluster-telemetry-edge?refresh=60s&orgId=2&var-datasource=Solana%20Metrics%20(read-only)&var-testnet=mainnet-beta&var-hostid=DDnAqxJVFo2GVTujibHt5cjevHMSE9bo8HJaydHoshdp

ryoqun commented 3 years ago

It looks like our node also experienced this recently:

https://metrics.solana.com:3000/d/V5LPmn_Zk/testnet-monitor-edge-ryoqun?orgId=2&from=1610120089093&to=1610153930819&var-datasource=Solana%20Metrics%20(read-only)&var-testnet=mainnet-beta&var-hostid=ba1cUvbuGe7k2Um2YKmRnR7ZAy3sK9mQKJkMCGVhcov

https://metrics.solana.com:3000/d/monitor-edge/cluster-telemetry-edge?orgId=2&from=1610128549525&to=1610145470388&panelId=70&var-datasource=Solana%20Metrics%20(read-only)&var-testnet=mainnet-beta&var-hostid=ba1cUvbuGe7k2Um2YKmRnR7ZAy3sK9mQKJkMCGVhcov

[screenshots]

ryoqun commented 3 years ago

> @ryoqun What log level is required to send the "Ledger Pruning" data to the metrics server? I am currently using 'info', but I don't see that data on https://metrics.solana.com:3000/d/monitor-edge/cluster-telemetry-edge?refresh=60s&orgId=2&var-datasource=Solana%20Metrics%20(read-only)&var-testnet=mainnet-beta&var-hostid=DDnAqxJVFo2GVTujibHt5cjevHMSE9bo8HJaydHoshdp

@brianlong hehe, you have a very good antenna. :) Well, INFO should be enough. Or are you specifying --no-rocksdb-compaction? In that case, compaction won't happen.
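For reference, disabling automatic compaction at the library level looks roughly like the following. This is only a sketch with the rust-rocksdb crate showing the effect an operator opts into, not necessarily how the validator wires --no-rocksdb-compaction through internally.

```rust
use rocksdb::{DB, Options};

// Sketch: open a RocksDB instance with background auto-compaction disabled.
// The trade-off is that the database keeps growing until it is compacted by
// some other means.
fn open_without_auto_compaction(path: &str) -> Result<DB, rocksdb::Error> {
    let mut opts = Options::default();
    opts.create_if_missing(true);
    opts.set_disable_auto_compactions(true);
    DB::open(&opts, path)
}
```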

ryoqun commented 3 years ago

@carllin @sakridge: ^^ I know the reason now. But do you have any idea for a quick fix? Disable compaction?

CC: @t-nelson (this is presumably the reason we're sometimes observing a 300 GB accounts dir.)

ryoqun commented 3 years ago

What's bothering me is why mere rocksdb compaction or ledger purging causes such severe, system-wide trouble. Our ba1 node's accounts dir is on tmpfs, which is a different filesystem from the one rocksdb resides on...

But the time correlation is so strong.

brianlong commented 3 years ago

@ryoqun I don't have enough storage for --no-rocksdb-compaction ;-)

This may be unrelated, but I did slow down the snapshot frequency with --snapshot-interval-slots 500 and that helped performance in general.

ryoqun commented 3 years ago

> @ryoqun I don't have enough storage for --no-rocksdb-compaction ;-)

I see. Hmm, then I don't have an immediate clue why there are no metrics for you. Maybe your rocksdb is still well under --limit-ledger-size?

> This may be unrelated, but I did slow down the snapshot frequency with --snapshot-interval-slots 500 and that helped performance in general.

Yeah, increasing the interval should improve performance. I don't think the increased interval relates to this bug's frequency, though. (I know, I still have to work on your nice proposal #14216; so many fires...)

brianlong commented 3 years ago

> I see. Hmm, then I don't have an immediate clue why there are no metrics for you. Maybe your rocksdb is still well under --limit-ledger-size?

That could be the case -- I recently had to trash my old ledger after seeing the same behavior noted earlier in this issue. I'll keep an eye on it and report back.

sakridge commented 3 years ago

@ryoqun This issue was originally reported on testnet; we are not seeing these huge accounts directories there.

Why does it stall the cleaning service? Is the system out of IOPS?

sakridge commented 3 years ago

This is interesting though: https://github.com/facebook/rocksdb/wiki/RocksDB-Basics#avoiding-stalls

sakridge commented 3 years ago

https://docs.rs/rocksdb/0.15.0/rocksdb/struct.Env.html#method.set_high_priority_background_threads
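That Env method lets the embedder give flushes their own high-priority background pool, so a long compaction is less likely to stall them. A minimal sketch of how it could be wired up with the rust-rocksdb crate; the thread counts are arbitrary and this is not the validator's actual configuration.

```rust
use rocksdb::{DB, Env, Options};

// Sketch: dedicated low- and high-priority background thread pools so that
// compactions (low priority) do not starve memtable flushes (high priority),
// one of the stall conditions described in the RocksDB wiki linked above.
fn open_with_background_threads(path: &str) -> Result<DB, rocksdb::Error> {
    let mut env = Env::default()?;
    env.set_background_threads(4);               // low-priority pool (compactions)
    env.set_high_priority_background_threads(2); // high-priority pool (flushes)

    let mut opts = Options::default();
    opts.create_if_missing(true);
    opts.set_env(&env);
    opts.increase_parallelism(4); // also raises the background job limit

    DB::open(&opts, path)
}
```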

ryoqun commented 3 years ago

> Why does it stall the cleaning service? Is the system out of IOPS?

Yeah, this is the odd part; I'm guessing it's out of IOPS or something like that as well. Or, as your priority PR suggests, ABS could be stalled by the already stalled rocksdb, possibly via the banking/replay stage? (Because this issue seems to stall everything.)

ryoqun commented 3 years ago

> ... we are not seeing these huge accounts directories there.

I think this is dependent on the hardware configuration, or maybe a difference in transaction patterns?

@sakridge Oh, note that tds's AppendVecs are usually completely reclaimed, unlike mainnet, because there is no Kin migration there. If there are no alive accounts at all for a given recent slot, the validator doesn't need shrinking to prevent accounts dir bloat. Hard system interactions. xD
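To unpack the reclaim-versus-shrink distinction, here is a hypothetical sketch (none of these types exist in the codebase) of the per-storage decision the cleaning logic has to make; skipping the expensive Shrink branch under load is what lets the accounts dir bloat.

```rust
// Hypothetical illustration of reclaim vs. shrink for one account storage
// file (AppendVec). All names here are made up for the example.
struct StorageEntry {
    alive_accounts: usize, // account versions still referenced
    dead_accounts: usize,  // outdated versions marked dead by cleaning
}

#[derive(Debug)]
enum CleanAction {
    DropWholeFile, // cheap: everything is dead (the typical tds case)
    Shrink,        // expensive: rewrite the alive accounts into a smaller file
    Keep,          // mostly alive, not worth touching
}

fn decide(entry: &StorageEntry) -> CleanAction {
    if entry.alive_accounts == 0 {
        CleanAction::DropWholeFile
    } else if entry.dead_accounts > entry.alive_accounts {
        CleanAction::Shrink
    } else {
        CleanAction::Keep
    }
}

fn main() {
    // tds-like storage: fully dead, can simply be deleted.
    let tds_like = StorageEntry { alive_accounts: 0, dead_accounts: 1_000 };
    // mainnet-like storage: a few long-lived accounts keep the file alive,
    // so only shrinking reclaims the dead space.
    let mainnet_like = StorageEntry { alive_accounts: 10, dead_accounts: 990 };
    println!("{:?} / {:?}", decide(&tds_like), decide(&mainnet_like));
}
```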

leoluk commented 3 years ago

> Why does it stall the cleaning service? Is the system out of IOPS?
>
> Yeah, this is the odd part; I'm guessing it's out of IOPS or something like that as well. Or, as your priority PR suggests, ABS could be stalled by the already stalled rocksdb, possibly via the banking/replay stage? (Because this issue seems to stall everything.)

In our case, disk wasn't a bottleneck and iowait remained close to zero.

[screenshot]

ryoqun commented 3 years ago

@leoluk Thanks for the report; then I think this stall could first be caused internally and artificially by rocksdb, after which our AccountsDB's various locks start acting up.

sakridge commented 3 years ago

> Why does it stall the cleaning service? Is the system out of IOPS?
>
> Yeah, this is the odd part; I'm guessing it's out of IOPS or something like that as well. Or, as your priority PR suggests, ABS could be stalled by the already stalled rocksdb, possibly via the banking/replay stage? (Because this issue seems to stall everything.)

Everything was stalled in this case because the broadcast insert_shred was holding the blockstore lock the whole time, so new shreds could not be inserted into the blockstore. Accounts had nothing to do if there were no new shreds to consume. So I don't think it was actually stalled; it just had nothing to clean.
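A minimal, self-contained sketch of that failure mode using generic std primitives (not the actual Blockstore types): one thread holding the shared lock across a slow insert leaves every other inserter parked, so downstream services are idle rather than broken.

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

fn main() {
    // Stand-in for the blockstore protected by a single insert lock.
    let blockstore = Arc::new(Mutex::new(Vec::<u64>::new()));

    let broadcast = {
        let blockstore = Arc::clone(&blockstore);
        thread::spawn(move || {
            let mut guard = blockstore.lock().unwrap();
            // Pretend this insert hits pathological RocksDB behavior and takes
            // a very long time while still holding the lock.
            thread::sleep(Duration::from_secs(5));
            guard.push(1);
        })
    };

    // Give the broadcast thread a head start so it grabs the lock first.
    thread::sleep(Duration::from_millis(100));

    let replay = {
        let blockstore = Arc::clone(&blockstore);
        thread::spawn(move || {
            // Parks here until broadcast releases the lock: no new shreds can
            // land, so everything downstream simply has nothing to process.
            blockstore.lock().unwrap().push(2);
        })
    };

    broadcast.join().unwrap();
    replay.join().unwrap();
    println!("inserted: {:?}", *blockstore.lock().unwrap());
}
```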

ryoqun commented 3 years ago

> Log: http://logbin.certus.one:8080/Qui9iehu/solana-github-14434/Ft5fbkqNa76vnsjYNwjDZUXoTWpP7VYm3mtsaQckQADN-2.log

I looked at this:

[2021-01-08T21:50:14.438905685Z INFO  solana_metrics::metrics] datapoint: replay-loop-timing-stats total_elapsed_us=2413089000i compute_bank_stats_elapsed=6052i select_vote_and_reset_forks_elapsed=12i start_leader_elapsed=10894i reset_bank_elapsed=3534i voting_elapsed=2412471272i select_forks_elapsed=1i compute_slot_stats_elapsed=12i generate_new_bank_forks_elapsed=1486i replay_active_banks_elapsed=9319i re>

[2021-01-08T21:50:14.591364336Z INFO  solana_metrics::metrics] datapoint: shred_insert_is_full total_time_ms=2413407i slot=57486550i last_index=396i

Looks bad? @sakridge @carllin

sakridge commented 3 years ago

> Log: http://logbin.certus.one:8080/Qui9iehu/solana-github-14434/Ft5fbkqNa76vnsjYNwjDZUXoTWpP7VYm3mtsaQckQADN-2.log
>
> I looked at this:
>
> [2021-01-08T21:50:14.438905685Z INFO  solana_metrics::metrics] datapoint: replay-loop-timing-stats total_elapsed_us=2413089000i compute_bank_stats_elapsed=6052i select_vote_and_reset_forks_elapsed=12i start_leader_elapsed=10894i reset_bank_elapsed=3534i voting_elapsed=2412471272i select_forks_elapsed=1i compute_slot_stats_elapsed=12i generate_new_bank_forks_elapsed=1486i replay_active_banks_elapsed=9319i re>
>
> [2021-01-08T21:50:14.591364336Z INFO  solana_metrics::metrics] datapoint: shred_insert_is_full total_time_ms=2413407i slot=57486550i last_index=396i
>
> Looks bad? @sakridge @carllin

Yes. The issue seems to be that the broadcast shred insert got stuck for a long while, which then blocked the replay insert by way of the blockstore insert lock, preventing the node from processing any new blocks. I think we should continue the discussion on the open issue #14586.