solana-labs / solana

Web-Scale Blockchain for fast, secure, scalable, decentralized apps and marketplaces.
https://solanalabs.com
Apache License 2.0

Actual ledger disk utilization grows past expected utilization #23248

Closed: michaelh-laine closed this issue 1 year ago

michaelh-laine commented 2 years ago

Problem

Last night starting at approximately 22:00 GMT (on 19 Feb 2022) our validator experienced a 100% full ledger disk using 1.8.16 (validator 1aine15iEqZxYySNwcHtQFt4Sgc75cbEi9wks8YgNCa).

I don't have usage data immediately prior to the event, but the validator had been running 1.8.14 for a few weeks without restart or issue and never had a disk space problem previously. The ledger is on its own partition with 564GB, and the --limit-ledger-size flag is set. The validator was upgraded on Friday the 18th of February at around 10:00 GMT. Metrics are being reported. Below is a screenshot somewhat indicative of the issue: the validator restarted several times, pruning some space and then crashing again, until the longer downtime, which was a full restart after clearing the ledger directory and downloading a new snapshot:

image

I don't have ledger disk metrics over time, so I don't know what the previous usage was. Perhaps it's just a coincidence and ledger usage has been creeping up in general, and 564GB is now insufficient with --limit-ledger-size?

2022.07.17 (steviez): The message below captures the problem: https://github.com/solana-labs/solana/issues/23248#issuecomment-1163785360

michaelh-laine commented 2 years ago

Logs from the first crash, pretty much as expected:

[2022-02-20T00:30:43.593504136Z ERROR solana_core::window_service] blockstore error: blockstore error
[2022-02-20T00:30:43.593508524Z ERROR solana_metrics::counter] COUNTER:{"name": "solana-window-insert-error", "counts": 3970, "samples": 3969,  "now": 1645317043593, "events": 1}
[2022-02-20T00:30:43.593512332Z ERROR solana_core::window_service] thread Some("solana-window-insert") error Blockstore(RocksDb(Error { message: "IO error: No space left on deviceWhile appending to file: /mnt/ledger/rocksdb/1017857.log: No space left on device" }))
[2022-02-20T00:30:43.593550614Z ERROR solana_core::window_service] blockstore error: blockstore error
[2022-02-20T00:30:43.593553339Z ERROR solana_metrics::counter] COUNTER:{"name": "solana-window-insert-error", "counts": 3971, "samples": 3970,  "now": 1645317043593, "events": 1}
[2022-02-20T00:30:43.593557607Z ERROR solana_core::window_service] thread Some("solana-window-insert") error Blockstore(RocksDb(Error { message: "IO error: No space left on deviceWhile appending to file: /mnt/ledger/rocksdb/1017857.log: No space left on device" }))
[2022-02-20T00:30:43.593688404Z ERROR solana_core::window_service] blockstore error: blockstore error
[2022-02-20T00:30:43.593691100Z ERROR solana_metrics::counter] COUNTER:{"name": "solana-window-insert-error", "counts": 3972, "samples": 3971,  "now": 1645317043593, "events": 1}
[2022-02-20T00:30:43.593693855Z ERROR solana_core::window_service] thread Some("solana-window-insert") error Blockstore(RocksDb(Error { message: "IO error: No space left on deviceWhile appending to file: /mnt/ledger/rocksdb/1017857.log: No space left on device" }))
thread 'solana-cost-update-service' panicked at 'persist program costs to blockstore: RocksDb(Error { message: "IO error: No space left on deviceWhile appending to file: /mnt/ledger/rocksdb/1017857.log: No space left on device" })', core/src/cost_update_service.rs:218:18
stack backtrace:
[2022-02-20T00:30:43.593962262Z ERROR solana_core::window_service] blockstore error: blockstore error
[2022-02-20T00:30:43.593967001Z ERROR solana_metrics::counter] COUNTER:{"name": "solana-window-insert-error", "counts": 3973, "samples": 3972,  "now": 1645317043593, "events": 1}
[2022-02-20T00:30:43.593970718Z ERROR solana_core::window_service] thread Some("solana-window-insert") error Blockstore(RocksDb(Error { message: "IO error: No space left on deviceWhile appending to file: /mnt/ledger/rocksdb/1017857.log: No space left on device" }))
[2022-02-20T00:30:43.594035751Z ERROR solana_core::window_service] blockstore error: blockstore error
--
[2022-02-20T00:30:43.618785693Z ERROR solana_core::window_service] blockstore error: blockstore error
[2022-02-20T00:30:43.618796634Z ERROR solana_metrics::counter] COUNTER:{"name": "solana-window-insert-error", "counts": 4054, "samples": 4053,  "now": 1645317043618, "events": 1}
[2022-02-20T00:30:43.618804749Z ERROR solana_core::window_service] thread Some("solana-window-insert") error Blockstore(RocksDb(Error { message: "IO error: No space left on deviceWhile appending to file: /mnt/ledger/rocksdb/1017857.log: No space left on device" }))
steviez commented 2 years ago

Thanks for reporting this; it would have been interesting to poke around the contents of the ledger, but it's understandable that you needed to wipe and move on. I don't really have any ideas off the bat, but here are some quick questions:

Ledger is on its own partition with 564GB

  1. Are snapshots also being stored on this partition, or are those somewhere else with the --snapshot flag?

Ledger is on its own partition with 564GB, --limit-ledger-size flag is set

  2. To confirm, you're using the flag without an argument (which will use the default of 200_000_000)?

  3. No validator arguments were added or removed when you bumped to v1.8.16, right?

  4. Have you been running ledger-tool at all while the validator was running? Some of the commands may create "fallback" directories to keep ledger-tool from conflicting with the validator. I'm actually not sure offhand whether these get cleaned up or what kind of footprint they would have on disk; I'll double check later.

michaelh-laine commented 2 years ago
  1. Snapshots are on a different (OS) partition; only the ledger (excl. accounts and snapshots) is on that partition
  2. Correct
  3. No change in the start script
  4. No, definitely not in the past few weeks; I might have a month or so ago, but I don't recall now. However, disk space wasn't an issue before this weekend
steviez commented 2 years ago

Got it on all of the questions; I figured this was the case, but I definitely like to double check the easy things first.

As I'm sure you're aware, snapshot size has been creeping up, which is why I asked about those, but that's obviously not the culprit here.

Poking around your metrics, there was a spike in the number of shreds inserted (or at least data shreds) around your restart time:

image

This coincides with an increase in reported disk usage; the two traces are disk usage before and after LedgerCleanupService::cleanup_ledger() runs. Given the current implementation, these two values will counterintuitively be pretty similar (more on that later).

image

The ledger cleanup service works by checking the number of shreds currently in the blockstore and purging enough to get under that 200M (or whatever is configured) threshold. Those shreds are no longer visible in the blockstore, but they are still on disk until a RocksDB background process performs compaction (which is when the disk space gets reclaimed). Compaction is tricky because it can be I/O intensive; in an attempt to avoid intense bursts of I/O, we configure the background compactions to run after an individual file hits a certain age (more details here).
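To make the ordering concrete, below is a minimal sketch of the threshold logic described above. This is not the actual LedgerCleanupService code; the types and numbers are hypothetical, and the real service works against the blockstore's slot metadata rather than an in-memory list.

struct SlotShreds {
    slot: u64,
    data_shreds: u64,
}

// Walk slots from oldest to newest and return the first slot to keep so that
// the total shred count drops back under the configured limit. Everything
// older than the returned slot is purged from the blockstore immediately,
// but the bytes are only reclaimed later, when RocksDB background compaction
// processes the affected files.
fn lowest_slot_to_keep(slots: &[SlotShreds], max_ledger_shreds: u64) -> Option<u64> {
    let total: u64 = slots.iter().map(|s| s.data_shreds).sum();
    if total <= max_ledger_shreds {
        return None; // under the limit, nothing to purge
    }
    let mut excess = total - max_ledger_shreds;
    for s in slots {
        if excess == 0 {
            return Some(s.slot);
        }
        excess = excess.saturating_sub(s.data_shreds);
    }
    slots.last().map(|s| s.slot)
}

fn main() {
    let slots = vec![
        SlotShreds { slot: 100, data_shreds: 1_500 },
        SlotShreds { slot: 101, data_shreds: 2_000 },
        SlotShreds { slot: 102, data_shreds: 1_800 },
    ];
    // With a limit of 4_000 shreds, everything below slot 101 (i.e. slot 100)
    // is marked for purge.
    println!("{:?}", lowest_slot_to_keep(&slots, 4_000)); // prints: Some(101)
}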

So, my understanding is that your ledger was at a nice steady state where the insertion / cleanup rates were essentially equal (that nice flat period of ledger utilization prior to your upgrade). Then, after your upgrade, the higher insertion rate outweighed the cleanup rate; more specifically, the ledger was getting stuffed with relatively "young" data, so even where our ledger cleanup service had purged chunks, the RocksDB background compaction process wouldn't yet have reclaimed the space.

We have some more sophisticated changes in the works, but those are definitely not in the v1.8 timeframe and likely not even the v1.9 timeframe. I might dig in a little more to see if there is anything quick that could mitigate this; otherwise, configuring a smaller number than the default 200_000_000 would leave you more buffer and "probably" let you withstand bursts like this. The cause of the burst would be nice to understand as well ...

FYI @yhchiang-sol

michaelh-laine commented 2 years ago

Thanks for the very comprehensive response. I spoke to another validator who upgraded to 1.8.16 and they were seeing ledger sitting steady between 560-590GB, so it definitely seems like a case of "limited inflation" rather than "unlimited" which would've been more worrisome.

I've already reconfigured my start script to set the ledger limit to 100m shreds. As mentioned, I downgraded to 1.8.14 and wiped the ledger, so for now it's still doing fine (running with the default 200m; 36 hours since then, it's now at 294G used). I'll keep monitoring it, as it definitely sounds like something that could've happened on 1.8.14 just as well and isn't version specific.

Good chance I'll upgrade in the next few days with the 100m before I hit the max 200m shreds cap anyway though.

steviez commented 2 years ago

I spoke to another validator who upgraded to 1.8.16 and they were seeing ledger sitting steady between 560-590GB, so it definitely seems like a case of "limited inflation" rather than "unlimited" which would've been more worrisome.

Very interesting; I think I'm going to try to reproduce this locally with a node running v1.8.16. Our code comments claim to try to keep the ledger under 400 GB, so 560-590 GB would be upwards of a 40% increase. I'm pretty sure this is unexpected.

steviez commented 2 years ago

I was able to reproduce this on one of my nodes; I'm at ~485 GB right now, but I think it may start leveling out given that I've hit the point where LedgerCleanupService started marking slots for deletion:

image image

I should have taken a look at the commit history initially; https://github.com/solana-labs/solana/pull/21214 getting backported about 3 weeks ago is definitely the culprit. This one caught me off guard since the change to master went in months ago but just got marked for backport recently.

So, we'll have to adjust the default number to account for coding shreds being present. Additionally, I think one of the blockstore insertion metrics is a little confusing at the moment, so I will look at that as well. Some PRs will be coming soon after a bit more analysis.

t-nelson commented 2 years ago

Revert it, IMO :axe:

If someone wants it on MB, they can join the 1.9 stabilization effort

steviez commented 2 years ago

Summing up some conversation from Discord, we're not going to revert the change / apply a fix as 1.8 should hopefully be going away. Conversation here: https://discord.com/channels/428295358100013066/910937142182682656/946183375746641980

The recommendation stands that if you hit this on v1.8, use the following:

--limit-ledger-size 100000000

Recommendation for v1.9 TBD

michaelh-laine commented 2 years ago

Hey folks, just a follow-up on this: there've been at least half a dozen validators on Discord running into issues with this, as everyone expects --limit-ledger-size at its default to use 400-500GB. I think the docs should be updated at least, if you don't intend on changing anything here (what about changing the default value of --limit-ledger-size to 150m as another way to approach this?)

steviez commented 2 years ago

Hi @michaelh-laine - Thanks for the nudge; to be honest, this one kind of fell off my radar 😓. I had a fix (linked PR), and then discussion here convinced me of another way to go. That other path (clearing coding shreds early) shouldn't be too bad; let me try to knock that out today or tomorrow.

michaelh-laine commented 2 years ago

Hi @steviez, ok awesome, thanks. So the goal is still to maintain 400GB (or 500?) with that flag set? Then I can just inform validators on Discord, if it comes up again, that this will still be the case in a future release.

steviez commented 2 years ago

maintain 400GB (or 500?) with that flag set?

Correct. Given that the docs should have higher visibility than a code comment, I'm leaning towards the 500 GB number from the docs instead of the 400 GB number from the code comment (and I'll update the comment too).

Then I can just inform validators on discord if it comes up again that this will still be the case in a future release

Awesome; I'll link this issue and my PR when I get it posted. At the moment, I'm thinking it would make sense to backport to v1.9 and v1.10 as well, so I'll likely push for that

michaelh-laine commented 2 years ago

@steviez @t-nelson we've seen a massive drop in free disk space on our ledger disk in the past 24 hours. With --limit-ledger-size 100000000 we're currently using 496GB of disk space, of which rocksdb takes up 416G.

Just to remind, as before, snapshots are on a different disk. Here are the top-level directories and their usage on the ledger disk:

$ du -hd 1
4.0K    ./accounts.ledger-tool
73G     ./accounts
4.0K    ./tmp-genesis
416G    ./rocksdb
4.0K    ./snapshot.ledger-tool
12G     ./calculate_accounts_hash_cache

We're now at 7% remaining free disk space, so this is becoming critical; we may need a restart to resolve this issue, as I'm not sure what I can safely delete from the rocksdb folder should it become necessary. Some graphs below (free disk space over 7 days and 4 hours; the last restart was on 22 March, so this issue has cropped up suddenly after the validator was stable for about two weeks prior to the past 24 hours, on 1.9.13):

image

image

(Edited to replace mobile screenshot graphs with better ones)

Updated graph at 04:02 GMT, seems to be stabilizing: image

Update at 05:26 GMT, still recovering: image

michaelh-laine commented 2 years ago

One more graph just to show the severity of this event: something caused massive, (almost) uncontained growth in rocksdb for a 25-hour period, then it all went away again as if nothing had happened. No validator restart, no downtime at all. image

yhchiang-sol commented 2 years ago

There are several possibilities in my mind, but I probably need to check the rocksdb LOG to see what exactly happened.

Btw, the newly introduced hidden arguments in v1.10, --rocksdb-shred-compaction fifo together with --rocksdb-fifo-shred-storage-size <half-of-your-ledger-storage-size>, should help clean up storage and reduce disk write bytes. But the option requires a clean restart of your validator (meaning you will not be able to use the existing data in your ledger).

steviez commented 2 years ago

Your metrics look fairly normal over this period

I have a fix for this issue currently running on a test validator (testing got blocked by something else that needed investigation); the validator still needs some more time for the ledger to fill up. Once the ledger fills up and assuming metrics / data look good, I'll shift that PR out of draft: https://github.com/solana-labs/solana/pull/24186

steviez commented 2 years ago

I ran two nodes for a while: one with the tip of master and one with my change. Both showed larger growth than I'd expect, which I can't account for right now. My nodes were struggling to keep up with the cluster, but they were still making roots, which means the cleanup should have still been occurring.

This definitely needs more investigation; unfortunately, I have other things that need more immediate attention. I'll try to circle back to this in a week or two.

michaelh-laine commented 2 years ago

Had another event early this morning, same pattern. On 1.9.15. Ledger usage grows continuously for about 24 hours then regresses

image

michaelh-laine commented 2 years ago

@steviez our ledger is up to 1.1TB with the default --limit-ledger-size on 1.10.25.

I'm not sure if there's something on the network right now causing it. It looks like it had been hovering around 900GB but jumped to over 1TB starting between 18:00-19:00 UTC on Monday the 13th of June (about 10 hours into this epoch).

Ledger disk usage for the past month (only moved to this server in that time hence the graph shape):

image

I upgraded to 1.10.25 around 10:00 UTC on the 14th, so this trend started before that, while on 1.10.24.

I'm mentioning this because I've anecdotally observed an increase in validators on Discord running into issues with ledger disks running out of space as the ledger consumes more than expected.

yhchiang-sol commented 2 years ago

Thanks for reporting the recent cases, @michaelh-laine.

I haven't checked whether we have any recent changes that might increase the data rate, but overall rocksdb isn't smart enough to pick the right file to compact and reclaim storage based on the user-specified compaction filter. I can work on a solution that makes rocksdb smarter.

Meanwhile, if you're okay with restarting the validator without using the existing local ledger data, using FIFO compaction (by passing --rocksdb-shred-compaction fifo together with --rocksdb-fifo-shred-storage-size <size>) will also allow RocksDB to efficiently reclaim and limit its storage size. The suggested <size> is 50% of the ledger store size.
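For anyone curious what the FIFO option configures under the hood, here is a rough sketch using the rust-rocksdb crate (the validator's actual blockstore setup differs in its details; the path and the 250 GB cap below are purely illustrative): instead of relying on a compaction filter to eventually drop purged slots, the column is capped at a maximum total SST size and the oldest files are deleted outright once that cap is exceeded.

use rocksdb::{DBCompactionStyle, FifoCompactOptions, Options, DB};

fn main() {
    // Illustrative cap of ~250 GB, i.e. 50% of a ~500 GB ledger partition,
    // matching the sizing suggestion above.
    let max_shred_bytes: u64 = 250 * 1024 * 1024 * 1024;

    let mut fifo_opts = FifoCompactOptions::default();
    fifo_opts.set_max_table_files_size(max_shred_bytes);

    let mut opts = Options::default();
    opts.create_if_missing(true);
    opts.set_compaction_style(DBCompactionStyle::Fifo);
    opts.set_fifo_compaction_options(&fifo_opts);

    // With FIFO compaction, RocksDB deletes the oldest SST files once the
    // total size exceeds the cap, rather than rewriting/compacting them.
    let _db = DB::open(&opts, "/tmp/fifo-demo").expect("open RocksDB");
}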

michaelh-laine commented 2 years ago

I just upgraded to .25, so I'm not keen to restart now, but I will add those for the next one.

My ledger disk is 2.8T, so it's not an issue for me right now, but obviously if most validators are anticipating 500G usage, provision for that, and then see 200%+ of it in usage, it could lead to a lot of crashes, which sucks for the cluster.

steviez commented 2 years ago

@yhchiang-sol - While I realize that we are currently slammed looking at other things, I'm also hesitant to present upgrading to FIFO as the "fix" here, as I think we should at least understand the issue. Here are some metrics from host_id LA1NE; is this the node you're sharing data from, @michaelh-laine?

Here are the SST file sizes for him; notice that coding is taking up a significant chunk more than data:

image

While data shreds do have some zero padding that coding shreds don't, that proportion shouldn't be so big. Additionally, here are the total number of shreds (purple) and utilized disk as reported by ledger_cleanup_service:

image

Two points here:

  1. The limit-ledger-size argument refers to the number of data shreds, and we estimate 2000 bytes per data shred. Even if we reclaimed disk space instantly, 2000 bytes is low given our data/coding shred ratio (which the above graph sitting constantly above 2000 supports); see the quick calculation below.

  2. Accounting for the delay in reclaiming disk would require some additional buffer on this number.
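As a quick back-of-the-envelope check on that first point (treating the data/coding split as roughly 1:1 is an assumption for illustration, not a measured figure):

fn main() {
    let shred_limit: u64 = 200_000_000;    // --limit-ledger-size default (data shreds)
    let bytes_per_data_shred: u64 = 2_000; // per-shred estimate mentioned above
    let data_bytes = shred_limit * bytes_per_data_shred;
    println!("data shreds alone: ~{} GB", data_bytes / 1_000_000_000); // ~400 GB
    // If coding shreds take roughly as much space again (assumption based on
    // the SST graph above), on-disk usage heads toward ~800 GB, and that is
    // before accounting for purged-but-not-yet-compacted files.
}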

yhchiang-sol commented 2 years ago

Thanks, @steviez, for digging into the cause of the regression while I work on making rocksdb better at reclaiming disk space with compaction filters. Both are important, and now we have both covered!

michaelh-laine commented 2 years ago

Hi, yes, LA1INE is the correct host.

Sorry, this is rather embarrassing: I've got a 430G tarred ledger backup from when my node crashed with the durable nonce issue (pre network halt). So actual usage is about 700GB (rocksdb is 609G).

It's still higher than advertised but obviously much less significant.

However, the root cause of me digging out this issue was the multiple reports on Discord from other validators, so I believe that is still valid: either there needs to be a well-advertised change in what storage requirement to expect with --limit-ledger-size, or a way to bring it back down below 500G.

ljump12 commented 2 years ago

Hi, I'm a new validator starting with 1.10.25. I've been running for ~48 hours and have just run out of space. I'm using a hard drive stated as 1TB, which seems to have 835GB available.

Here's the size/contents of my ledger/rocksdb:

16K    ./lost+found
750G    ./rocksdb
4.0K    ./tmp-genesis
3.2G    ./snapshot
4.0K    ./remote
15G    ./calculate_accounts_hash_cache
835G    .

df -h

Filesystem      Size  Used Avail Use% Mounted on
udev            126G     0  126G   0% /dev
tmpfs            26G   16M   26G   1% /run
/dev/nvme0n1p3  875G   17G  814G   2% /
tmpfs           126G     0  126G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           126G     0  126G   0% /sys/fs/cgroup
/dev/nvme0n1p2  975M   94M  831M  11% /boot
tmpfs            26G     0   26G   0% /run/user/1000
/dev/nvme1n1p1  880G  835G     0 100% /mnt/ledger
tmpfs           128G   89G   40G  70% /mnt/ramdisk

and startup script:

solana-validator \
    --identity <my_identity> \
    --vote-account <my_vote_account> \
    --known-validator 7Np41oeYqPefeNQEHSv1UDhYrehxin3NStELsSKCT4K2 \
    --known-validator GdnSyH3YtwcxFvQrVVJMm1JhTS4QVX7MFsX56uJLUfiZ \
    --known-validator DE1bawNcRJB9rVm3buyMVfr8mBEoyyu73NBovf2oXJsJ \
    --known-validator CakcnaRDHka2gXyfbEd2d3xsvkJkqsLw2akB3zsN1D2S \
    --only-known-rpc \
    --ledger /mnt/ledger \
    --accounts /mnt/ramdisk \
    --rpc-port 8899 \
    --private-rpc \
    --dynamic-port-range 8000-8020 \
    --entrypoint entrypoint.mainnet-beta.solana.com:8001 \
    --entrypoint entrypoint2.mainnet-beta.solana.com:8001 \
    --entrypoint entrypoint3.mainnet-beta.solana.com:8001 \
    --entrypoint entrypoint4.mainnet-beta.solana.com:8001 \
    --entrypoint entrypoint5.mainnet-beta.solana.com:8001 \
    --expected-genesis-hash 5eykt4UsFv8P8NJdTREpY1vzqKqZKvdpKuc147dw2N9d \
    --wal-recovery-mode skip_any_corrupted_record \
    --enable-rpc-transaction-history \
    --limit-ledger-size
steviez commented 2 years ago

I've got a 430G tarred ledger backup from when my node crashed with the durable nonce issue (pre network halt). So actual usage is about 700GB (rocksdb is 609G).

No worries, at least we have a sane explanation for the discrepancy

However, the root cause of me digging out this issue was the multiple reports on Discord from other validators, so I believe that is still valid: either there needs to be a well-advertised change in what storage requirement to expect with --limit-ledger-size, or a way to bring it back down below 500G.

Agreed, this was a good bump, and @yhchiang-sol and I are both exploring some options / ideas

steviez commented 2 years ago

Thanks for reporting @ljump12; are you by chance reporting metrics? If so, would you mind sharing your identity so that I can look at data specific to your node? It would be interesting to see how that space is getting taken up, and v1.10.x has some metrics that enable this.

ljump12 commented 2 years ago

Thanks for reporting @ljump12; are you by chance reporting metrics? If so, would you mind sharing your identity so that I can look at data specific to your node? It would be interesting to see how that space is getting taken up, and v1.10.x has some metrics that enable this.

Hi, I was not reporting metrics. I have now restarted and am reporting metrics, although I'm not sure if it will be helpful now.

With the restart I also set --limit-ledger-size 100000000

My validator ID is: sTonoJtz1UBQxmRaoMoDAnBA3n4P9gPJfYaHnPQM5Xw

steviez commented 2 years ago

For the sake of transparency, @yhchiang-sol and I had a good chat last week on the matter. We think there are several things at play and are pursuing those. In terms of designing a better reclamation scheme, those ideas are covered in https://github.com/solana-labs/solana/issues/26003, and I will refer you there.

To capture some previous points since this issue has lots of comments and has been open for quite a while:

We have some options that we are looking at / weighing; some potentially as simple as reducing the limit-ledger-size default value. However, doing that would mean fewer nodes will have older blocks, and catching up to the network from large distances would become more difficult (maybe impossible) if there aren't enough nodes to serve repair requests on the dated slots. So, even the "simple" options have some not-so-simple potential tradeoffs

michaelh-laine commented 2 years ago

@steviez just another gentle nudge here, with a reference to another user on Discord who is confused by the docs stating 500G: https://discord.com/channels/428295358100013066/560174212967432193/996747549597769729

image

steviez commented 2 years ago

I think there is a good chance that https://github.com/solana-labs/solana/pull/26651 will help out significantly on this issue.

Some testing will 1) confirm that it is working correctly and 2) show whether the number in the docs needs to be adjusted to allow more buffer (I suspect this may be the case).

steviez commented 2 years ago

On the next v1.13 release (v1.13.5), things should be significantly better with: https://github.com/solana-labs/solana/pull/28721

steviez commented 1 year ago

Closing this one out - docs have been updated and patches have been in for a bit