I used `tm-load-test` against a tendermint node. I adjusted the `RetainBlocks` parameter and found that the `.tendermint/data` folder keeps growing in size over time. It seems like the folder size plateaus around ~1.5 GB, but further tests would be needed to confirm that beyond doubt.
For example, after several minutes of running `tm-load-test`, this is the data folder breakdown:

```
 12K  ./data//evidence.db
976K  ./data//state.db
 49M  ./data//blockstore.db
1.0G  ./data//cs.wal
444M  ./data//tx_index.db
1.5G  ./data/
```
I did a debug session with Callum and Lasaro. We found out that:
- The `RetainBlocks` parameter only affects the growth of `state.db` and `blockstore.db`.
- Regardless of `RetainBlocks`, the `state.db` file remains negligible in size in these experiments.
- With `RetainBlocks` set, the `blockstore.db` file does change in size, for example: `RetainBlocks: 1` -> blockstore ≈ ~40-50 MB; `RetainBlocks: 0` -> blockstore ≈ ~140 MB.

To summarize, it's unlikely that the pruning problem is due to Tendermint. There are a couple of follow-ups to investigate:
I agree that it is most likely the application. Could we run the `tm-load-test` with gaia to check:

```
416 Oct 27 10:34 application.db
160 Oct 27 10:35 snapshots
387 Oct 27 10:37 priv_validator_state.json
```
Also, can we get the data folder space usage at different points in time from the osmosis node @mircea-c?
tm config below
Wait, didn't we already know that `state.db` was storing ABCI responses for all history and that this was a culprit? I think this was fixed in a recent Tendermint release, but I'm not sure that would have made it out to Ceph's releases? Would need to ask Thane.
Also, depending on what the issue is, the KV store might not be enough to eliminate Tendermint as the culprit (we would need to ensure the kvstore app and the tx load are flexing all degrees of freedom of Tendermint/ABCI).
Data collected by Ceph team:
Copying here all findings from Slack for our future reference.
Anca:
`tx_index.db`, `state.db` from the Tendermint side. Also, `application.db` should not grow unbounded with state pruning. Could we also have the `app.toml` and `config.toml`?

Shawn Rypstra:
Notes:
- The `discard_abci_responses = false` param should be set to `true` for validator/sentry nodes to save on storage.
- We found out that all the different DBs seem to increase in size despite `pruning-keep-recent = 500` / `pruning-keep-every = 0` / `pruning-interval = 100`.
- We shifted attention from the pruning parameters to compaction. It seems that the problem is that the goleveldb backend in Tendermint is not able to compact correctly.
- [ ] @mircea-c to set `discard_abci_responses = true` on validator/sentry nodes
We used the `experimental-compact-goleveldb` command; we needed a couple of iterations to get used to it.

Shawn: This is the recipe for both Gaia and Osmosis:
1. check rpc status output
2. stop gaiad full node
3. check disk usage before pruning (`tendermint experimental-compact-goleveldb`)
4. prune database with `tendermint experimental-compact-goleveldb`
5. check disk usage after pruning
6. start gaiad full node
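For context, a full-range goleveldb compaction of one of the node's databases can be triggered roughly like this. This is only a minimal sketch, assuming the `github.com/syndtr/goleveldb` library and a hypothetical database path; it is an illustration of the compaction call, not the actual `experimental-compact-goleveldb` implementation.

```go
package main

import (
	"log"

	"github.com/syndtr/goleveldb/leveldb"
	"github.com/syndtr/goleveldb/leveldb/util"
)

func main() {
	// Hypothetical path to one of the node's goleveldb databases.
	const path = ".tendermint/data/blockstore.db"

	db, err := leveldb.OpenFile(path, nil)
	if err != nil {
		log.Fatalf("open %s: %v", path, err)
	}
	defer db.Close()

	// A zero-valued Range asks goleveldb to compact the entire key space,
	// rewriting SSTables and dropping entries that were deleted (tombstoned).
	if err := db.CompactRange(util.Range{}); err != nil {
		log.Fatalf("compact %s: %v", path, err)
	}
	log.Printf("compaction finished for %s", path)
}
```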
Anca: compiled the data into plots for osmosis (cosmos looks pretty much the same)
With the following setup (the configuration keeps the last 100 application states, the last 100 blocks and Tendermint states, and prunes at every height):

```
"evidence": {
  "max_age_num_blocks": "100",
  "max_age_duration": "500000000000",
```

```
pruning = "custom"
pruning-keep-recent = "100"
pruning-keep-every = "0"
pruning-interval = "1"
min-retain-blocks = 100
discard_abci_responses = true
```
I wrote a script that gets `du` (kbytes) at every height and compacts every N blocks, storing the data in .csv files. Then I plotted them with R.
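As an illustration of the measurement part only, here is a minimal sketch (not Anca's script, which sampled per block height and also ran the compaction command every N blocks): it samples the size of the data directory on a fixed interval and appends one CSV row per sample.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
	"time"
)

// dirSizeKB walks dir and returns the total file size in kilobytes,
// roughly what `du -sk dir` reports.
func dirSizeKB(dir string) (int64, error) {
	var total int64
	err := filepath.Walk(dir, func(_ string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if !info.IsDir() {
			total += info.Size()
		}
		return nil
	})
	return total / 1024, err
}

func main() {
	if len(os.Args) < 2 {
		log.Fatal("usage: dusampler <data-dir>")
	}
	dataDir := os.Args[1] // e.g. ~/.tendermint/data

	out, err := os.Create("du_samples.csv")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()
	fmt.Fprintln(out, "sample,size_kb")

	// Sample on a fixed interval; the original script instead sampled
	// per height and compacted every N blocks.
	for i := 0; ; i++ {
		kb, err := dirSizeKB(dataDir)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Fprintf(out, "%d,%d\n", i, kb)
		time.Sleep(5 * time.Second)
	}
}
```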
Here are a couple of results:
My summary at this point:
Notes: I modified the Tendermint experimental CLI to also compact `application.db` and `tx_index.db`. Scripts and some minimal info here
Last week, @ancazamfir, @thanethomson and I discussed next steps to continue this investigation.
One thing we decided was to get a better understanding of the goleveldb behaviour in a "vanilla" environment, i.e., separately from Tendermint's manipulation of the database, pruning configuration, etc. Specifically, we decided to write a simple script that inserts and deletes records in a goleveldb database, while observing how the size of the db fluctuates over time with the different insert/delete steps.
The result is here: https://github.com/tendermint/tendermint/commit/e60eca5169bbffe20b133a5e35c0b7ce457957cf
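For reference, here is a minimal sketch of that kind of probe. It is not the linked commit; it assumes the `github.com/syndtr/goleveldb` library, a throwaway `./probe.db` directory, and arbitrary record counts and value sizes.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"

	"github.com/syndtr/goleveldb/leveldb"
)

// dirSizeKB returns the on-disk size of the database directory in KB.
func dirSizeKB(dir string) int64 {
	var total int64
	filepath.Walk(dir, func(_ string, info os.FileInfo, err error) error {
		if err == nil && !info.IsDir() {
			total += info.Size()
		}
		return nil
	})
	return total / 1024
}

func main() {
	const path = "./probe.db" // throwaway directory, removed at the end
	defer os.RemoveAll(path)

	db, err := leveldb.OpenFile(path, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	value := make([]byte, 100) // 100-byte dummy value per record (arbitrary)

	// Insert 10000 records, then delete 9000, and report the directory
	// size after each step to see whether deletes reclaim space.
	for i := 0; i < 10000; i++ {
		if err := db.Put([]byte(fmt.Sprintf("key-%08d", i)), value, nil); err != nil {
			log.Fatal(err)
		}
	}
	fmt.Printf("after insert: %d KB on disk\n", dirSizeKB(path))

	for i := 0; i < 9000; i++ {
		if err := db.Delete([]byte(fmt.Sprintf("key-%08d", i)), nil); err != nil {
			log.Fatal(err)
		}
	}
	fmt.Printf("after delete: %d KB on disk\n", dirSizeKB(path))
}
```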
The summary can be seen in this output:

Steps:

```
name     size (kb)  records #
initial          0          0
insert        1289      10000
delete        1562       1000
insert        2851      11000
delete        3123       2000
delete        3184          0
```
Preliminary observations:
The database size only grows, even as records are deleted (see the `size (kb)` column).

Thanks for this @adizere! Would it be possible to instead test with larger blobs of data, written more infrequently?
We still have an open issue in Tendermint to quantify the workloads encountered by a production node precisely (https://github.com/tendermint/tendermint/issues/9773), but from what I've gathered so far, the general write pattern in Tendermint is to dump very large objects to the store once every couple of seconds.
I have a feeling this may end up influencing the way that LevelDB reserves space in its internal data structures.
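To make that workload shape concrete, a variant of the probe above could write larger values on a fixed interval. The 1 MB value size and 2-second period below are placeholders, not measured Tendermint figures.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
	"time"

	"github.com/syndtr/goleveldb/leveldb"
)

// dirSizeKB returns the on-disk size of the database directory in KB.
func dirSizeKB(dir string) int64 {
	var total int64
	filepath.Walk(dir, func(_ string, info os.FileInfo, err error) error {
		if err == nil && !info.IsDir() {
			total += info.Size()
		}
		return nil
	})
	return total / 1024
}

func main() {
	const path = "./blob-probe.db" // throwaway directory
	defer os.RemoveAll(path)

	db, err := leveldb.OpenFile(path, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	blob := make([]byte, 1<<20) // ~1 MB per write (assumed blob size)

	// Write one large blob every 2 seconds, mimicking "few large writes,
	// infrequently" rather than many small ones, and report disk usage.
	for i := 0; i < 30; i++ {
		if err := db.Put([]byte(fmt.Sprintf("blob-%04d", i)), blob, nil); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("write %d: %d KB on disk\n", i, dirSizeKB(path))
		time.Sleep(2 * time.Second)
	}
}
```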
I am looking at storage workloads for Tendermint and came back to this thread. While the `tx_index.db` relative growth rate is not very large, it seems to me that the overall % of storage taken by it is. I will double-check this in the code (I took a quick look now), but if we prune blocks, we do not seem to prune the index. The index stores the following:

- `txindex` - the transaction hash of every transaction, plus for every height all transaction hashes, plus for every attribute of every event within a transaction its attributes (attribute key, value, height, index flag).
- `blockindex` - 1 int64 per height, plus per event attribute 1 string containing attribute key, value, height.

Note that the Tendermint kvstore application does not generate block events. The indexer patch for 0.37.1 enables event generation if a flag is set.
Also, I understand that this itself may not be related to the problem of why compaction is not working but I think it relates to the discussion here.
Thanks @jmalicevic! I think the issue title might be wrong. IMO we are trying to figure out why we cannot have relatively constant disk usage over time. A good product should be able to run for a long time without running out of disk space. But what we see:

- `tx_index.db` pruning doesn't happen (as you point out)
- for `application.db`, `state.db`, `blockchain.db` and `tx_index.db`, the single state size seems to grow over time.

On the last point, I did some gaia tests with no transactions (empty blocks) and used a 100-height pruning window for both the application state and blocks.
One would expect the disk space of the states `[H, H+100)` to stay relatively constant as H increases. But this is what we see:

- `application.db` and `tx_index.db` grow at a higher rate, and it's hard to blame it all on the compaction issues.
- For `tx_index.db`, the growth rate is in fact pretty large, and over time it reaches the size of `application.db`, so I think it's important to fix this if possible.
- `application.db` (with pruning seemingly working) grows at a rate that is not justified by leveldb compaction issues.
- `blockchain.db` and `state.db` increase at a smaller rate, but it is still a concern; it could be justified by leveldb inefficiencies. @thanethomson brought up the fact that a state attribute might naturally grow in size at higher heights. `height` is one example, and therefore it is inevitable to see some state size increase over time.
Mircea:
Greg:
Thane, Adi: