informalsystems / interchain

This repository is purely experimental. It is meant to track cross-stack issues: issues that we do not know where to place (is it a Tendermint, an SDK, or an IBC-go problem?), or that have dependencies in multiple repositories, potentially across multiple organizations (cosmos, informalsystems).

Investigate why Tendermint disk storage keeps growing over time #1

Closed adizere closed 1 year ago

adizere commented 2 years ago

Mircea:

Pruning doesn’t work. The Osmosis node grows by 8 GB per day with prune=everything and all the available pruning options.

Greg:

Run a chain with the KV store app and send transactions. This is most likely an SDK issue. Use tm-load-test against a KV-store-backed network; this would help eliminate Tendermint as the culprit.

Thane, Adi:

AS and TT will give this a try. If it proves to be an SDK issue, we’ll be blocked. TT can help facilitate debugging, and we would need someone from the team to fix it, capture it in an issue, and shepherd it.

adizere commented 2 years ago

I used tm-load-test against a Tendermint node. I adjusted the RetainBlocks parameter and found that the .tendermint/data folder keeps growing over time. The folder size seems to plateau around ~1.5 GB, but further tests are needed to confirm that beyond doubt.

For example, after several minutes of running tm-load-test, this is the data folder breakdown:

 12K    ./data//evidence.db
976K    ./data//state.db
 49M    ./data//blockstore.db
1.0G    ./data//cs.wal
444M    ./data//tx_index.db
1.5G    ./data/

I did a debug session with Callum and Lasaro. We found out that:

  1. cs.wal grows up to 1 GB and is the main contributor to the data folder's size, but it does not grow beyond 1 GB. The cap comes from the autofile library's 1 GB size-limit parameter.
  2. The RetainBlocks parameter only affects the growth of state.db and blockstore.db (a sketch of the underlying retain-height mechanism follows this list). Regardless of RetainBlocks, the state file remains negligible in size in these experiments. With RetainBlocks set, the blockstore file does change in size, for example:
    • RetainBlocks: 1 -> blockstore ≈ 40-50 MB;
    • RetainBlocks: 0 -> blockstore ≈ 140 MB.
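For context on how a retain-blocks style setting takes effect: the ABCI application tells Tendermint how far back to keep blocks by returning a retain height from `Commit`, and Tendermint then prunes the blockstore below that height. Below is a minimal, hypothetical sketch of that mechanism against Tendermint v0.34 Go types (the server address and flag are assumptions; this is not the exact kvstore app used in the experiment):

```go
package main

import (
	"flag"
	"log"
	"os"
	"os/signal"
	"syscall"

	abciserver "github.com/tendermint/tendermint/abci/server"
	abcitypes "github.com/tendermint/tendermint/abci/types"
)

// retainApp is a no-op ABCI app whose only job is to ask Tendermint
// to prune blocks older than `retainBlocks` behind the latest height.
type retainApp struct {
	abcitypes.BaseApplication
	height       int64
	retainBlocks int64
}

func (a *retainApp) Commit() abcitypes.ResponseCommit {
	a.height++
	var retainHeight int64
	if a.retainBlocks > 0 && a.height > a.retainBlocks {
		// Tendermint deletes blocks below this height from blockstore.db.
		retainHeight = a.height - a.retainBlocks
	}
	return abcitypes.ResponseCommit{RetainHeight: retainHeight}
}

func main() {
	retain := flag.Int64("retain-blocks", 1, "number of recent blocks to keep (0 = keep everything)")
	flag.Parse()

	app := &retainApp{retainBlocks: *retain}
	srv := abciserver.NewSocketServer("tcp://127.0.0.1:26658", app)
	if err := srv.Start(); err != nil {
		log.Fatal(err)
	}
	defer srv.Stop()

	// Run until interrupted; point a Tendermint node's proxy_app at this server.
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGINT, syscall.SIGTERM)
	<-sig
}
```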

To summarize, it's unlikely that the pruning problem is due to Tendermint. There are a couple of follow-ups to investigate:

ancazamfir commented 2 years ago

I agree that it is most likely the application. Could we run tm-load-test with gaia to check:

416 Oct 27 10:34 application.db
160 Oct 27 10:35 snapshots
387 Oct 27 10:37 priv_validator_state.json

Also, can we get the data folder space usage at different points in time from the osmosis node @mircea-c?
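If helpful, a small stand-alone Go helper along these lines (the paths and the interval are assumptions) could log a du-style breakdown of the data directory at regular intervals:

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"time"
)

// dirSize returns the total size in bytes of all regular files under root.
func dirSize(root string) (int64, error) {
	var total int64
	err := filepath.WalkDir(root, func(_ string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.Type().IsRegular() {
			info, err := d.Info()
			if err != nil {
				return err
			}
			total += info.Size()
		}
		return nil
	})
	return total, err
}

func main() {
	// Assumed layout of a Tendermint home; adjust for an SDK chain (e.g. ~/.osmosisd/data).
	dataDir := os.ExpandEnv("$HOME/.tendermint/data")
	subdirs := []string{"evidence.db", "state.db", "blockstore.db", "cs.wal", "tx_index.db"}

	for {
		fmt.Println(time.Now().Format(time.RFC3339))
		for _, d := range subdirs {
			if size, err := dirSize(filepath.Join(dataDir, d)); err == nil {
				fmt.Printf("  %-16s %8.1f MB\n", d, float64(size)/(1024*1024))
			}
		}
		time.Sleep(1 * time.Minute)
	}
}
```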

adizere commented 2 years ago

tm config below

~/.tendermint/config/config.toml

```toml
# This is a TOML config file.
# For more information, see https://github.com/toml-lang/toml

# NOTE: Any path below can be absolute (e.g. "/var/myawesomeapp/data") or
# relative to the home directory (e.g. "data"). The home directory is
# "$HOME/.tendermint" by default, but could be changed via $TMHOME env variable
# or --home cmd flag.

#######################################################################
### Main Base Config Options ###
#######################################################################

# TCP or UNIX socket address of the ABCI application,
# or the name of an ABCI application compiled in with the Tendermint binary
proxy_app = "tcp://127.0.0.1:26658"

# A custom human readable name for this node
moniker = "nix"

# If this node is many blocks behind the tip of the chain, FastSync
# allows them to catchup quickly by downloading blocks in parallel
# and verifying their commits
fast_sync = true

# Database backend: goleveldb | cleveldb | boltdb | rocksdb | badgerdb
# * goleveldb (github.com/syndtr/goleveldb - most popular implementation)
#   - pure go
#   - stable
# * cleveldb (uses levigo wrapper)
#   - fast
#   - requires gcc
#   - use cleveldb build tag (go build -tags cleveldb)
# * boltdb (uses etcd's fork of bolt - github.com/etcd-io/bbolt)
#   - EXPERIMENTAL
#   - may be faster is some use-cases (random reads - indexer)
#   - use boltdb build tag (go build -tags boltdb)
# * rocksdb (uses github.com/tecbot/gorocksdb)
#   - EXPERIMENTAL
#   - requires gcc
#   - use rocksdb build tag (go build -tags rocksdb)
# * badgerdb (uses github.com/dgraph-io/badger)
#   - EXPERIMENTAL
#   - use badgerdb build tag (go build -tags badgerdb)
db_backend = "goleveldb"

# Database directory
db_dir = "data"

# Output level for logging, including package level options
log_level = "info"

# Output format: 'plain' (colored text) or 'json'
log_format = "plain"

##### additional base config options #####

# Path to the JSON file containing the initial validator set and other meta data
genesis_file = "config/genesis.json"

# Path to the JSON file containing the private key to use as a validator in the consensus protocol
priv_validator_key_file = "config/priv_validator_key.json"

# Path to the JSON file containing the last sign state of a validator
priv_validator_state_file = "data/priv_validator_state.json"

# TCP or UNIX socket address for Tendermint to listen on for
# connections from an external PrivValidator process
priv_validator_laddr = ""

# Path to the JSON file containing the private key to use for node authentication in the p2p protocol
node_key_file = "config/node_key.json"

# Mechanism to connect to the ABCI application: socket | grpc
abci = "socket"

# If true, query the ABCI app on connecting to a new peer
# so the app can decide if we should keep the connection or not
filter_peers = false

#######################################################################
### Advanced Configuration Options ###
#######################################################################

#######################################################
### RPC Server Configuration Options ###
#######################################################
[rpc]

# TCP or UNIX socket address for the RPC server to listen on
laddr = "tcp://127.0.0.1:26657"

# A list of origins a cross-domain request can be executed from
# Default value '[]' disables cors support
# Use '["*"]' to allow any origin
cors_allowed_origins = []

# A list of methods the client is allowed to use with cross-domain requests
cors_allowed_methods = ["HEAD", "GET", "POST", ]

# A list of non simple headers the client is allowed to use with cross-domain requests
cors_allowed_headers = ["Origin", "Accept", "Content-Type", "X-Requested-With", "X-Server-Time", ]

# TCP or UNIX socket address for the gRPC server to listen on
# NOTE: This server only supports /broadcast_tx_commit
grpc_laddr = ""

# Maximum number of simultaneous connections.
# Does not include RPC (HTTP&WebSocket) connections. See max_open_connections
# If you want to accept a larger number than the default, make sure
# you increase your OS limits.
# 0 - unlimited.
# Should be < {ulimit -Sn} - {MaxNumInboundPeers} - {MaxNumOutboundPeers} - {N of wal, db and other open files}
# 1024 - 40 - 10 - 50 = 924 = ~900
grpc_max_open_connections = 900

# Activate unsafe RPC commands like /dial_seeds and /unsafe_flush_mempool
unsafe = false

# Maximum number of simultaneous connections (including WebSocket).
# Does not include gRPC connections. See grpc_max_open_connections
# If you want to accept a larger number than the default, make sure
# you increase your OS limits.
# 0 - unlimited.
# Should be < {ulimit -Sn} - {MaxNumInboundPeers} - {MaxNumOutboundPeers} - {N of wal, db and other open files}
# 1024 - 40 - 10 - 50 = 924 = ~900
max_open_connections = 900

# Maximum number of unique clientIDs that can /subscribe
# If you're using /broadcast_tx_commit, set to the estimated maximum number
# of broadcast_tx_commit calls per block.
max_subscription_clients = 100

# Maximum number of unique queries a given client can /subscribe to
# If you're using GRPC (or Local RPC client) and /broadcast_tx_commit, set to
# the estimated # maximum number of broadcast_tx_commit calls per block.
max_subscriptions_per_client = 5

# Experimental parameter to specify the maximum number of events a node will
# buffer, per subscription, before returning an error and closing the
# subscription. Must be set to at least 100, but higher values will accommodate
# higher event throughput rates (and will use more memory).
experimental_subscription_buffer_size = 200

# Experimental parameter to specify the maximum number of RPC responses that
# can be buffered per WebSocket client. If clients cannot read from the
# WebSocket endpoint fast enough, they will be disconnected, so increasing this
# parameter may reduce the chances of them being disconnected (but will cause
# the node to use more memory).
#
# Must be at least the same as "experimental_subscription_buffer_size",
# otherwise connections could be dropped unnecessarily. This value should
# ideally be somewhat higher than "experimental_subscription_buffer_size" to
# accommodate non-subscription-related RPC responses.
experimental_websocket_write_buffer_size = 200

# If a WebSocket client cannot read fast enough, at present we may
# silently drop events instead of generating an error or disconnecting the
# client.
#
# Enabling this experimental parameter will cause the WebSocket connection to
# be closed instead if it cannot read fast enough, allowing for greater
# predictability in subscription behaviour.
experimental_close_on_slow_client = false

# How long to wait for a tx to be committed during /broadcast_tx_commit.
# WARNING: Using a value larger than 10s will result in increasing the
# global HTTP write timeout, which applies to all connections and endpoints.
# See https://github.com/tendermint/tendermint/issues/3435
timeout_broadcast_tx_commit = "10s"

# Maximum size of request body, in bytes
max_body_bytes = 1000000

# Maximum size of request header, in bytes
max_header_bytes = 1048576

# The path to a file containing certificate that is used to create the HTTPS server.
# Might be either absolute path or path related to Tendermint's config directory.
# If the certificate is signed by a certificate authority,
# the certFile should be the concatenation of the server's certificate, any intermediates,
# and the CA's certificate.
# NOTE: both tls_cert_file and tls_key_file must be present for Tendermint to create HTTPS server.
# Otherwise, HTTP server is run.
tls_cert_file = ""

# The path to a file containing matching private key that is used to create the HTTPS server.
# Might be either absolute path or path related to Tendermint's config directory.
# NOTE: both tls-cert-file and tls-key-file must be present for Tendermint to create HTTPS server.
# Otherwise, HTTP server is run.
tls_key_file = ""

# pprof listen address (https://golang.org/pkg/net/http/pprof)
pprof_laddr = ""

#######################################################
### P2P Configuration Options ###
#######################################################
[p2p]

# Address to listen for incoming connections
laddr = "tcp://0.0.0.0:26656"

# Address to advertise to peers for them to dial
# If empty, will use the same port as the laddr,
# and will introspect on the listener or use UPnP
# to figure out the address. ip and port are required
# example: 159.89.10.97:26656
external_address = ""

# Comma separated list of seed nodes to connect to
seeds = ""

# Comma separated list of nodes to keep persistent connections to
persistent_peers = ""

# UPNP port forwarding
upnp = false

# Path to address book
addr_book_file = "config/addrbook.json"

# Set true for strict address routability rules
# Set false for private or local networks
addr_book_strict = true

# Maximum number of inbound peers
max_num_inbound_peers = 40

# Maximum number of outbound peers to connect to, excluding persistent peers
max_num_outbound_peers = 10

# List of node IDs, to which a connection will be (re)established ignoring any existing limits
unconditional_peer_ids = ""

# Maximum pause when redialing a persistent peer (if zero, exponential backoff is used)
persistent_peers_max_dial_period = "0s"

# Time to wait before flushing messages out on the connection
flush_throttle_timeout = "100ms"

# Maximum size of a message packet payload, in bytes
max_packet_msg_payload_size = 1024

# Rate at which packets can be sent, in bytes/second
send_rate = 5120000

# Rate at which packets can be received, in bytes/second
recv_rate = 5120000

# Set true to enable the peer-exchange reactor
pex = true

# Seed mode, in which node constantly crawls the network and looks for
# peers. If another node asks it for addresses, it responds and disconnects.
#
# Does not work if the peer-exchange reactor is disabled.
seed_mode = false

# Comma separated list of peer IDs to keep private (will not be gossiped to other peers)
private_peer_ids = ""

# Toggle to disable guard against peers connecting from the same ip.
allow_duplicate_ip = false

# Peer connection configuration.
handshake_timeout = "20s"
dial_timeout = "3s"

#######################################################
### Mempool Configuration Option ###
#######################################################
[mempool]

# Mempool version to use:
#   1) "v0" - (default) FIFO mempool.
#   2) "v1" - prioritized mempool.
version = "v0"

recheck = true
broadcast = true
wal_dir = ""

# Maximum number of transactions in the mempool
size = 5000

# Limit the total size of all txs in the mempool.
# This only accounts for raw transactions (e.g. given 1MB transactions and
# max_txs_bytes=5MB, mempool will only accept 5 transactions).
max_txs_bytes = 1073741824

# Size of the cache (used to filter transactions we saw earlier) in transactions
cache_size = 10000

# Do not remove invalid transactions from the cache (default: false)
# Set to true if it's not possible for any invalid transaction to become valid
# again in the future.
keep-invalid-txs-in-cache = false

# Maximum size of a single transaction.
# NOTE: the max size of a tx transmitted over the network is {max_tx_bytes}.
max_tx_bytes = 1048576

# Maximum size of a batch of transactions to send to a peer
# Including space needed by encoding (one varint per transaction).
# XXX: Unused due to https://github.com/tendermint/tendermint/issues/5796
max_batch_bytes = 0

# ttl-duration, if non-zero, defines the maximum amount of time a transaction
# can exist for in the mempool.
#
# Note, if ttl-num-blocks is also defined, a transaction will be removed if it
# has existed in the mempool at least ttl-num-blocks number of blocks or if it's
# insertion time into the mempool is beyond ttl-duration.
ttl-duration = "0s"

# ttl-num-blocks, if non-zero, defines the maximum number of blocks a transaction
# can exist for in the mempool.
#
# Note, if ttl-duration is also defined, a transaction will be removed if it
# has existed in the mempool at least ttl-num-blocks number of blocks or if
# it's insertion time into the mempool is beyond ttl-duration.
ttl-num-blocks = 0

#######################################################
### State Sync Configuration Options ###
#######################################################
[statesync]
# State sync rapidly bootstraps a new node by discovering, fetching, and restoring a state machine
# snapshot from peers instead of fetching and replaying historical blocks. Requires some peers in
# the network to take and serve state machine snapshots. State sync is not attempted if the node
# has any local state (LastBlockHeight > 0). The node will have a truncated block history,
# starting from the height of the snapshot.
enable = false

# RPC servers (comma-separated) for light client verification of the synced state machine and
# retrieval of state data for node bootstrapping. Also needs a trusted height and corresponding
# header hash obtained from a trusted source, and a period during which validators can be trusted.
#
# For Cosmos SDK-based chains, trust_period should usually be about 2/3 of the unbonding time (~2
# weeks) during which they can be financially punished (slashed) for misbehavior.
rpc_servers = ""
trust_height = 0
trust_hash = ""
trust_period = "168h0m0s"

# Time to spend discovering snapshots before initiating a restore.
discovery_time = "15s"

# Temporary directory for state sync snapshot chunks, defaults to the OS tempdir (typically /tmp).
# Will create a new, randomly named directory within, and remove it when done.
temp_dir = ""

# The timeout duration before re-requesting a chunk, possibly from a different
# peer (default: 1 minute).
chunk_request_timeout = "10s"

# The number of concurrent chunk fetchers to run (default: 1).
chunk_fetchers = "4"

#######################################################
### Fast Sync Configuration Connections ###
#######################################################
[fastsync]

# Fast Sync version to use:
#   1) "v0" (default) - the legacy fast sync implementation
#   2) "v1" - refactor of v0 version for better testability
#   2) "v2" - complete redesign of v0, optimized for testability & readability
version = "v0"

#######################################################
### Consensus Configuration Options ###
#######################################################
[consensus]

wal_file = "data/cs.wal/wal"

# How long we wait for a proposal block before prevoting nil
timeout_propose = "3s"

# How much timeout_propose increases with each round
timeout_propose_delta = "500ms"

# How long we wait after receiving +2/3 prevotes for “anything” (ie. not a single block or nil)
timeout_prevote = "1s"

# How much the timeout_prevote increases with each round
timeout_prevote_delta = "500ms"

# How long we wait after receiving +2/3 precommits for “anything” (ie. not a single block or nil)
timeout_precommit = "1s"

# How much the timeout_precommit increases with each round
timeout_precommit_delta = "500ms"

# How long we wait after committing a block, before starting on the new
# height (this gives us a chance to receive some more precommits, even
# though we already have +2/3).
timeout_commit = "1s"

# How many blocks to look back to check existence of the node's consensus votes before joining consensus
# When non-zero, the node will panic upon restart
# if the same consensus key was used to sign {double_sign_check_height} last blocks.
# So, validators should stop the state machine, wait for some blocks, and then restart the state machine to avoid panic.
double_sign_check_height = 0

# Make progress as soon as we have all the precommits (as if TimeoutCommit = 0)
skip_timeout_commit = false

# EmptyBlocks mode and possible interval between empty blocks
create_empty_blocks = true
create_empty_blocks_interval = "0s"

# Reactor sleep duration parameters
peer_gossip_sleep_duration = "100ms"
peer_query_maj23_sleep_duration = "2s"

#######################################################
### Storage Configuration Options ###
#######################################################
[storage]

# Set to true to discard ABCI responses from the state store, which can save a
# considerable amount of disk space. Set to false to ensure ABCI responses are
# persisted. ABCI responses are required for /block_results RPC queries, and to
# reindex events in the command-line tool.
discard_abci_responses = false

#######################################################
### Transaction Indexer Configuration Options ###
#######################################################
[tx_index]

# What indexer to use for transactions
#
# The application will set which txs to index. In some cases a node operator will be able
# to decide which txs to index based on configuration set in the application.
#
# Options:
#   1) "null"
#   2) "kv" (default) - the simplest possible indexer, backed by key-value storage (defaults to levelDB; see DBBackend).
#     - When "kv" is chosen "tx.height" and "tx.hash" will always be indexed.
#   3) "psql" - the indexer services backed by PostgreSQL.
# When "kv" or "psql" is chosen "tx.height" and "tx.hash" will always be indexed.
indexer = "kv"

# The PostgreSQL connection configuration, the connection format:
#   postgresql://:@:/?
psql-conn = ""

#######################################################
### Instrumentation Configuration Options ###
#######################################################
[instrumentation]

# When true, Prometheus metrics are served under /metrics on
# PrometheusListenAddr.
# Check out the documentation for the list of available metrics.
prometheus = false

# Address to listen for Prometheus collector(s) connections
prometheus_listen_addr = ":26660"

# Maximum number of simultaneous connections.
# If you want to accept a larger number than the default, make sure
# you increase your OS limits.
# 0 - unlimited.
max_open_connections = 3

# Instrumentation namespace
namespace = "tendermint"
```
ebuchman commented 2 years ago

Wait, didn't we already know that state.db was storing ABCI responses for all history, and that this was a culprit? I think this was fixed in a recent Tendermint release, but I'm not sure that would have made it out to Ceph's releases? Would need to ask Thane.

Also, depending on what the issue is, the KV store might not be enough to eliminate Tendermint as the culprit (we would need to ensure the KV store and the tx load are flexing all degrees of freedom of Tendermint/ABCI).

ancazamfir commented 2 years ago

Data collected by Ceph team: osmo_hub_db_storage copy

adizere commented 1 year ago

Copying here all findings from Slack for our future reference.

October 27

Anca:

Shawn Rypstra:

Notes:

Anca's conclusion: Looked at the code, docs, and data. blockstore.db, tx_index.db, application.db and state.db all grow over time regardless of the configuration and chain. What we are interested in [here](https://twitter.com/valnodes/status/1508527814316011520) is the disk storage (vs cache/memory storage). There is a good high-level summary here that explains the config parameters for the different types of pruning. For disk state, at every new block, state will be deleted if it is below a certain height. How much is retained is dictated by:

- min-retain-blocks in app.toml,
- max_age_num_blocks and max_age_duration in genesis.json (evidence related).

With min-retain-blocks = 0 in your app.toml, this means that state with height more than max_age_num_blocks behind the last block height, and with time older than max_age_duration relative to the last block time, gets deleted. This old data is actually deleted from the DBs, but goleveldb compaction does not happen. There is a command available in v0.34.20 that you can run, but you need to stop the node: tendermint experimental-compact-goleveldb. You can read some relevant detail (e.g. cosmos-pruner) in the PR description here: https://github.com/tendermint/tendermint/pull/8564
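For reference, tendermint experimental-compact-goleveldb essentially forces a full-range compaction of the goleveldb databases. A rough, hypothetical stand-alone equivalent for a single database (run only while the node is stopped; the path is an assumption) could look like this:

```go
package main

import (
	"log"
	"os"

	"github.com/syndtr/goleveldb/leveldb"
	"github.com/syndtr/goleveldb/leveldb/util"
)

func main() {
	// Path to one of the goleveldb databases, e.g. ~/.tendermint/data/blockstore.db.
	path := os.ExpandEnv("$HOME/.tendermint/data/blockstore.db")

	db, err := leveldb.OpenFile(path, nil)
	if err != nil {
		log.Fatalf("open %s: %v", path, err)
	}
	defer db.Close()

	// A nil start/limit compacts the whole key range, reclaiming space
	// left behind by deleted (pruned) entries.
	if err := db.CompactRange(util.Range{Start: nil, Limit: nil}); err != nil {
		log.Fatalf("compact %s: %v", path, err)
	}
	log.Printf("compacted %s", path)
}
```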

Takeaways:

Nov 18

Full conversation:

- Anca: Could we run the script again, but this time stop the full node, run `tendermint experimental-compact-goleveldb`, restart the full node, and get the disk usage info?
- Shawn: Yes.
  - This is similar to [cosmprund](https://github.com/binaryholdings/cosmprund)
  - It is great for nodes that you can stop and run it on... but you can't stop a validator node for that long, so it has very specific use cases
- Anca: Curious how long it takes to compact one hour's worth of garbage accumulation.
- Shawn: Yes, but it looks bad missing blocks; from a PR standpoint we can't not sign blocks for any amount of time.
  - The current procedure is to run a second node on which you can prune the DB, then stop the validator node, copy the pruned DB in, and restart the validator
  - That way the downtime is only how long it takes to copy the pruned DB over... which is minutes, not hours
- Shawn: So I ran `tendermint experimental-compact-goleveldb` (from Tendermint version 0.34.21) on an Osmosis DB at 324 GB; it ran for 2 minutes, barely reduced the DB size, and actually corrupted the DB when I tried to restart the node.

```
cat osmo-disk-before-prune.md
366810 /home/shawn/.osmosisd/data
cat osmo-disk-after-prune.md
367265 /home/shawn/.osmosisd/data
```

Ran for 1min 52s.

Takeaways:

November 25

Shawn: This is the recipe for both Gaia and Osmosis

1. check rpc status output
2. stop gaiad full node
3. check disk usage before pruning (tendermint experimental-compact-goleveldb)
4. prune database with `tendermint experimental-compact-goleveldb`
5. check disk usage after pruning
6. start gaiad full node

Anca: compiled the data into plots for osmosis (cosmos looks pretty much the same)

osmo-before-after-compact

Takeaways:

ancazamfir commented 1 year ago

With the following setup:

```toml
pruning = "custom"
pruning-keep-recent = "100"
pruning-keep-every = "0"
pruning-interval = "1"
min-retain-blocks = 100
```
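As a rough illustration of how these settings interact (a simplified model under stated assumptions, not the Cosmos SDK's exact retention logic, which also takes state-sync snapshots into account), the first height that must still be kept is driven by both the pruning window and the evidence parameters:

```go
package main

import "fmt"

// retainHeight is a simplified sketch of the first height that must still be
// kept (everything below it may be pruned), derived from min-retain-blocks and
// the evidence parameter max_age_num_blocks.
func retainHeight(latestHeight, minRetainBlocks, maxAgeNumBlocks int64) int64 {
	// Evidence handling requires keeping at least maxAgeNumBlocks recent blocks,
	// so the effective window is the larger of the two settings.
	window := maxAgeNumBlocks
	if minRetainBlocks > window {
		window = minRetainBlocks
	}
	h := latestHeight - window
	if h < 1 {
		return 0 // 0 means "retain everything"
	}
	return h
}

func main() {
	// With the setup above (keep the last 100 blocks) and a 100000-block evidence window:
	fmt.Println(retainHeight(500000, 100, 100000)) // -> 400000: the evidence window dominates
	fmt.Println(retainHeight(500000, 100, 0))      // -> 499900: only min-retain-blocks applies
}
```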

I wrote a script that gets du (kbytes) at every height, compacts every N blocks, and stores the data in .csv files. Then I plotted them with R. Here are a couple of results:

My summary at this point:

Notes: I modified the Tendermint experimental CLI to also compact application.db and tx_index.db. Scripts and some minimal info here
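A rough, hypothetical sketch of this kind of measure-and-compact loop (the paths, the fixed interval, and shelling out to `du` are assumptions for illustration; the actual script triggers per block height and is linked above):

```go
package main

import (
	"fmt"
	"log"
	"os"
	"os/exec"
	"path/filepath"
	"strings"
	"time"

	"github.com/syndtr/goleveldb/leveldb"
	"github.com/syndtr/goleveldb/leveldb/util"
)

// compactAll force-compacts the goleveldb databases that grow over time,
// including application.db and tx_index.db (which the stock experimental
// command does not touch).
func compactAll(dataDir string) {
	for _, name := range []string{"blockstore.db", "state.db", "tx_index.db", "application.db"} {
		db, err := leveldb.OpenFile(filepath.Join(dataDir, name), nil)
		if err != nil {
			log.Printf("skip %s: %v", name, err)
			continue
		}
		if err := db.CompactRange(util.Range{}); err != nil {
			log.Printf("compact %s: %v", name, err)
		}
		db.Close()
	}
}

func main() {
	dataDir := os.ExpandEnv("$HOME/.gaia/data") // assumed gaiad home
	csv, err := os.Create("du.csv")
	if err != nil {
		log.Fatal(err)
	}
	defer csv.Close()

	for i := 0; ; i++ {
		if i > 0 && i%10 == 0 {
			// goleveldb takes an exclusive lock: compaction assumes the node is
			// stopped at this point, or that we operate on a copy of the data dir.
			compactAll(dataDir)
		}
		if out, err := exec.Command("du", "-sk", dataDir).Output(); err == nil {
			kb := strings.Fields(string(out))[0]
			fmt.Fprintf(csv, "%s,%s\n", time.Now().Format(time.RFC3339), kb)
		}
		time.Sleep(time.Minute)
	}
}
```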

adizere commented 1 year ago

Last week, @ancazamfir, @thanethomson and I discussed next steps to continue this investigation.

One thing we decided was to get a better understanding of goleveldb's behaviour in a "vanilla" environment, i.e., separately from Tendermint's manipulation of the database, pruning configuration, etc. Specifically, we decided to write a simple script that inserts and deletes records in a goleveldb database, while observing how the size of the db fluctuates across the different insert/delete steps.
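The actual script is in the commit linked below; a condensed, hypothetical sketch of the same idea (different record counts, relying only on goleveldb's own background compaction) looks like this:

```go
package main

import (
	"fmt"
	"io/fs"
	"path/filepath"

	"github.com/syndtr/goleveldb/leveldb"
)

// dirSizeKB returns the on-disk size of the database directory in kilobytes.
func dirSizeKB(root string) int64 {
	var total int64
	filepath.WalkDir(root, func(_ string, d fs.DirEntry, err error) error {
		if err == nil && d.Type().IsRegular() {
			if info, err := d.Info(); err == nil {
				total += info.Size()
			}
		}
		return nil
	})
	return total / 1024
}

func main() {
	const dbPath = "vanilla-test.db"
	db, err := leveldb.OpenFile(dbPath, nil)
	if err != nil {
		panic(err)
	}
	defer db.Close()

	value := make([]byte, 100) // 100-byte values, mirroring small records

	insert := func(from, to int) {
		for i := from; i < to; i++ {
			db.Put([]byte(fmt.Sprintf("key-%08d", i)), value, nil)
		}
	}
	remove := func(from, to int) {
		for i := from; i < to; i++ {
			db.Delete([]byte(fmt.Sprintf("key-%08d", i)), nil)
		}
	}

	fmt.Printf("initial  %6d kB\n", dirSizeKB(dbPath))
	insert(0, 10000)
	fmt.Printf("insert   %6d kB\n", dirSizeKB(dbPath))
	remove(0, 9000)
	fmt.Printf("delete   %6d kB\n", dirSizeKB(dbPath))
	insert(10000, 20000)
	fmt.Printf("insert   %6d kB\n", dirSizeKB(dbPath))
	remove(9000, 20000)
	fmt.Printf("delete   %6d kB\n", dirSizeKB(dbPath))
}
```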

The result is here: https://github.com/tendermint/tendermint/commit/e60eca5169bbffe20b133a5e35c0b7ce457957cf

The summary can be seen in this output

Steps:
               name      size (kb)      records #
    {        initial               0               0}
    {         insert            1289           10000}
    {         delete            1562            1000}
    {         insert            2851           11000}
    {         delete            3123            2000}
    {         delete            3184               0}

Preliminary observations:

thanethomson commented 1 year ago

Thanks for this @adizere! Would it be possible to test instead with larger blobs of data written less frequently?

We still have an open issue in Tendermint to quantify the workloads encountered by a production node precisely (https://github.com/tendermint/tendermint/issues/9773), but from what I've gathered so far, the general write pattern in Tendermint is to dump very large objects to the store once every couple of seconds.

I have a feeling this may end up influencing the way that LevelDB reserves space in its internal data structures.
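A small, hypothetical variant of the earlier insert/delete sketch along those lines: write one multi-megabyte random value every couple of seconds, loosely imitating a block being stored, delete values that fall out of a pruning-style window, and watch the on-disk size (the blob size, window, and interval are assumptions):

```go
package main

import (
	"crypto/rand"
	"fmt"
	"time"

	"github.com/syndtr/goleveldb/leveldb"
)

func main() {
	db, err := leveldb.OpenFile("blob-test.db", nil)
	if err != nil {
		panic(err)
	}
	defer db.Close()

	const (
		blobSize      = 2 * 1024 * 1024 // ~2 MB per write, imitating a large block
		keepRecent    = 100             // mimic a pruning window
		writeEverySec = 2
	)

	blob := make([]byte, blobSize)
	for h := int64(1); ; h++ {
		rand.Read(blob) // random data, so it does not compress away
		key := []byte(fmt.Sprintf("block-%012d", h))
		if err := db.Put(key, blob, nil); err != nil {
			panic(err)
		}
		if h > keepRecent {
			// Delete the value that fell out of the window, as pruning would.
			old := []byte(fmt.Sprintf("block-%012d", h-keepRecent))
			db.Delete(old, nil)
		}
		time.Sleep(writeEverySec * time.Second)
	}
}
```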

jmalicevic commented 1 year ago

I am looking at storage workloads for Tendermint and came back to this thread. While the relative growth rate of tx_index.db is not very large, it seems to me that the overall % of storage it takes up is. I will double-check this in the code (I took a quick look now), but if we prune blocks, we do not seem to prune the index. The index stores the following:

Note that the tendermint kvstore application does not generate block events. The indexer patch for 0.37.1 enables event generation if a flag is set.
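For illustration, this is roughly how an ABCI app generates indexable events (Tendermint v0.34-style types are assumed; in v0.37 the attribute fields are strings, and the event type and attributes below are made up). With indexer = "kv", every indexed attribute adds entries to tx_index.db on top of the tx.hash and tx.height keys that are always written:

```go
package main

import (
	"log"

	abciserver "github.com/tendermint/tendermint/abci/server"
	abcitypes "github.com/tendermint/tendermint/abci/types"
)

// eventfulApp extends the no-op base app with a DeliverTx that emits events,
// so that the kv indexer has something to index for every transaction.
type eventfulApp struct {
	abcitypes.BaseApplication
}

func (eventfulApp) DeliverTx(req abcitypes.RequestDeliverTx) abcitypes.ResponseDeliverTx {
	return abcitypes.ResponseDeliverTx{
		Code: 0,
		Events: []abcitypes.Event{
			{
				Type: "transfer", // hypothetical event type
				Attributes: []abcitypes.EventAttribute{
					{Key: []byte("sender"), Value: []byte("alice"), Index: true},
					{Key: []byte("recipient"), Value: []byte("bob"), Index: true},
				},
			},
		},
	}
}

func main() {
	// Served the same way as the earlier retain-height sketch.
	srv := abciserver.NewSocketServer("tcp://127.0.0.1:26658", eventfulApp{})
	if err := srv.Start(); err != nil {
		log.Fatal(err)
	}
	select {} // serve until killed
}
```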

Also, I understand that this in itself may not be related to the problem of why compaction is not working, but I think it relates to the discussion here.

ancazamfir commented 1 year ago

Thanks @jmalicevic! I think the issue title might be wrong. IMO we are trying to figure out why we cannot have relatively constant disk usage over time. A good product should be able to run for a long time without running out of disk space. But what we see:

On the last point, I did some gaia tests with no transactions (empty blocks) and used a 100-block pruning window for both application state and blocks. One would expect the disk space of states [H, H+100) to stay relatively constant as H increases. But this is what we see:

@thanethomson brought up the fact that a state attribute might naturally grow in size at higher heights; the height itself is one example, and therefore it is inevitable to see some state size increase over time.

adizere commented 1 year ago

Fixed. Investigation continues in CometBFT.