near / nearcore

Reference client for NEAR Protocol
https://near.org
GNU General Public License v3.0

Disable RocksDB `cache_index_and_filter_blocks` for State and FlatState columns #9319

Open pugachAG opened 1 year ago

pugachAG commented 1 year ago

Filter and index blocks are accessed by RocksDB before reading the actual data, see this wiki page for more information. This makes it critical that those blocks are cached, especially considering that filter and index blocks are generally much larger than data blocks: source.

Currently we set the RocksDB `cache_index_and_filter_blocks` config option to true for all our columns via `block_opts.set_cache_index_and_filter_blocks(true)`. The State and FlatState block cache size is set via the `col_state_cache_size` config with a default of 512MB.
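For context, here is a minimal sketch of how those options fit together with the rust-rocksdb crate (the cache size is the `col_state_cache_size` default mentioned above; the function name is illustrative rather than the actual nearcore code, and depending on the crate version `Cache::new_lru_cache` may return a `Result`):

```rust
use rocksdb::{BlockBasedOptions, Cache, Options};

// Sketch of the current behaviour: index and filter blocks are stored in the
// same shared block cache as data blocks and compete with them for space.
fn current_state_options(cache_size_bytes: usize) -> Options {
    let mut block_opts = BlockBasedOptions::default();
    block_opts.set_block_cache(&Cache::new_lru_cache(cache_size_bytes));
    block_opts.set_cache_index_and_filter_blocks(true);

    let mut opts = Options::default();
    opts.set_block_based_table_factory(&block_opts);
    opts
}

// Usage sketch: 512MB, the col_state_cache_size default quoted above.
// let opts = current_state_options(512 << 20);
```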

The total filter and index block size is tracked by the RocksDB `estimate-table-readers-mem` property, which is exposed by neard to Prometheus as the `near_rocksdb_estimate_table_readers_mem` metric (note that this only accounts for filter and index blocks when `cache_index_and_filter_blocks` is disabled; otherwise those blocks live in the block cache). Current sizes on a mainnet node are ~800MB for State and ~300MB for FlatState.
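For reference, a minimal sketch of how such a number can be sampled from RocksDB with rust-rocksdb (the property name is the standard RocksDB one; the column family name `"col_state"` is a placeholder, not necessarily the handle neard uses):

```rust
use rocksdb::DB;

// Estimated memory used by table readers (index and filter blocks held outside
// the block cache) for one column family, in bytes.
fn table_readers_mem(db: &DB) -> Option<u64> {
    let cf = db.cf_handle("col_state")?; // placeholder column family name
    db.property_int_value_cf(cf, "rocksdb.estimate-table-readers-mem")
        .ok()
        .flatten()
}
```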

Currently the size of filter and index blocks for the State column is greater than the size of the block cache. This results in these blocks being constantly evicted from the block cache. Interestingly this doesn't result in performance degradation, since filter and index blocks are still cached in the disk page cache by the OS. But this is a fragile setup: we can expect a significant performance degradation if the OS no longer has enough free RAM to cache pages for all State filter and index blocks.

One potential solution is to increase the block cache size for the State column. While that solves the problem for now, it is still suboptimal because it has to be adjusted as the data changes. The alternative proposed solution is to disable storing filter and index blocks in the block cache for the State/FlatState columns and let RocksDB keep them on the heap. This effectively makes RocksDB use unbounded memory for these blocks, which is OK in our case since read performance is critical to ensure reasonable block processing time. On top of that we can use the RocksDB RAM usage dashboard to monitor memory usage for filter and index blocks.
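A minimal sketch of what the proposed per-column toggle could look like with rust-rocksdb (the `Col` enum and the helper are illustrative stand-ins for nearcore's actual column handling, not the real code):

```rust
use rocksdb::{BlockBasedOptions, Cache};

/// Illustrative column identifiers; nearcore's real column enum is much larger.
enum Col {
    State,
    FlatState,
    Other,
}

// Proposal sketch: State and FlatState keep index/filter blocks on the RocksDB
// heap (unbounded but observable via estimate-table-readers-mem), while all
// other columns keep the existing behaviour.
fn block_opts_for(col: Col, cache_size_bytes: usize) -> BlockBasedOptions {
    let mut opts = BlockBasedOptions::default();
    opts.set_block_cache(&Cache::new_lru_cache(cache_size_bytes));
    let cache_index_and_filter = !matches!(col, Col::State | Col::FlatState);
    opts.set_cache_index_and_filter_blocks(cache_index_and_filter);
    opts
}
```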

Disabling `cache_index_and_filter_blocks` will result in a 800+300=1200MB memory usage increase by RocksDB, so we need to adjust our block cache accordingly. We effectively don't cache any data blocks for the State column right now, since filter and index blocks are cached with a higher priority than data blocks, so setting a default of 32MB for State shouldn't result in any performance degradation. FlatState is a bit different since we heavily rely on data locality there to ensure low read latency. We can keep the FlatState block cache size as the difference between the current block cache size and the total size of index and filter blocks: ~256MB. So the overall memory usage increase for a mainnet node is ~300MB. It also makes sense to evaluate different block cache sizes for State/FlatState. TODO(@jbajic): create a separate issue for that.
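Continuing the sketch above, the sizing proposed in this paragraph could look roughly like this (the values are the ones quoted in this issue, not actual nearcore defaults, and `Col` is the illustrative enum from the previous snippet):

```rust
// Proposed per-column block cache sizes once index and filter blocks no longer
// live in the block cache (approximate values quoted in this issue).
fn proposed_cache_size_bytes(col: Col) -> usize {
    match col {
        // Data blocks were effectively not cached before, so a small cache suffices.
        Col::State => 32 << 20,
        // Roughly the current 512MB cache minus the index/filter share estimated above.
        Col::FlatState => 256 << 20,
        // Other columns keep whatever they use today (placeholder value).
        Col::Other => 512 << 20,
    }
}
```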

pugachAG commented 1 year ago

One hacky way to confirm that the current State filter and index size being greater than the block cache size is a problem is to drop the OS page cache in a loop while the node is running: something like `while true; do sudo sh -c 'echo 3 >/proc/sys/vm/drop_caches'; done`. This results in avg State read latency increasing from 800us to 2.5ms. Doing the same when `cache_index_and_filter_blocks` is disabled doesn't result in such a drastic latency increase.

walnut-the-cat commented 1 year ago

Is there a guideline on when to communicate memory usage increases to validators? cc @gmilescu

jbajic commented 1 year ago

As suggested in the first idea, one possible improvement for the State column is disabling `cache_index_and_filter_blocks` and seeing if it improves performance, then fine-tuning the block cache size. This was done on a GCP node and the results can be seen here. They show that the block cache size for the State column plays no part in the state-perf benchmark, which makes sense since that benchmark is designed to test random unique reads across the State column. Additionally, a node was set up to watch the performance of the State column over a limited time; its results are in the images below.

The second hypothesis is that adjusting the block size should yield better performance. To test this we ran multiple benchmarks with different block sizes (results can be found here) and developed a compaction tool. The idea is that smaller block sizes make random fetching of blocks faster, at least for the State column where we do not rely on data locality at all. The synthetic benchmark showed that a 4KiB block size is optimal, but running it on a node and watching metrics in Grafana showed no big impact on performance, possibly because the whole database was not compacted beforehand.
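For completeness, a minimal sketch of the block-size knob exercised in this experiment (rust-rocksdb again; the 4KiB value is the one from the benchmark, the rest is illustrative). Note that an existing database only picks up a new block size for SST files written afterwards, which is why a full compaction is needed for a fair comparison:

```rust
use rocksdb::BlockBasedOptions;

// Sketch: a 4KiB block size for the State column. Smaller blocks reduce read
// amplification for random point lookups, at the cost of larger index blocks.
fn state_block_opts() -> BlockBasedOptions {
    let mut opts = BlockBasedOptions::default();
    opts.set_block_size(4 * 1024);
    opts
}
```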

Latency and request count on master master-6h-mainnet-1

Latency and request count without State block cache and 4KiB block size (no compaction run) no_block_cache_block_size_4KiB_6h_mainnet-1

Latency and request count without State block cache and 16KiB block size (no compaction run) no_block_cache_block_size_16KiB_6h_mainnet-1

Average and median display on master from RocksDB metrics master-6h-mainnet-1-avg-vs-med

Average and median display without State block cache and block size 4KiB from RocksDB metrics no_block_cache_block_size_4KiB_6h-mainnet-1_med_vs_lat

Average and median display without State block cache and block size 16KiB from RocksDB metrics no_block_cache_block_size_16KiB_6h-mainnet-1_med_vs_lat

Overall we can see a rise in both average and median latencies, for both 4KiB and 16KiB block sizes. I assume there is a benefit to using a block cache in production which is not visible in the synthetic workload.

jbajic commented 1 year ago

The issue with finding the best block cache size for the FlatState column is that FlatState benefits from data locality, so it is not easy to create a perf tool similar to state-perf and expect representative results. After disabling `cache_index_and_filter_blocks` and setting the block cache size to different values we can observe the following results:

Average and median display on FlatState on master fs_master_avg-vs_med-mainnet-2

Average and median display on FlatState with block cache 128 MiB and disabled cache_index_and_filter_blocks fs_block_cache_8MiB_avg-vs_med-mainnet-2

Average and median display on FlatState with block cache 256 MiB and disabled cache_index_and_filter_blocks fs_block_cache_56MiB_avg-vs_med-mainnet-2

We can see that the improvement for the FlatState column is noticeable and is best in the third case, with a block cache size of 256 MiB, where both average and median latency drop by more than 10 microseconds.