near / nearcore

Reference client for NEAR Protocol
https://near.org
GNU General Public License v3.0

Investigate relationship between state size and performance on FT benchmark #11729

Closed by akashin 9 hours ago

akashin commented 1 week ago

We know that contract state size influences the performance of the chain because it affects the latency of smart contract operations that interact with the disk. This is especially noticeable on shards 2 and 5 on mainnet.

We also ran some initial experiments with the in-memory trie enabled (https://github.com/near/nearcore/issues/11458), which showed that with a small contract state it has no noticeable effect on performance.

The goal here is to understand what FT contract state size is sufficient to reproduce the performance profile we see on mainnet. We can then investigate how much the in-memory trie improves performance. Finally, we can also investigate the effects of enabling/disabling various caches.
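For the in-memory trie comparison, one way to switch it on is through the node's config.json. This is only a sketch, assuming the nearcore build in use exposes the toggle as store.load_mem_tries_for_tracked_shards (the key name may differ between releases) and that jq is available; the config path refers to the localnet setup described in the next comment:

# Sketch: enable the in-memory trie for all tracked shards on the benchmark node.
# The config key is an assumption; verify it against the nearcore version under test.
CONFIG=.near/localnet/node0/config.json
jq '.store.load_mem_tries_for_tracked_shards = true' "$CONFIG" > "$CONFIG.tmp" && mv "$CONFIG.tmp" "$CONFIG"
# Restart the node afterwards for the change to take effect.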

akashin commented 1 week ago

Here are the instructions to run the benchmark with state:

# Setup for the node.
gcloud compute ssh --zone "europe-west4-a" "ubuntu@crt-benchmark-ft-092d" --project "nearone-crt"
# Generate the localnet config, then stop the node so we can swap in the prepared state.
nearup run localnet --binary-path /home/ubuntu --num-nodes 1 --num-shards 1 --verbose && nearup stop
# Use the prepared config and genesis for node0.
cp .near/config.json .near/localnet/node0
cp .near/genesis.json .near/localnet/node0
# Replace the empty database with a symlink to the pre-populated FT state.
rm -r .near/localnet/node0/data
ln -s /home/ubuntu/.near/data .near/localnet/node0/data
nearup run localnet --binary-path /home/ubuntu --num-nodes 1 --num-shards 1 --verbose
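
To confirm that the node came up on top of the symlinked state before starting the load, it can help to query the RPC status endpoint. A minimal sketch, assuming jq is installed (the /status response includes a sync_info block with the latest block height):

# Check that the node is serving RPC and producing blocks on the restored state.
curl -s 127.0.0.1:3030/status | jq '.sync_info.latest_block_height, .sync_info.syncing'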

# Setup for the traffic generator.
gcloud compute ssh --zone "europe-west4-b" "ubuntu@crt-benchmark-ft-traffic" --project "nearone-crt"
tmux new -s load
sudo apt install socat
# Forward the local RPC port to the benchmark node's internal IP, then confirm the node is reachable.
socat TCP-LISTEN:3030,fork,reuseaddr TCP:10.164.15.201:3030
curl 127.0.0.1:3030/status

# Run the locust FT load test against the forwarded RPC endpoint.
export KEY=/home/ubuntu/validator_key.json
cd nearcore/pytest/tests/loadtest/locust/
locust -H 127.0.0.1:3030 \
       -f locustfiles/ft.py \
       --funding-key=$KEY --fixed-contract-names --num-ft-contracts=1 --num-passive-users=0 --max-workers=8 -u 500 -r 100 --headless --processes 8
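
While the load is running, it may also be useful to sample the node's Prometheus metrics to correlate throughput with storage latency. A rough sketch, assuming neard exposes /metrics on the RPC port and that metrics matching the patterns below exist in the version being benchmarked (the grep patterns are illustrative only):

# Sample storage-related metrics every 10 seconds during the run.
watch -n 10 'curl -s 127.0.0.1:3030/metrics | grep -iE "rocksdb|trie" | head -n 40'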
akashin commented 9 hours ago

I finished the experiments with a large FT contract state (45M users, 5.6 GB).

Overall, the state we have is big enough to expose database latencies, so I think we can close this issue.