
Testing of LFS #3098

Closed tgrospic closed 3 years ago

tgrospic commented 4 years ago

Overview

This issue records several rounds of testing of the Last Finalized State (LFS) pre-release versions. The tests are mostly performance oriented, and some bug fixes and optimizations were done between each test setup.

Results are listed starting from the most recent setup. They show how sync speed increases on stronger machines. This is due to the additional computation needed for verification of the Rholang state (#3145), which shows up as high CPU usage during sync.

Another important factor is disk speed. All tests were done on SSD disks, and on stronger machines faster disks reduce the overall sync duration.

Testing setup 4

This round is almost identical to setup 3, with an additional optimization in receiving blocks (#3243), which includes block sorting as part of the download, and with testing on a 32 vCPU machine.

RNode Docker image version: rchain/rnode:v0.9.26-rc

| Nr. | Sync time (fully caught up) | Sync Rholang state | Sync blocks | VPS | Log |
|----|----|----|----|----|----|
| 1 | 2.25 h | 1.5 h | 1.75 h | 32 vCPU 64 GB (DigitalOcean) | test-4-do-32vcpu-64g.zip |
| 2 | 3.25 h | 1.5 h | 2 h | 16 vCPU 32 GB CPX51 (Hetzner) | test-4-cpx51-32g.zip |
| 3 | 3 h | 1.75 h | 2 h | 8 vCPU 32 GB (IBM) | test-4-ibm-8vcpu-32g.zip |
| 4 | 4 h | 2.25 h | 2.5 h | 8 vCPU 16 GB CPX41 (Hetzner) | test-4-cpx41-16g.zip |
| 5 | 8 h | 4.5 h | 3.5 h | 4 vCPU 8 GB CPX31 (Hetzner) | test-4-cpx31-8g.zip |

NOTE: CPU and memory usage observed in testing are the same as in setup 3; only the overall duration is shorter.

[1] CPU, memory - 32vCPU 64GB (DigitalOcean) test-4-do-32vcpu-64g-lfs-progress-cpu

Testing setup 3

The first two rounds of testing showed that the disk read speed of the LFS source node can make a significant difference in how fast other nodes download the Rholang state. In this testing setup, to make reading from disk much faster, the RSpace folder on the source node was moved to a RAM disk, which reduced the read time for a chunk of state sent over the network from ~20 sec to ~1 sec.
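For illustration, here is a minimal sketch of serving the RSpace folder from a RAM disk. The data path, mount point, and tmpfs size below are assumptions, not the exact values used in this test:

```sh
# Sketch only: the RSpace path (/var/lib/rnode/rspace) and tmpfs size are assumptions.
# Stop the node (or its Docker container) before moving the data.
sudo mkdir -p /mnt/rspace-ram
sudo mount -t tmpfs -o size=16g tmpfs /mnt/rspace-ram
sudo rsync -a /var/lib/rnode/rspace/ /mnt/rspace-ram/
# Bind-mount the RAM copy over the original location so the node reads from RAM.
sudo mount --bind /mnt/rspace-ram /var/lib/rnode/rspace
# NOTE: tmpfs contents are lost on reboot; sync the data back to disk when done.
```

The bind mount keeps the node's configured data path unchanged while the underlying storage is tmpfs, so no rnode configuration needs to change.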

RNode Docker image version: tgrospic/rnode:v0.9.26-rc

NOTE: The memory leak on network errors is still not resolved; one test on an 8 GB machine failed and is not listed in the results.

| Nr. | Sync time (fully caught up) | VPS | Log |
|----|----|----|----|
| 1 | 4 hours (2.5 h blocks) | 16 vCPU 32 GB CPX51 | test-3-cpx51-32g-full-ok-3.zip |
| 2 | 5.5 hours (3.5 h blocks) | 8 vCPU 16 GB CPX41 | test-3-cpx41-16g-full-ok-1.zip |
| 3 | 9 hours (4.5 h blocks) | 4 vCPU 8 GB CPX31 | test-3-cpx31-8g-full-ok-1.zip |

Observations from testing

[1] CPU, memory - 16vCPU 32GB CPX51 test-3-cpx51-32g-lfs-progress-cpu

[1] Direct buffer - 16vCPU 32GB CPX51 test-3-cpx51-32g-lfs-progress-buffer

[2] CPU, memory - 8vCPU 16GB CPX41 test-3-16g-lfs-progress-cpu

[3] CPU, memory - 4vCPU 8GB CPX31 test-3-8g-lfs-progress-cpu

Survived network errors without crashing - 4vCPU 8GB CPX31 test-3-8g-lfs-progress-3-errors-cpu

Testing setup 2

RNode Docker image version: tgrospic/rnode:v0.9.26-beta

| Nr. | Source | Duration | Sync time (fully caught up) | Environment |
|----|----|----|----|----|
| 1 | Full | 12.5 hours (2.5 h blocks) | - | One machine + 3 active nodes |
| 2 | LFS | ERROR, interrupted (rnode-restart-error.zip) | - | Target alone on another machine (Docker on Hetzner cloud CPX31) |
| 3 | LFS [2] | 4.5 hours (2.5 h blocks) (rnode-test-3.zip) | 8 hours | Target alone on another machine (Docker on Hetzner cloud CPX31) |
| 4 | LFS [3] | 6 hours (2 h blocks) (rnode-test-4.zip) | 10 hours | Target alone on another machine (Docker on Hetzner cloud CX31) |

Observations from testing

[3] After the node received the LFS, it did not respond to requests for the tuple space (StateItems). After a restart it works as expected.

[2] CPU, memory image

[2] Direct buffer memory image

Testing setup 1

~Trie traversal is slow when the number of records is bigger than 3,000, and it rises faster than linearly.~ Resolved in PR #3099.

LFS performance should be tested with the following configuration.

RNode Docker image for all nodes: rchain/rnode:v0.9.26-alpha

  1. Observer node with full state
  2. Observer node with empty state (bootstrap from 1.)
  3. Observer node with empty state (bootstrap from 2.)

~Observer nodes with empty state (2. and 3.) must have the direct buffer memory limit raised to 3 GB: `-XX:MaxDirectMemorySize=3g`.~ The direct buffer memory leak is fixed in the transport layer (#3239), so the limit can now be much lower: `-XX:MaxDirectMemorySize=200m`.

The first operation is to test syncing the LFS from the full node (1. -> 2.). When this is done, the second sync should use that node, now holding the trimmed state, as the source for syncing the third node (2. -> 3.). A rough sketch of the setup follows.
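This is a minimal sketch of wiring the three observer nodes together, assuming the node IDs, hostnames, and volume paths shown are placeholders, that the launcher forwards `-XX...` options to the JVM as the note above implies, and that any observer-specific flags (omitted here) are added as needed:

```sh
# Node 1: already holds the full state and acts as the LFS source.
docker run -d --name node1 -v /data/node1:/var/lib/rnode \
  rchain/rnode:v0.9.26-alpha run

# Node 2: empty state, bootstraps (downloads LFS) from node 1.
# <node1-id> is the node ID from node 1's log; node1.example.org is a placeholder host.
docker run -d --name node2 -v /data/node2:/var/lib/rnode \
  rchain/rnode:v0.9.26-alpha \
  -XX:MaxDirectMemorySize=200m \
  run --bootstrap "rnode://<node1-id>@node1.example.org?protocol=40400&discovery=40404"

# Node 3: empty state, bootstraps from node 2 only after node 2 has fully caught up,
# so the trimmed (last finalized) state is exercised as the source.
docker run -d --name node3 -v /data/node3:/var/lib/rnode \
  rchain/rnode:v0.9.26-alpha \
  -XX:MaxDirectMemorySize=200m \
  run --bootstrap "rnode://<node2-id>@node2.example.org?protocol=40400&discovery=40404"
```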

| Operation | Expected | Measured |
|----|----|----|
| 1. -> 2. | 4-6 hours | 7.5 hours (one machine + 1 active node); 11 hours (one machine + 4 active nodes) |
| 2. -> 3. | 1-3 hours | 4 hours (one machine + 4 active nodes) |
tgrospic commented 3 years ago

The LFS branch is merged to dev and the final release is published: https://github.com/rchain/rchain/releases/tag/v0.10.0