
Testing of LFS #3098

Closed tgrospic closed 3 years ago

tgrospic commented 4 years ago

Overview

This issue records several rounds of testing of the Last Finalized State (LFS) pre-release versions. The tests are mostly performance oriented, and some bug fixes and optimizations were done between each test setup.

Results are listed starting from the most recent setup. They show how sync speed increases on stronger machines. This is due to the additional computation needed for verification of the Rholang state (#3145), which shows up as high CPU usage during sync.

Another important factor is disk speed. All tests were done on SSD disks, and on stronger machines faster disks reduce the overall sync duration.

Testing setup 4

This round is almost identical to setup 3, with an additional optimization in receiving blocks (#3243), which includes block sorting as part of the download, and with testing on a 32 vCPU machine.

RNode Docker image version: rchain/rnode:v0.9.26-rc

| Nr. | Sync time (fully caught up) | Sync Rholang state | Sync blocks | VPS | Log |
|----|----|----|----|----|----|
| 1 | 2.25 h | 1.5 h | 1.75 h | 32 vCPU 64 GB (DigitalOcean) | test-4-do-32vcpu-64g.zip |
| 2 | 3.25 h | 1.5 h | 2 h | 16 vCPU 32 GB CPX51 (Hetzner) | test-4-cpx51-32g.zip |
| 3 | 3 h | 1.75 h | 2 h | 8 vCPU 32 GB (IBM) | test-4-ibm-8vcpu-32g.zip |
| 4 | 4 h | 2.25 h | 2.5 h | 8 vCPU 16 GB CPX41 (Hetzner) | test-4-cpx41-16g.zip |
| 5 | 8 h | 4.5 h | 3.5 h | 4 vCPU 8 GB CPX31 (Hetzner) | test-4-cpx31-8g.zip |

NOTE: CPU and memory usage observed in testing are the same as in setup 3; only the overall duration is shorter.

[1] CPU, memory - 32vCPU 64GB (DigitalOcean) test-4-do-32vcpu-64g-lfs-progress-cpu

Testing setup 3

The first two rounds of testing showed that the disk read speed of the LFS source node can make a significant difference in how fast other nodes download the Rholang state. In this testing setup, to make reading from disk much faster, the RSpace folder on the source node was moved to a RAM disk, which reduced the read time for a chunk of state sent over the network from ~20 sec to ~1 sec.
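For illustration, here is a minimal sketch of serving the RSpace folder from a RAM disk. The data path, mount point, and tmpfs size below are assumptions, not the exact values used in this test:

```sh
# Sketch only: the RSpace path (/var/lib/rnode/rspace) and tmpfs size are assumptions.
# Stop the node (or its Docker container) before moving the data.
sudo mkdir -p /mnt/rspace-ram
sudo mount -t tmpfs -o size=16g tmpfs /mnt/rspace-ram
sudo rsync -a /var/lib/rnode/rspace/ /mnt/rspace-ram/
# Bind-mount the RAM copy over the original location so the node reads from RAM.
sudo mount --bind /mnt/rspace-ram /var/lib/rnode/rspace
# NOTE: tmpfs contents are lost on reboot; sync the data back to disk when done.
```

The bind mount keeps the node's configured data path unchanged while the underlying storage is tmpfs, so no rnode configuration needs to change.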

RNode Docker image version: tgrospic/rnode:v0.9.26-rc

NOTE: The memory leak on network errors is still not resolved; one test on an 8 GB machine failed and is not listed in the results.

| Nr. | Sync time (fully caught up) | VPS | Log |
|----|----|----|----|
| 1 | 4 hours (2.5 h blocks) | 16 vCPU 32 GB CPX51 | test-3-cpx51-32g-full-ok-3.zip |
| 2 | 5.5 hours (3.5 h blocks) | 8 vCPU 16 GB CPX41 | test-3-cpx41-16g-full-ok-1.zip |
| 3 | 9 hours (4.5 h blocks) | 4 vCPU 8 GB CPX31 | test-3-cpx31-8g-full-ok-1.zip |

Observations from testing

[1] CPU, memory - 16vCPU 32GB CPX51 test-3-cpx51-32g-lfs-progress-cpu

[1] Direct buffer - 16vCPU 32GB CPX51 test-3-cpx51-32g-lfs-progress-buffer

[2] CPU, memory - 8vCPU 16GB CPX41 test-3-16g-lfs-progress-cpu

[3] CPU, memory - 4vCPU 8GB CPX31 test-3-8g-lfs-progress-cpu

Survived network errors without crashing - 4vCPU 8GB CPX31 test-3-8g-lfs-progress-3-errors-cpu

Testing setup 2

RNode Docker image version: tgrospic/rnode:v0.9.26-beta

| Nr. | Source | Duration | Sync time (fully caught up) | Environment |
|----|----|----|----|----|
| 1 | Full | 12.5 hours (2.5 h blocks) | - | One machine + 3 active nodes |
| 2 | LFS | ERROR, interrupted (rnode-restart-error.zip) | - | Target alone on another machine (Docker on Hetzner cloud CPX31) |
| 3 | LFS [2] | 4.5 hours (2.5 h blocks) (rnode-test-3.zip) | 8 hours | Target alone on another machine (Docker on Hetzner cloud CPX31) |
| 4 | LFS [3] | 6 hours (2 h blocks) (rnode-test-4.zip) | 10 hours | Target alone on another machine (Docker on Hetzner cloud CX31) |

Observations from testing

[3] After the node received the LFS, it did not respond to requests for the tuple space (StateItems). After a restart it works as expected.

[2] CPU, memory image

[2] Direct buffer memory image

Testing setup 1

~Trie traversal is slow when the number of records is bigger than 3,000, and it rises faster than linearly.~ Resolved in PR #3099.

LFS performance should be tested with the following configuration.

RNode Docker image for all nodes: rchain/rnode:v0.9.26-alpha

  1. Observer node with full state
  2. Observer node with empty state (bootstrap from 1.)
  3. Observer node with empty state (bootstrap from 2.)

~Observer nodes with empty state (2. and 3.) must have the direct buffer memory limit raised to 3 GB: `-XX:MaxDirectMemorySize=3g`.~ The direct buffer memory leak is fixed in the transport layer (#3239), so the limit can now be much lower: `-XX:MaxDirectMemorySize=200m`.

The first operation is to test syncing the LFS from the full node (1. -> 2.). When this is done, the second sync should use that node, now holding the trimmed state, as the source for syncing the third node (2. -> 3.). A rough sketch of the setup follows.
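This is a minimal sketch of wiring the three observer nodes together, assuming the node IDs, hostnames, and volume paths shown are placeholders, that the launcher forwards `-XX...` options to the JVM as the note above implies, and that any observer-specific flags (omitted here) are added as needed:

```sh
# Node 1: already holds the full state and acts as the LFS source.
docker run -d --name node1 -v /data/node1:/var/lib/rnode \
  rchain/rnode:v0.9.26-alpha run

# Node 2: empty state, bootstraps (downloads LFS) from node 1.
# <node1-id> is the node ID from node 1's log; node1.example.org is a placeholder host.
docker run -d --name node2 -v /data/node2:/var/lib/rnode \
  rchain/rnode:v0.9.26-alpha \
  -XX:MaxDirectMemorySize=200m \
  run --bootstrap "rnode://<node1-id>@node1.example.org?protocol=40400&discovery=40404"

# Node 3: empty state, bootstraps from node 2 only after node 2 has fully caught up,
# so the trimmed (last finalized) state is exercised as the source.
docker run -d --name node3 -v /data/node3:/var/lib/rnode \
  rchain/rnode:v0.9.26-alpha \
  -XX:MaxDirectMemorySize=200m \
  run --bootstrap "rnode://<node2-id>@node2.example.org?protocol=40400&discovery=40404"
```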

| Operation | Expected | Measured |
|----|----|----|
| 1. -> 2. | 4-6 hours | 7.5 hours (one machine + 1 active node); 11 hours (one machine + 4 active nodes) |
| 2. -> 3. | 1-3 hours | 4 hours (one machine + 4 active nodes) |
tgrospic commented 3 years ago

The LFS branch is merged to dev and the final release is published: https://github.com/rchain/rchain/releases/tag/v0.10.0