(HashConf) run deep reconnect tests 10+ nodes

alex-kuzmin-hg commented 1 month ago

Per HashConf 2024 brainstorming sessions:

Artem Ananev 8:17 AM

Hi Alex and the team. Let me try to summarize what we discussed about reconnect testing in the last few days:

Step 0: configure a new network in latitude to run consensus nodes and Oleg’s load generator. The more nodes the better, but more than 10-11 probably wouldn’t add much value. At this step the load generator can be used to generate a state (accounts and tokens), and after that to start NFT transfers

Step 1: run develop with 40/40 state and TPS limited to 5K. Only a small fraction of these 40M accounts should be hot, like 1M or even less; the other accounts will not be very active. Please check with Oleg how to configure that part. The nodes should be running stably at this point

Step 1a: after the state is fully generated, and NFT transfers are in progress at 5K for a few minutes (e.g. 10 mins), shut down one node and start it back in 10 mins. This will make the node start a reconnect process. It would be great to have this stop/restart process automated, since this is a crucial part of reconnect testing

Step 1b: if reconnect is successful, repeat step 1a a few times

The next steps will depend on steps 1/1a results. If the node is able to reconnect, we will increase the TPS (ideally, to 10K) and/or increase state size (to 100M, ideally to 1B) and/or increase node shutdown period (15 mins, 30 mins, 1 hour, 3 hours). If reconnects fail, we will need to check why. It could be because of reconnects themselves, or because of the health monitor, or something entirely different

Once Oleg prepares a small fix for the health monitor to lower its resolution (to run every 1ms instead of 100ms) as we discussed, it will make sense to use that branch for testing. It should help a little bit with the final “catching up” part of the reconnect process

Once my changes for QueueNode and in-memory virtual maps are available, it will also make sense to test them, since we expect they will have a positive impact on reconnects (the reconnect part)

Does it look like a good plan? Any comments? Thanks!
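
Since the plan asks for the stop/restart cycle to be automated, here is a rough Python sketch of what such a driver could look like. It is not the script actually used in these runs. The names `run_reconnect_cycles`, `CycleResult`, `stop_node`, `start_node` and `is_active` are all hypothetical; the three callables are assumed to wrap whatever the deployment uses to stop a node, start it again, and check whether it has reached ACTIVE.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class CycleResult:
    downtime_s: float   # how long the node was kept down
    reconnected: bool   # True if the node reported ACTIVE before the timeout


def run_reconnect_cycles(
    stop_node: Callable[[], None],    # hypothetical hook: shuts one node down
    start_node: Callable[[], None],   # hypothetical hook: starts it again
    is_active: Callable[[], bool],    # hypothetical hook: True once node is ACTIVE
    downtimes_s: Sequence[float],     # one shutdown period per cycle
    active_timeout_s: float,          # how long to wait for ACTIVE after restart
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[CycleResult]:
    """Stop a node, wait, restart it, and wait for it to become ACTIVE.

    Stops at the first failed reconnect so the failure can be investigated
    before escalating to longer shutdown periods.
    """
    results: List[CycleResult] = []
    for downtime in downtimes_s:
        stop_node()
        sleep(downtime)
        start_node()

        deadline = clock() + active_timeout_s
        reconnected = False
        while clock() < deadline:
            if is_active():
                reconnected = True
                break
            sleep(poll_s)

        results.append(CycleResult(downtime, reconnected))
        if not reconnected:
            break
    return results
```

With the shutdown periods from the plan, `downtimes_s` could be `[600, 900, 1800, 3600, 10800]` (10 mins, 15 mins, 30 mins, 1 hour, 3 hours). The `sleep` and `clock` parameters are injectable only so the loop can be dry-run without waiting in real time.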

alex-kuzmin-hg commented 1 month ago

Results: latest "develop" (e.g. 1619a5bde1584e3e1a4427a7d341dfd5f766a7b3 at the time of the experiment) - the learner never catches up to the ACTIVE state before the test finishes.

Artem's branch - CATASTROPHIC_FAILURE

alex-kuzmin-hg commented 1 month ago

10-node cluster is set up (Nathan!)

start: helm install nlg oci://swirldslabs.jfrog.io/load-generator-helm-release-local/network-load-generator --version 0.2.1 --values nlg-values.yaml -n solo-alex-kuzmin

watch: kubectl logs -n solo-alex-kuzmin nlg-network-load-generator-7994f9fd98-fkq5s -f

alex-kuzmin-hg commented 1 month ago

Done: Set up 5 Longitude Solo clusters, scripted deployment of individual branches of Product and NFT tests, scripted Reconnect test etc. Reporting to the team starting at: https://swirldslabs.slack.com/archives/C06B0QQQ6MR/p1729120988583579 and summary of first full run up to 1B/1B: https://swirldslabs.slack.com/archives/C06B0QQQ6MR/p1729556066239499

alex-kuzmin-hg commented 1 month ago

Reconnect, specifically: NS=solo-alex-kuzmin, 50M, TPS=1K (explored from 5K down to 1K)

hashgraph / hedera-services

(HashConf) run deep reconnect tests 10+ nodes #15608