nervosnetwork / ckb-light-client

CKB light client reference implementation
MIT License

Peculiar waves of disconnects on VM hosting 100 light clients. #182

Open jordanmack opened 6 months ago

jordanmack commented 6 months ago

I increased the number of testnet light clients on my VM from 10 to 100. All clients run the same configuration and are started at the same time. I also put together a small monitoring program that checks for the following:

It then sleeps for 60s before repeating.
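The list of checks is not reproduced above, so the following is only a minimal sketch of what one scan pass could look like, assuming the monitor looks at each client's peer count and flags clients with fewer than 2 peers. `get_peer_count` is a hypothetical callable (in practice it might wrap an RPC call to each light client), and the report wording is illustrative.

```python
import time
from typing import Callable, Dict, List, Optional

MIN_PEERS = 2  # assumed threshold: clients with fewer peers are reported


def scan(client_ids: List[int], get_peer_count: Callable[[int], int]) -> Dict[int, List[int]]:
    """Group clients that have fewer than MIN_PEERS peers by their peer count."""
    low: Dict[int, List[int]] = {}
    for cid in client_ids:
        count = get_peer_count(cid)  # hypothetical lookup, e.g. via RPC
        if count < MIN_PEERS:
            low.setdefault(count, []).append(cid)
    return low


def report(low: Dict[int, List[int]]) -> List[str]:
    """Render one line per low peer count, lowest count first."""
    return [
        f"Clients with {count} peer(s) ({len(low[count])}): "
        + ", ".join(str(cid) for cid in low[count])
        for count in sorted(low)
    ]


def monitor(client_ids: List[int],
            get_peer_count: Callable[[int], int],
            interval: float = 60.0,
            rounds: Optional[int] = None) -> None:
    """Scan, print the report, sleep `interval` seconds, and repeat."""
    done = 0
    while rounds is None or done < rounds:
        print("Scan start.")
        for line in report(scan(client_ids, get_peer_count)):
            print(line)
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)
```

Passing the peer-count lookup in as a function keeps the scan logic testable without a running light client.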

There are a few interesting things in the logs:

It occurs to me that the testnet only has 34 full nodes online according to the node probe. Is there any logic in the full nodes that would cause ban waves if there are too many connections coming in?

Example config file. testnet99.toml.txt

Two days of monitor logs: output.log

jordanmack commented 6 months ago

I've created a new network test to gather more information.

quake commented 6 months ago

With the default configuration, one CKB full node can accept up to 125 - 8 = 117 inbound connections: https://github.com/nervosnetwork/ckb/blob/develop/resource/ckb.toml#L83-L84

Considering the small number of online full nodes on testnet, I think it's normal for a small number of light client nodes to fail to connect after you've started 100 light clients.
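As a rough back-of-the-envelope check of the numbers above, here is a small sketch. It assumes the two values in the subtraction correspond to the `max_peers` and `max_outbound_peers` settings in the linked `ckb.toml`; the helper names are made up for illustration.

```python
def inbound_slots(max_peers: int = 125, max_outbound_peers: int = 8) -> int:
    """Inbound connections one full node can accept (assumed default values)."""
    return max_peers - max_outbound_peers


def total_inbound_capacity(full_nodes: int,
                           max_peers: int = 125,
                           max_outbound_peers: int = 8) -> int:
    """Upper bound on inbound slots across `full_nodes` identical full nodes."""
    return full_nodes * inbound_slots(max_peers, max_outbound_peers)


if __name__ == "__main__":
    print(inbound_slots())             # slots on a single node
    print(total_inbound_capacity(34))  # 34 testnet full nodes seen by the probe
```

This is only an upper bound: the same inbound slots are shared with every other peer connecting to those full nodes, so the real headroom for light clients is smaller.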

Is there any logic in the full nodes that would cause ban waves if there are too many connections coming in?

The bootnode does have some logic to drop connections periodically, but judging from the logs you provided, the behavior does not quite match this drop policy. We need to investigate a bit more.

jordanmack commented 6 months ago

I am still seeing waves of disconnects in the new test environment. The 100 light clients are restricted from having internet access, but they have a perfect connection to the four local testnet full nodes since they all reside in different VMs on a single host computer.

In the log snippet below, each "Scan start." line marks the start of a monitor scan. The first two scans report no issues, meaning all 100 light clients have at least 2 peer connections. In the third scan, about a minute later, there is a wave of connection drops affecting 68 of the 100 clients: 64 have 0 peers and 4 have 1 peer.

20240122 14:23:04 [INFO] Scan start.
20240122 14:24:16 [INFO] Scan start.
20240122 14:25:28 [INFO] Scan start.
20240122 14:25:39 [INFO] There are 64 clients with 0 peers: 6, 14, 15, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 38, 39, 40, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 55, 56, 58, 59, 60, 61, 62, 63, 64, 66, 69, 70, 71, 72, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 95, 96
20240122 14:25:39 [INFO] There are 4 clients with 1 peer: 8, 37, 54, 99

The monitor log reflects many events like this. Resource utilization all looks normal. No firewalls are installed.
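For anyone digging through the monitor log, a small parser along these lines could help count how many clients are affected in each wave. It is a sketch that assumes each report occupies a single line in the format shown in the snippet above; the function names are made up for illustration.

```python
import re
from typing import Dict, Iterable, List, Optional, Tuple

# Matches lines such as:
# 20240122 14:25:39 [INFO] There are 4 clients with 1 peer: 8, 37, 54, 99
LINE_RE = re.compile(
    r"^(?P<ts>\d{8} \d{2}:\d{2}:\d{2}) \[INFO\] "
    r"There (?:are|is) (?P<n>\d+) clients? with (?P<peers>\d+) peers?: "
    r"(?P<ids>[\d, ]+)$"
)


def parse_low_peer_line(line: str) -> Optional[Tuple[str, int, int, List[int]]]:
    """Return (timestamp, peer_count, reported_total, client_ids), or None."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None  # e.g. "Scan start." lines
    ids = [int(x) for x in m.group("ids").split(",") if x.strip()]
    return m.group("ts"), int(m.group("peers")), int(m.group("n")), ids


def summarize(lines: Iterable[str]) -> Dict[str, int]:
    """Map each timestamp to the number of clients listed with low peer counts."""
    totals: Dict[str, int] = {}
    for line in lines:
        parsed = parse_low_peer_line(line)
        if parsed is not None:
            ts, _peers, _reported, ids = parsed
            totals[ts] = totals.get(ts, 0) + len(ids)
    return totals
```

Comparing `reported_total` against `len(client_ids)` is a cheap sanity check that a line was not truncated or wrapped.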

Monitor Log: monitor-log.tar.gz

Light Client Logs: client-logs.tar.gz

Full Node Logs: full-node-2.tar.gz full-node-3.tar.gz full-node-4.tar.gz full-node-5.tar.gz

Config Files: testnet-base.toml.txt ckb.toml.txt