Peculiar waves of disconnects on VM hosting 100 light clients.

jordanmack commented 6 months ago

I increased the number of testnet light clients on my VM from 10 to 100. All clients run the same configuration and are started at the same time. I also put together a small monitoring program that performs the following checks:

Check for any clients that are offline (not responding).
Check for any clients with less than 2 peers.
Check for any clients that report a tip lagging behind the others by more than 30 blocks.

It then sleeps for 60s before repeating.
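
The three checks above can be sketched roughly as follows. This is only an illustration, not the actual monitoring program: the `ClientStatus` type, the `scan` function, and the way status is gathered (a mapping where `None` means the client did not respond) are all hypothetical. Only the thresholds (fewer than 2 peers, more than 30 blocks behind) come from the description above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

MIN_PEERS = 2      # from the post: flag clients with fewer than 2 peers
MAX_TIP_LAG = 30   # from the post: flag tips more than 30 blocks behind


@dataclass
class ClientStatus:
    """Hypothetical snapshot of one light client."""
    peer_count: int
    tip_number: int


def scan(statuses: Dict[int, Optional[ClientStatus]]) -> Dict[str, List[int]]:
    """Run the three checks on one snapshot; None means the client is offline."""
    online = {cid: s for cid, s in statuses.items() if s is not None}
    # "Lagging behind the others" is interpreted here as lagging the highest tip seen.
    best_tip = max((s.tip_number for s in online.values()), default=0)
    return {
        "offline": sorted(cid for cid, s in statuses.items() if s is None),
        "low_peers": sorted(
            cid for cid, s in online.items() if s.peer_count < MIN_PEERS
        ),
        "lagging": sorted(
            cid for cid, s in online.items()
            if best_tip - s.tip_number > MAX_TIP_LAG
        ),
    }
```

A real monitor would fill `statuses` from each client's RPC endpoint and call `scan` once per cycle, sleeping 60s in between.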

There are a few interesting things in the logs:

There are three clients in particular that seem to have problems staying connected to at least two peers: 35, 83, 99
There are recurring waves of 10 or more clients that drop connection at the same time.

It occurs to me that the testnet only has 34 full nodes online according to the node probe. Is there any logic in the full nodes that would cause ban waves if there are too many connections coming in?

Example config file: testnet99.toml.txt

Two days of monitor logs: output.log

jordanmack commented 6 months ago

I've created a new network test to gather more information.

4 local testnet full nodes.
Full nodes can access the internet normally.
100 local light clients.
Light clients can access the local full nodes but not the internet.

quake commented 6 months ago

With the default configuration, one ckb full node can accept up to 125 - 8 = 117 connections: https://github.com/nervosnetwork/ckb/blob/develop/resource/ckb.toml#L83-L84

Considering the small number of full nodes online on testnet, I think it's normal for a few light clients to fail to connect after you've started 100 of them.
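
As a rough worked version of the arithmetic above, the sketch below multiplies the 117 slots per full node by a node count and compares that with what a set of light clients would ask for. The function names and the `peers_per_client` parameter are hypothetical; 125 and 8 are the defaults quoted above, and 2 peers per client is simply the monitor's alert threshold, not a light client setting. The result is an upper bound only, since other full nodes and other clients on the network occupy the same slots.

```python
MAX_PEERS = 125          # default quoted in the thread
MAX_OUTBOUND_PEERS = 8   # default quoted in the thread


def inbound_slots(full_nodes: int) -> int:
    """Upper bound on connections that `full_nodes` default-config nodes accept."""
    return full_nodes * (MAX_PEERS - MAX_OUTBOUND_PEERS)


def slots_needed(clients: int, peers_per_client: int) -> int:
    """Connections required if every client holds `peers_per_client` peers."""
    return clients * peers_per_client


if __name__ == "__main__":
    print("34 testnet full nodes:", inbound_slots(34), "slots at most")
    print("4 local full nodes:", inbound_slots(4), "slots at most")
    print("100 clients x 2 peers:", slots_needed(100, 2), "connections")
```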

> Is there any logic in the full nodes that would cause ban waves if there are too many connections coming in?

The bootnode does have some logic to drop connections periodically, but the behavior in the logs you provided does not quite match that drop policy. We need to investigate a bit more.

jordanmack commented 6 months ago

I am still seeing waves of disconnects in the new test environment. The 100 light clients are restricted from having internet access, but they have a perfect connection to the four local testnet full nodes since they all reside in different VMs on a single host computer.

In the log snippet below you can see each time the monitor starts a scan. The first two scans report no issues, meaning all 100 light clients have at least 2 peer connections. Then, in the third scan a minute later, 68 of the 100 clients have dropped connections at once: 64 with 0 peers and 4 with 1 peer.

20240122 14:23:04 [INFO] Scan start.
20240122 14:24:16 [INFO] Scan start.
20240122 14:25:28 [INFO] Scan start.
20240122 14:25:39 [INFO] There are 64 clients with 0 peers: 6, 14, 15, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 38, 39, 40, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 55, 56, 58, 59, 60, 61, 62, 63, 64, 66, 69, 70, 71, 72, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 95, 96
20240122 14:25:39 [INFO] There are 4 clients with 1 peer: 8, 37, 54, 99
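
To size each wave across the full monitor log, the report lines can be grouped by timestamp. The sketch below is a hypothetical helper, not part of the monitor: the line format is inferred only from the excerpt above, and lines that do not start with a timestamp are treated as the wrapped tail of the previous client list.

```python
import re
from typing import Dict, Iterable, List

TIMESTAMP = re.compile(r"^\d{8} \d{2}:\d{2}:\d{2} ")
# Format inferred from the excerpt in this thread; other message types are skipped.
REPORT = re.compile(
    r"^(\d{8} \d{2}:\d{2}:\d{2}) \[INFO\] "
    r"There are (\d+) clients with (\d+) peers?: (.*)$"
)


def join_wrapped(lines: Iterable[str]) -> List[str]:
    """Re-attach continuation lines (no leading timestamp) to the previous line."""
    joined: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        if TIMESTAMP.match(line) or not joined:
            joined.append(line)
        else:
            joined[-1] += line
    return joined


def affected_by_scan(lines: Iterable[str]) -> Dict[str, List[int]]:
    """Map each report timestamp to the sorted ids of all clients it lists."""
    result: Dict[str, List[int]] = {}
    for line in join_wrapped(lines):
        match = REPORT.match(line)
        if match is None:
            continue
        stamp, count, _peers, ids = match.groups()
        clients = [int(part) for part in ids.split(",") if part.strip()]
        if len(clients) != int(count):
            raise ValueError(f"count mismatch at {stamp}: {count} vs {len(clients)}")
        result.setdefault(stamp, []).extend(clients)
    return {stamp: sorted(ids) for stamp, ids in result.items()}
```

Calling `affected_by_scan(open("monitor.log"))` (the file name is a placeholder) gives one entry per scan that reported problems, so the length of each list is the size of that wave.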

The monitor log reflects many events like this. Resource utilization all looks normal. No firewalls are installed.

Monitor Log: monitor-log.tar.gz

Light Client Logs: client-logs.tar.gz

Full Node Logs: full-node-2.tar.gz full-node-3.tar.gz full-node-4.tar.gz full-node-5.tar.gz

Config Files: testnet-base.toml.txt ckb.toml.txt

nervosnetwork / ckb-light-client

Peculiar waves of disconnects on VM hosting 100 light clients. #182