nervosnetwork / ckb-light-client

CKB light client reference implementation
MIT License

Unusually high network activity from CKB Light Client. #168

Closed jordanmack closed 7 months ago

jordanmack commented 8 months ago

I run ten testnet CKB light clients on a virtual machine for testing purposes. Today I received a notification from my ISP that I was approaching my usage limit. My router reports several hundred GB of transmission in the last few days, which is very unusual. I traced the increase in network activity to the virtual machine running the light clients. The light clients are currently running v0.3.0, which was installed on November 1, 2023.

Below is a screenshot of the host node activity charts for the month. Starting on December 13th, network activity began climbing and reached nearly 3 MB/s on December 19th. This network activity is persistent and amounts to hundreds of GB over a few days. 20231222135445

Here are a few minutes of internet network activity for my network. 20231222135247

Looking at the ps output for the clients on the host machine, one node has significantly more CPU time than the others. However, there is nothing unusual in the screen output. (This was done after restarting the virtual machine and all the light clients when the high network activity was noticed again.) 20231222140417 20231222140558

After killing the client with high CPU time, the bandwidth has dropped significantly. The other 9 light client nodes are still running. 20231222141430

I have turned off this virtual machine for the time being. If you would like me to debug further, please let me know how I should do so.

quake commented 8 months ago

Are there any differences in the scripts settings for these 10 light clients? Could you call the get_scripts RPC and post the result here?
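
For example, something like the following minimal sketch (the RPC address 127.0.0.1:9000 is an assumption; substitute each client's configured rpc listen_address):

# Minimal sketch: call the light client's get_scripts JSON-RPC method.
# The address below is an assumption -- use each client's own rpc listen_address.
import json
import urllib.request

payload = json.dumps({
    "id": 42,
    "jsonrpc": "2.0",
    "method": "get_scripts",
    "params": [],
}).encode()

req = urllib.request.Request(
    "http://127.0.0.1:9000",
    data=payload,
    headers={"content-type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))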

jordanmack commented 8 months ago

Are there any differences in the scripts settings for these 10 light clients? Could you call the get_scripts RPC and post the result here?

I do not see any difference. They all return an empty result.

{
    "jsonrpc": "2.0",
    "result": [],
    "id": 42
}

quake commented 8 months ago

Could you upgrade to v0.3.3? I suspect it might be related to this bug fix: https://github.com/nervosnetwork/ckb-light-client/pull/163

jordanmack commented 8 months ago

Could you upgrade to v0.3.3? I suspect it might be related to this bug fix: #163

I upgraded to v0.3.3 by replacing the binary without deleting the configuration or data files. I no longer see the high bandwidth issue. However, I see that one node has significantly higher CPU usage than the others. This is node 9, whereas before it was node 1 that had high CPU and high bandwidth.

20231225222220

Output from the console is below.

log.txt

quake commented 8 months ago

Are there any differences between config/testnet9.toml and the others? Could you upload a log file from one of the normal clients? Thanks.

jordanmack commented 8 months ago

The only differences are the ports used.

testnet8-config.txt testnet9-config.txt

Here is the console output from client 8, which is running normally.

testnet8-log.txt

yangby-cryptape commented 8 months ago

Was the tip header of node 9 correct? (RPC method: get_tip_header)

Based on the current logs, I suspect there are abnormal or malicious CKB nodes sending unexpected responses to the light client. The light client should ban them, but doesn't. (This feature is currently unimplemented.)

So, I want to confirm whether the light client is working properly.


Why I think the remote peer is abnormal or malicious:

The light client won't send that many requests at one time; in fact, it sends at most 1 request every 10 seconds. But the node sent 367 responses at "2023-12-25T23:39:00Z".

:point_right: Or was there a network storm?


The log message is missing a "not".

The warning message should read "is not continuous with last filter block number"; the "not" was omitted due to a typo.

I will fix that later.

jordanmack commented 8 months ago

I left all clients running overnight. The elevated CPU activity has stopped now.

20231226104744

The current block on client 9 from get_tip_header is 11,737,540, which is behind the current testnet tip of 11,739,978.

Below is the most recent console output.

testnet9-log.txt

My router reports a normal amount of internet activity during the night. However, I noticed something interesting in the virtual machine dashboard. There was high network activity for several hours, but this time it was only local. I suspect that this is because I have a local testnet full node (v0.110.0) which is configured as the sole boot node in the light client configuration.

20231226105614

The clients started in an off state, and you can see above when I turned the clients back on based on the memory usage. Looking at the network traffic, you can see approximately 18 hours of high network activity before it stopped by itself. It was receiving 1.6 MB/s continuously during this period. Note that this chart is in MBps (megabytes per second), not Mbps (megabits per second). This amounts to approximately 100 GB over the 18-hour period.
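
As a quick sanity check on that figure, the arithmetic is just the sustained rate times the duration:

# 1.6 MB/s sustained for roughly 18 hours
rate_mb_per_s = 1.6
hours = 18
total_gb = rate_mb_per_s * 3600 * hours / 1000
print(total_gb)  # ~103.7 GB, consistent with the ~100 GB estimate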

Here is the output of get_tip_header for all clients. All are caught up except client 9.

{"jsonrpc":"2.0","result":{"compact_target":"0x1d095cdb","dao":"0x7622e7ee687d484a72ef796cd4c5270034e209fcfed4d30500552addf42cce08","epoch":"0x708033f001ebf","extra_hash":"0xd29319209353f62f2f07bf192644bcdd8885ea3613ada7f34d09c0ddcec67495","hash":"0xdd4b522a51298ba00b5198112589175385c3cf06fac4168757026d78e9d4bec4","nonce":"0x8acba2f93ba31690fe4879cb12d8145a","number":"0xb323e8","parent_hash":"0x71cd87b90abc049e0a1b592a434faa10e09fbe1ba2c07b927e771936c958566c","proposals_hash":"0x46a1887efd5e4ca6c7980cfaeb89d1fafb2ee62ec7c3e0977f29ac18bcbc015b","timestamp":"0x18ca79af537","transactions_root":"0x395e186389ecf95cf194ea891fdb139b26cfe8fb03708cd6932b0ba8c6464917","version":"0x0"},"id":1}
{"jsonrpc":"2.0","result":{"compact_target":"0x1d095cdb","dao":"0x7622e7ee687d484a72ef796cd4c5270034e209fcfed4d30500552addf42cce08","epoch":"0x708033f001ebf","extra_hash":"0xd29319209353f62f2f07bf192644bcdd8885ea3613ada7f34d09c0ddcec67495","hash":"0xdd4b522a51298ba00b5198112589175385c3cf06fac4168757026d78e9d4bec4","nonce":"0x8acba2f93ba31690fe4879cb12d8145a","number":"0xb323e8","parent_hash":"0x71cd87b90abc049e0a1b592a434faa10e09fbe1ba2c07b927e771936c958566c","proposals_hash":"0x46a1887efd5e4ca6c7980cfaeb89d1fafb2ee62ec7c3e0977f29ac18bcbc015b","timestamp":"0x18ca79af537","transactions_root":"0x395e186389ecf95cf194ea891fdb139b26cfe8fb03708cd6932b0ba8c6464917","version":"0x0"},"id":1}
{"jsonrpc":"2.0","result":{"compact_target":"0x1d095cdb","dao":"0x7622e7ee687d484a72ef796cd4c5270034e209fcfed4d30500552addf42cce08","epoch":"0x708033f001ebf","extra_hash":"0xd29319209353f62f2f07bf192644bcdd8885ea3613ada7f34d09c0ddcec67495","hash":"0xdd4b522a51298ba00b5198112589175385c3cf06fac4168757026d78e9d4bec4","nonce":"0x8acba2f93ba31690fe4879cb12d8145a","number":"0xb323e8","parent_hash":"0x71cd87b90abc049e0a1b592a434faa10e09fbe1ba2c07b927e771936c958566c","proposals_hash":"0x46a1887efd5e4ca6c7980cfaeb89d1fafb2ee62ec7c3e0977f29ac18bcbc015b","timestamp":"0x18ca79af537","transactions_root":"0x395e186389ecf95cf194ea891fdb139b26cfe8fb03708cd6932b0ba8c6464917","version":"0x0"},"id":1}
{"jsonrpc":"2.0","result":{"compact_target":"0x1d095cdb","dao":"0x7622e7ee687d484a72ef796cd4c5270034e209fcfed4d30500552addf42cce08","epoch":"0x708033f001ebf","extra_hash":"0xd29319209353f62f2f07bf192644bcdd8885ea3613ada7f34d09c0ddcec67495","hash":"0xdd4b522a51298ba00b5198112589175385c3cf06fac4168757026d78e9d4bec4","nonce":"0x8acba2f93ba31690fe4879cb12d8145a","number":"0xb323e8","parent_hash":"0x71cd87b90abc049e0a1b592a434faa10e09fbe1ba2c07b927e771936c958566c","proposals_hash":"0x46a1887efd5e4ca6c7980cfaeb89d1fafb2ee62ec7c3e0977f29ac18bcbc015b","timestamp":"0x18ca79af537","transactions_root":"0x395e186389ecf95cf194ea891fdb139b26cfe8fb03708cd6932b0ba8c6464917","version":"0x0"},"id":1}
{"jsonrpc":"2.0","result":{"compact_target":"0x1d095cdb","dao":"0x7622e7ee687d484a72ef796cd4c5270034e209fcfed4d30500552addf42cce08","epoch":"0x708033f001ebf","extra_hash":"0xd29319209353f62f2f07bf192644bcdd8885ea3613ada7f34d09c0ddcec67495","hash":"0xdd4b522a51298ba00b5198112589175385c3cf06fac4168757026d78e9d4bec4","nonce":"0x8acba2f93ba31690fe4879cb12d8145a","number":"0xb323e8","parent_hash":"0x71cd87b90abc049e0a1b592a434faa10e09fbe1ba2c07b927e771936c958566c","proposals_hash":"0x46a1887efd5e4ca6c7980cfaeb89d1fafb2ee62ec7c3e0977f29ac18bcbc015b","timestamp":"0x18ca79af537","transactions_root":"0x395e186389ecf95cf194ea891fdb139b26cfe8fb03708cd6932b0ba8c6464917","version":"0x0"},"id":1}
{"jsonrpc":"2.0","result":{"compact_target":"0x1d095cdb","dao":"0x7622e7ee687d484a72ef796cd4c5270034e209fcfed4d30500552addf42cce08","epoch":"0x708033f001ebf","extra_hash":"0xd29319209353f62f2f07bf192644bcdd8885ea3613ada7f34d09c0ddcec67495","hash":"0xdd4b522a51298ba00b5198112589175385c3cf06fac4168757026d78e9d4bec4","nonce":"0x8acba2f93ba31690fe4879cb12d8145a","number":"0xb323e8","parent_hash":"0x71cd87b90abc049e0a1b592a434faa10e09fbe1ba2c07b927e771936c958566c","proposals_hash":"0x46a1887efd5e4ca6c7980cfaeb89d1fafb2ee62ec7c3e0977f29ac18bcbc015b","timestamp":"0x18ca79af537","transactions_root":"0x395e186389ecf95cf194ea891fdb139b26cfe8fb03708cd6932b0ba8c6464917","version":"0x0"},"id":1}
{"jsonrpc":"2.0","result":{"compact_target":"0x1d095cdb","dao":"0x7622e7ee687d484a72ef796cd4c5270034e209fcfed4d30500552addf42cce08","epoch":"0x708033f001ebf","extra_hash":"0xd29319209353f62f2f07bf192644bcdd8885ea3613ada7f34d09c0ddcec67495","hash":"0xdd4b522a51298ba00b5198112589175385c3cf06fac4168757026d78e9d4bec4","nonce":"0x8acba2f93ba31690fe4879cb12d8145a","number":"0xb323e8","parent_hash":"0x71cd87b90abc049e0a1b592a434faa10e09fbe1ba2c07b927e771936c958566c","proposals_hash":"0x46a1887efd5e4ca6c7980cfaeb89d1fafb2ee62ec7c3e0977f29ac18bcbc015b","timestamp":"0x18ca79af537","transactions_root":"0x395e186389ecf95cf194ea891fdb139b26cfe8fb03708cd6932b0ba8c6464917","version":"0x0"},"id":1}
{"jsonrpc":"2.0","result":{"compact_target":"0x1d095cdb","dao":"0x7622e7ee687d484a72ef796cd4c5270034e209fcfed4d30500552addf42cce08","epoch":"0x708033f001ebf","extra_hash":"0xd29319209353f62f2f07bf192644bcdd8885ea3613ada7f34d09c0ddcec67495","hash":"0xdd4b522a51298ba00b5198112589175385c3cf06fac4168757026d78e9d4bec4","nonce":"0x8acba2f93ba31690fe4879cb12d8145a","number":"0xb323e8","parent_hash":"0x71cd87b90abc049e0a1b592a434faa10e09fbe1ba2c07b927e771936c958566c","proposals_hash":"0x46a1887efd5e4ca6c7980cfaeb89d1fafb2ee62ec7c3e0977f29ac18bcbc015b","timestamp":"0x18ca79af537","transactions_root":"0x395e186389ecf95cf194ea891fdb139b26cfe8fb03708cd6932b0ba8c6464917","version":"0x0"},"id":1}
{"jsonrpc":"2.0","result":{"compact_target":"0x1d095cdb","dao":"0x7622e7ee687d484a72ef796cd4c5270034e209fcfed4d30500552addf42cce08","epoch":"0x708033f001ebf","extra_hash":"0xd29319209353f62f2f07bf192644bcdd8885ea3613ada7f34d09c0ddcec67495","hash":"0xdd4b522a51298ba00b5198112589175385c3cf06fac4168757026d78e9d4bec4","nonce":"0x8acba2f93ba31690fe4879cb12d8145a","number":"0xb323e8","parent_hash":"0x71cd87b90abc049e0a1b592a434faa10e09fbe1ba2c07b927e771936c958566c","proposals_hash":"0x46a1887efd5e4ca6c7980cfaeb89d1fafb2ee62ec7c3e0977f29ac18bcbc015b","timestamp":"0x18ca79af537","transactions_root":"0x395e186389ecf95cf194ea891fdb139b26cfe8fb03708cd6932b0ba8c6464917","version":"0x0"},"id":1}
{"jsonrpc":"2.0","result":{"compact_target":"0x1d08e641","dao":"0x78e44f925a31474a8f2b7252a9c52700528e94370a8ed305006e98e6681ece08","epoch":"0x7080023001ebe","extra_hash":"0x138888b369a0da4f8ac80bab46c2c72948932db8674da901b7e85c749cb5b90b","hash":"0x30e553a752dbb6258d7d141de2a5e014df8ee6a35fdf41a4060c31d74028f8ce","nonce":"0x5facfe88d6e00103809525ad07500319","number":"0xb319c4","parent_hash":"0xaf6a66074c42f16675f9641f69988c0a9514520515275725a3a0f5d27f1b6af1","proposals_hash":"0x13068f7a21a7c4a32c60aa68295ec15d0c0a11165cd0efd5bc9c4b5d30bd9dfd","timestamp":"0x18ca654f1d7","transactions_root":"0x27d8c4cfa7b043f8ce9ed41097376db997e99148779f1f675b127a85a678aba6","version":"0x0"},"id":1}
yangby-cryptape commented 8 months ago

In the log file "testnet9-log.txt", it prints the following log message every 8 seconds:

block number (11737540) in storage is not less than the block number (11737540) which requires proving

So those log messages mean that the remote CKB node (peer SessionId(23)) kept block 11737540 as its tip for hours (at least from 2023-12-26T16:45:21Z to 2023-12-26T19:01:15Z). The light client worked correctly, but it didn't connect to any peers other than SessionId(23).
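
If it helps to quantify how long a client stays stuck, here is a quick way to count the repeats of that warning per block number (illustrative only; assumes the log file is in the current directory):

# Count how often the stalled-tip warning repeats, grouped by block number.
import re
from collections import Counter

pattern = re.compile(r"block number \((\d+)\) in storage is not less than")
counts = Counter()
with open("testnet9-log.txt") as f:
    for line in f:
        m = pattern.search(line)
        if m:
            counts[m.group(1)] += 1

for block, n in counts.most_common():
    print(f"block {block}: warned {n} times")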

yangby-cryptape commented 8 months ago

Could you provide the results of get_peers on client 9 and sync_state on your local boot node?

jordanmack commented 8 months ago

get_peers for client 9:

{
    "jsonrpc":"2.0",
    "result":[
        {
            "addresses":[
                {
                    "address":"/ip4/192.168.0.73/tcp/18115/p2p/QmQnnZeQnpsABhQEz5hGMoV4jkuDR3cRr6STriTEmBzKFu",
                    "score":"0x64"
                }
            ],
            "connected_duration":"0x325293e",
            "node_id":"QmQnnZeQnpsABhQEz5hGMoV4jkuDR3cRr6STriTEmBzKFu",
            "protocols":[
                {
                    "id":"0x64",
                    "version":"2"
                },
                {
                    "id":"0x0",
                    "version":"2"
                },
                {
                    "id":"0x2",
                    "version":"2"
                },
                {
                    "id":"0x78",
                    "version":"2"
                },
                {
                    "id":"0x1",
                    "version":"2.1"
                },
                {
                    "id":"0x79",
                    "version":"2"
                },
                {
                    "id":"0x4",
                    "version":"2"
                }
            ],
            "sync_state":{
                "proved_best_known_header":null,
                "requested_best_known_header":null
            },
            "version":"0.110.0 (0679b11 2023-05-16)"
        }
    ],
    "id":1
}

All the other clients return long lists of peers from get_peers.

sync_state for local boot node:

{
    "jsonrpc": "2.0",
    "result": {
        "best_known_block_number": "0xb33386",
        "best_known_block_timestamp": "0x18ca97dadf9",
        "fast_time": "0x3e8",
        "ibd": false,
        "inflight_blocks_count": "0x0",
        "low_time": "0x5dc",
        "normal_time": "0x4e2",
        "orphan_blocks_count": "0x0"
    },
    "id": 42
}
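
A quick way to spot a client stuck in this state is to call get_peers on each client and flag the ones with only a single connection or no proved header (a sketch; the 9000-9009 port range is an assumption):

# Sketch: flag light clients that have only one peer, or peers with no proved header.
import json
import urllib.request

def rpc(port, method):
    payload = json.dumps({"id": 1, "jsonrpc": "2.0",
                          "method": method, "params": []}).encode()
    req = urllib.request.Request(f"http://127.0.0.1:{port}", data=payload,
                                 headers={"content-type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["result"]

for i, port in enumerate(range(9000, 9010), start=1):
    peers = rpc(port, "get_peers")
    unproved = [p["node_id"] for p in peers
                if p["sync_state"]["proved_best_known_header"] is None]
    if len(peers) <= 1 or unproved:
        print(f"client {i}: {len(peers)} peer(s), unproved: {unproved}")
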
quake commented 8 months ago

It looks like your local bootnode is not broadcasting other full node addresses to client 9. This is weird; all the other clients are receiving addresses and are able to connect to more nodes as normal.

Could you upgrade your local bootnode to the latest version, then reboot client 9 and try again?

jordanmack commented 8 months ago

It looks like your local bootnode is not broadcasting other full node addresses to client 9. This is weird; all the other clients are receiving addresses and are able to connect to more nodes as normal.

Does any of this explain why something used 100 GB of bandwidth?

Could you upgrade your local bootnode to the latest version, then reboot client 9 and try again?

Will try next.

jordanmack commented 8 months ago

The full node and CKB light client 9 are now at the newest available versions.

Client 9 is using more CPU cycles and high bandwidth again. Network activity on the virtual machine is 3.5 MB/s. Internet traffic is just over 1.5 MB/s. It is probably sustaining 1.5 MB/s to the local full node and 1.5 MB/s to the external full node.

20231227193813 20231227194541 20231227194554

Here is the current screen output from client 9.

client9-log.txt

Here is the result of get_peers. There is a second peer now.

client9-get-peers.txt

Here is the result of get_tip_header. It was at the tip for the entire time I was tracking it.

{
    "jsonrpc": "2.0",
    "result": {
        "compact_target": "0x1d0892c8",
        "dao": "0x9655c8eedbc64f4a20fbaa85c6c62700b165db4ca063d50500c18d82bf7ece08",
        "epoch": "0x70803f7001ec7",
        "extra_hash": "0x9ed419567e868465abbe943ef8a1859764092a89f1f62646569ce69af17b18b4",
        "hash": "0x8e9ef98d1de9eee0fb2a2d8271f29766d5380929898b94ae93c749d372f23983",
        "nonce": "0x33aeecb665c83f6cdb9627e306eb7b49",
        "number": "0xb35ce0",
        "parent_hash": "0x9c16173c7f423767b930ef9d2574272a791a95bd93a7ceccf7c27933d78e7579",
        "proposals_hash": "0x70bb2879cf3a8f386e6d892b2f14b54a9337b9f8418b31ff59a44fccc7f5ae3d",
        "timestamp": "0x18cae80427a",
        "transactions_root": "0x8225a143c7b4939296009344a4c5c7bb4f79e0f66bef6e30dfb2af2dad24041e",
        "version": "0x0"
    },
    "id": 42
}

After stopping client 9, network activity ceases.

quake commented 8 months ago

Could you zip the client 9 data folder and upload it? I'd like to try to reproduce this in my local environment.

jordanmack commented 8 months ago

Could you zip the client 9 data folder and upload it? I'd like to try to reproduce this in my local environment.

data9.tar.gz

quake commented 8 months ago

We have reproduced this abnormal network usage and are working on a bug fix. We will update later, thanks.

yangby-cryptape commented 8 months ago

There are 3 sub-issues behind this issue:

  1. [ ] The light client doesn't discover enough nodes.

    The light client requires 50% of its outbound peers (the default max_outbound_peers is 8) to finalize a checkpoint (a sketch of this quorum rule and its fallout follows the list).

    https://github.com/nervosnetwork/ckb-light-client/blob/fc45e35cd3ce40fae5762daa97e4c89991094262/config/testnet.toml#L54

    The reason for this issue is still unclear, but it may be related to nervosnetwork/ckb#4133.

  2. [x] The light client syncs all block filter hashes after the finalized checkpoint.

    When there are not enough connected peers, the light client syncs all block filter hashes from every connected peer, because the first issue leads to finalized-checkpoint = 0.

    This causes a huge bandwidth and CPU cost.

    Ref: nervosnetwork/ckb-light-client#169

  3. [ ] Light clients receive a fixed state from some CKB nodes.

    I checked some CKB nodes that kept sending the same tip header for hours; their actual tip headers are correct.

    The state the light client receives is the state from when the light client first connected to the CKB node.

    The reason for this issue is still unknown.
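
As referenced in sub-issue 1 above, here is a minimal sketch of how a 50%-of-outbound-peers quorum gates checkpoint finalization, and why a finalized checkpoint of 0 leads straight into sub-issue 2. This is illustrative only, not the actual ckb-light-client code; the names are assumptions.

# Illustrative sketch (not the real implementation): checkpoint finalization
# needs agreement from at least 50% of the outbound peer slots.
MAX_OUTBOUND_PEERS = 8              # default from config/testnet.toml
QUORUM = MAX_OUTBOUND_PEERS // 2    # 50% of outbound peers

def finalized_checkpoint(peer_checkpoints):
    """Highest checkpoint index that at least QUORUM peers have reached."""
    if len(peer_checkpoints) < QUORUM:
        return 0                    # sub-issue 1: too few peers, nothing finalizes
    ordered = sorted(peer_checkpoints, reverse=True)
    return ordered[QUORUM - 1]      # the QUORUM-th best value has quorum support

# Sub-issue 2: with finalized_checkpoint == 0, the client starts its
# block filter hash sync from the beginning against every connected peer,
# which is where the huge bandwidth and CPU cost comes from.
print(finalized_checkpoint([42, 42, 41]))            # -> 0 (3 peers, quorum is 4)
print(finalized_checkpoint([42, 42, 41, 41, 40]))    # -> 41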

quake commented 8 months ago

@jordanmack Could you update the bootnodes section of your light clients' config to the public nodes listed here: https://github.com/nervosnetwork/ckb-light-client/blob/develop/config/testnet.toml#L18-L43

I think it may resolve this abnormal CPU usage as a temporary solution.

yangby-cryptape commented 8 months ago

I added some rules to ban a peer:

A light client cannot distinguish whether a peer merely has a stale state or is malicious, so it simply bans the peer and then finds a new peer to replace it.

p.s. After a few minutes, the ban status will be removed.
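
To make the idea concrete, here is a rough sketch of that kind of rule: if a peer's reported tip does not advance within some window, ban it briefly and look for a replacement. The names and the 5-minute window are assumptions for illustration, not the actual ckb-light-client values.

# Rough sketch only: temporarily ban peers whose reported tip never advances.
import time

BAN_SECONDS = 5 * 60   # assumed short, temporary ban window

class PeerState:
    def __init__(self):
        self.last_tip = None
        self.last_progress = time.monotonic()

def on_tip_report(peer_id, reported_tip, peers, banned):
    """Track each peer's reported tip; ban peers that stop making progress."""
    state = peers.setdefault(peer_id, PeerState())
    if state.last_tip is None or reported_tip > state.last_tip:
        state.last_tip = reported_tip
        state.last_progress = time.monotonic()
    elif time.monotonic() - state.last_progress > BAN_SECONDS:
        # Cannot tell stale from malicious, so ban it either way and move on.
        banned[peer_id] = time.monotonic() + BAN_SECONDS
        peers.pop(peer_id, None)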

quake commented 8 months ago

You can upgrade the light client to v0.3.4 now; it resolves this abnormal CPU and network usage issue: https://github.com/nervosnetwork/ckb-light-client/releases/tag/v0.3.4

jordanmack commented 7 months ago

So far everything is running fine, aside from a few long fork errors from two clients on startup.

If I shutdown a client and delete the data/network/peer_store folder, will that reset the known peers and force it to use the boot nodes again?

jordanmack commented 7 months ago

One other question, do light clients create connections with other light clients for any reason, or do they only connect to full nodes?

quake commented 7 months ago

If I shutdown a client and delete the data/network/peer_store folder, will that reset the known peers and force it to use the boot nodes again?

yes

One other question, do light clients create connections with other light clients for any reason, or do they only connect to full nodes?

only connect to full nodes