nervosnetwork / ckb

The Nervos CKB is a public permissionless blockchain, and the layer 1 of Nervos network.
https://www.nervos.org
MIT License
1.14k stars 224 forks

CKB v114.0 Full node sync is stopped. #4462

Open silySuper opened 2 months ago

silySuper commented 2 months ago

Bug Report

Full node sync is stopped.

Current Behavior

Screenshot 2024-05-15 10 07 34

logs 2.zip

Screenshot 2024-05-15 10 09 17

Environment

testnet/data has been replaced by https://download.magickbase.com/backup_20240513.tar.gz

eval-exec commented 2 months ago

(The link you provided, https://download.magickbase.com/backup_20240513.tar.gz is returning a 404 error.)

When you say, "Full node is stopped," do you specifically mean that the CKB has been synchronizing for a long time but the block height hasn't increased? How long has it been syncing?

Could you provide the output of:

curl -X POST 127.0.0.1:8114 -H 'Content-Type: application/json' -d '{ "id": 42, "jsonrpc": "2.0", "method": "sync_state", "params": [ ] }'

and

curl -X POST 127.0.0.1:8114 -H 'Content-Type: application/json' -d '{ "id": 42, "jsonrpc": "2.0", "method": "get_peers", "params": [ ] }'
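
For repeated checks, the two requests above can be wrapped in a small shell helper. This is only a sketch: `ckb_rpc` and `ckb_rpc_payload` are hypothetical names (not part of ckb), and it assumes the RPC is listening on 127.0.0.1:8114 as in the commands above.

```shell
# Hypothetical helpers, not shipped with ckb. Override CKB_RPC_URL if the
# RPC listens somewhere else.
CKB_RPC_URL="${CKB_RPC_URL:-http://127.0.0.1:8114}"

# Print a JSON-RPC 2.0 request body for a method that takes no parameters.
ckb_rpc_payload() {
    printf '{"id":42,"jsonrpc":"2.0","method":"%s","params":[]}\n' "$1"
}

# POST the request to the node and print the raw JSON response.
ckb_rpc() {
    curl -s -X POST "$CKB_RPC_URL" \
        -H 'Content-Type: application/json' \
        -d "$(ckb_rpc_payload "$1")"
}
```

Usage would then be `ckb_rpc sync_state` and `ckb_rpc get_peers`.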
silySuper commented 2 months ago

It has been syncing since yesterday afternoon.

Screenshot 2024-05-15 10 50 51 Screenshot 2024-05-15 10 51 09
eval-exec commented 2 months ago

The get_peers RPC returns an empty result, indicating that the CKB node isn't maintaining network connections with other peers, hence it's unable to synchronize blocks.

  1. Have you made any changes to the default ckb.toml file? If so, how did you modify it? Did you edit the configuration related to the white-list in ckb.toml?
  2. Can you share the complete log file (./data/logs/run.log)?
silySuper commented 2 months ago

run.log

I did not change ckb.toml. This is the relevant part of ckb.toml:

### Whitelist-only mode
# whitelist_only = false
### Whitelist peers connecting from the given IP addresses
# whitelist_peers = []
### Enable `SO_REUSEPORT` feature to reuse port on Linux, not supported on other OS yet
# reuse_port_on_linux = true

max_peers = 125
max_outbound_peers = 8
# 2 minutes
ping_interval_secs = 120
# 20 minutes
ping_timeout_secs = 1200
connect_outbound_interval_secs = 15
# If set to true, try to register upnp
upnp = false
# If set to true, network service will add discovered local address to peer store, it's helpful for private net development
discovery_local_address = false
# If set to true, random cleanup when there are too many inbound nodes
# Ensure that itself can continue to serve as a bootnode node
bootnode_mode = false
eval-exec commented 2 months ago

I found some ERROR entries in run.log:

2024-05-14 20:18:58.451 +00:00 log flusher ERROR sled::flusher  failed to fsync from periodic flush thread: Input/output error (os error 5)
  1. What happened at 2024-05-14 20:18:58.451?
  2. Could you provide the data/network/peer_store/addr_manager.db and data/network/peer_store/ban_list.db files?
  3. Could you provide:
    curl -X POST 127.0.0.1:8114 -H 'Content-Type: application/json' -d '{ "id": 42, "jsonrpc": "2.0", "method": "get_banned_addresses", "params": [ ] }'

    and

    curl -X POST 127.0.0.1:8114 -H 'Content-Type: application/json' -d '{ "id": 42, "jsonrpc": "2.0", "method": "get_tip_header", "params": [ ] }'
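
To see how often the disk error shows up, something like the following could be run against the log. A minimal sketch: `count_io_errors` and `show_error_lines` are hypothetical helper names, and the log path is assumed to be the ./data/logs/run.log mentioned earlier in this thread.

```shell
# Count log lines that contain the I/O error message seen in run.log.
count_io_errors() {
    grep -cF 'Input/output error (os error 5)' "$1"
}

# Print every ERROR-level line so the timestamps can be compared with the
# times the machine slept or the disk was disconnected.
show_error_lines() {
    grep -F ' ERROR ' "$1"
}
```

For example: `count_io_errors ./data/logs/run.log`.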
silySuper commented 2 months ago

1. At 2024-05-14 20:18:58.451, the computer was asleep (v0.114.0 is on my hard disk).

2. addr_manager.db.zip ban_list.db.zip

3.

Screenshot 2024-05-15 14 07 12 Screenshot 2024-05-15 14 11 55
silySuper commented 2 months ago
Screenshot 2024-05-15 14 40 28

Now the ckb server shows an error.

eval-exec commented 2 months ago

Is your hard drive malfunctioning? Could you change [logger].filter to "debug" in ckb.toml, restart the ckb node, and then provide the log file?

silySuper commented 2 months ago

My hard drive has not thrown errors before. I will find a tool to check whether it is malfunctioning.

logs.zip

eval-exec commented 2 months ago

I suspect there might be an issue with the [network] configuration in your config file.

What's the configuration for support_protocols in your ckb.toml file?

Could you share the complete configuration from your ckb.toml file?

eval-exec commented 2 months ago

I found that your ckb process is busy serving RPC requests; I guess they are the Indexer's get_cells or get_transactions RPCs. I also observed that the "id" field in the RPC requests is not consistent. Is the ckb process operating as a public node?

❯ cat logs/run.log | grep -i rpc | head
2024-05-15 07:21:02.362 +00:00 main INFO ckb_rpc::server  Listen HTTP RPCServer on address: 127.0.0.1:8114
2024-05-15 07:21:02.944 +00:00 GlobalRt-7 DEBUG rpc  Response: {"jsonrpc":"2.0","result":"0xc9ed7d","id":8362}.
2024-05-15 07:21:02.953 +00:00 GlobalRt-6 DEBUG rpc  Response: {"jsonrpc":"2.0","result":"0x10639e0895502b5688a6be8cf69460d76541bfa4821629d86d62ba0aae3f9606","id":8716}.
2024-05-15 07:21:02.964 +00:00 GlobalRt-6 DEBUG rpc  Response: {"jsonrpc":"2.0","result":{"alerts":[],"chain":"ckb_testnet","difficulty":"0x1c794aac","epoch":"0x70800e40021fd","is_initial_block_download":true,"median_time":"0x18f70071de4"},"id":2884}.
2024-05-15 07:21:02.971 +00:00 GlobalRt-0 DEBUG rpc  Response: {"jsonrpc":"2.0","result":"0x10639e0895502b5688a6be8cf69460d76541bfa4821629d86d62ba0aae3f9606","id":1528}.
2024-05-15 07:21:02.972 +00:00 GlobalRt-6 DEBUG rpc  Response: {"jsonrpc":"2.0","result":{"alerts":[],"chain":"ckb_testnet","difficulty":"0x1c794aac","epoch":"0x70800e40021fd","is_initial_block_download":true,"median_time":"0x18f70071de4"},"id":2529}.
2024-05-15 07:21:02.975 +00:00 GlobalRt-6 DEBUG rpc  Response: {"jsonrpc":"2.0","result":"0x10639e0895502b5688a6be8cf69460d76541bfa4821629d86d62ba0aae3f9606","id":3712}.
2024-05-15 07:21:03.926 +00:00 GlobalRt-0 DEBUG rpc  Response: {"jsonrpc":"2.0","result":"0xc9ed7d","id":804}.
2024-05-15 07:21:04.931 +00:00 GlobalRt-4 DEBUG rpc  Response: {"jsonrpc":"2.0","result":"0xc9ed7d","id":8224}.
2024-05-15 07:21:05.190 +00:00 GlobalRt-6 DEBUG rpc  Response: {"jsonrpc":"2.0","result":{"compact_target":"0x1d08fda0","dao":"0xa58aea41d2aa324d7444ff8c4625280007bc9da5479673060085b8d95c40d408","epoch":"0x70800e40021fd","extra_hash":"0x167c593d80e706b9c2e52c9a6aeebf39fdd08574c55b6deb5df128a1484677cb","hash":"0xf47a17392103d8c423089c6fb42ea3bd15cd44b5a8268c7ae18a72d517906ce2","nonce":"0xc108f9e52219af493fa13978b8cb7429","number":"0xc9ed7d","parent_hash":"0xcdc8fe751ae2731595b674b07f116ab240eefde45b4a911e702b79f58cffc441","proposals_hash":"0xf6d454599fdf28e3c73fb8e7d0d93e3f6128ad650a21b0f745cb9fba8bb0742f","timestamp":"0x18f70094a66","transactions_root":"0x9ff138b4bd1cbb83065343651537ac465209abb26455f0dd603978c6447ea6f4","version":"0x0"},"id":3662}.
/tmp/t
❯ cat logs/run.log | grep -i rpc | tail
2024-05-15 07:22:08.718 +00:00 GlobalRt-1 DEBUG rpc  Response: {"jsonrpc":"2.0","result":{"last_cursor":"0x","objects":[]},"id":3306}.
2024-05-15 07:22:08.719 +00:00 GlobalRt-8 DEBUG rpc  Response: {"jsonrpc":"2.0","result":{"last_cursor":"0x","objects":[]},"id":7302}.
2024-05-15 07:22:08.719 +00:00 GlobalRt-8 DEBUG rpc  Response: {"jsonrpc":"2.0","result":{"last_cursor":"0x","objects":[]},"id":9305}.
2024-05-15 07:22:08.719 +00:00 GlobalRt-8 DEBUG rpc  Response: {"jsonrpc":"2.0","result":{"last_cursor":"0x","objects":[]},"id":8167}.
2024-05-15 07:22:08.720 +00:00 GlobalRt-1 DEBUG rpc  Response: {"jsonrpc":"2.0","result":{"last_cursor":"0x","objects":[]},"id":3193}.
2024-05-15 07:22:08.721 +00:00 GlobalRt-8 DEBUG rpc  Response: {"jsonrpc":"2.0","result":{"last_cursor":"0x","objects":[]},"id":4394}.
2024-05-15 07:22:08.721 +00:00 GlobalRt-1 DEBUG rpc  Response: {"jsonrpc":"2.0","result":{"last_cursor":"0x","objects":[]},"id":3780}.
2024-05-15 07:22:08.721 +00:00 GlobalRt-1 DEBUG rpc  Response: {"jsonrpc":"2.0","result":{"last_cursor":"0x","objects":[]},"id":285}.
2024-05-15 07:22:08.722 +00:00 GlobalRt-1 DEBUG rpc  Response: {"jsonrpc":"2.0","result":{"last_cursor":"0x","objects":[]},"id":3175}.
2024-05-15 07:22:08.723 +00:00 GlobalRt-1 DEBUG rpc  Response: {"jsonrpc":"2.0","result":{"last_cursor":"0x","objects":[]},"id":8217}.
/tmp/t
❯ cat logs/run.log | grep -i rpc | wc -l
116692

How's the CPU and IO load on the machine where the ckb process is running?

Which client is making a large number of RPC requests to the ckb node? Could you try turning off the client to see if it makes a difference?
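
The head and tail output above span roughly 66 seconds (07:21:02 to 07:22:08), so 116692 matching lines works out to something on the order of 1,700 per second. To get a per-second breakdown rather than a single total, a sketch like the following could be used. `rpc_responses_per_second` is a hypothetical helper name, and it assumes the debug log keeps the `DEBUG rpc  Response:` line format shown above.

```shell
# Print "<count> <date> <hh:mm:ss>" for each second that has RPC responses
# in the debug log, busiest second first.
rpc_responses_per_second() {
    awk '/ DEBUG rpc +Response: / { n[$1 " " substr($2, 1, 8)]++ }
         END { for (k in n) print n[k], k }' "$1" | sort -rn
}
```

For example: `rpc_responses_per_second logs/run.log | head`.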

silySuper commented 2 months ago
Screenshot 2024-05-15 16 51 22

I did not find any other RPC requests to the ckb node.

Screenshot 2024-05-15 16 52 50
eval-exec commented 2 months ago

I did not find any other RPC requests to the ckb node.

How about changing [network].listen_addresses to another port and restarting?

silySuper commented 2 months ago

Changed the port from 8115 to 8117 and restarted.

Screenshot 2024-05-15 17 27 13

Same result.

quake commented 2 months ago

I did not find any other RPC requests to the ckb node.

How about changing [network].listen_addresses to another port and restarting?

I think it should be [rpc].listen_address
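
Since the two settings have nearly identical names, it may help to print both with their sections before editing. A minimal sketch: `show_listen_settings` is a hypothetical helper, and it assumes the usual ckb.toml layout where `listen_address` sits under `[rpc]` (port 8114 in this thread) and `listen_addresses` under `[network]` (port 8115 in this thread).

```shell
# Print each uncommented listen_address / listen_addresses line together
# with the TOML section header it appears under.
show_listen_settings() {
    awk '/^\[/ { section = $0 }
         /^listen_address/ { print section, $0 }' "$1"
}
```

For example: `show_listen_settings ckb.toml`.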

silySuper commented 2 months ago

Changed from 8114 to 8115.

Screenshot 2024-05-15 18 36 20
eval-exec commented 2 months ago

Changed from 8114 to 8115.

Has the height of the ckb node increased after an hour has passed? Is the Neuron client currently connected to this ckb node? I suspect that the Neuron client is making a large number of RPC requests to the ckb node. Could you try shutting down the Neuron client and launching the ckb node directly? (Then provide the debug log.)

silySuper commented 2 months ago

After changing the port, it always shows "connecting", so I changed the port back to 8114. I want to try v115, but it is always reported as not safe, even after I clicked Allow under Privacy and Security on my computer. Can we find a faster way to solve this? It has blocked me for about two weeks.

silySuper commented 2 months ago

Step 1.

Screenshot 2024-05-16 09 49 36

Step 2.

Screenshot 2024-05-16 09 49 48

Step 3.

Screenshot 2024-05-16 09 50 50
eval-exec commented 2 months ago

I'm sorry, I don't have experience with Mac. Would this link to Apple support be helpful? Apple can't check app for malicious software


silySuper commented 2 months ago

I already did that; it has no effect.

eval-exec commented 2 months ago

After changing the port, it always shows "connecting"

Could you try starting the ckb process after shutting down the Neuron client?

eval-exec commented 2 months ago

Can we find a faster way to solve this? It has blocked me for about two weeks.

Do you know the absolute path of the ckb 0.114.0 binary file? Could you copy the previous data/db file, initialize ckb in a new directory, and try again?

  1. Initialize ckb configurations in a new_dir
    ./ckb init -C new_dir --chain testnet
  2. copy the previous data/db to new_dir/data/
    cp -R previous_dir/data/db new_dir/data/
  3. start the ckb
    ./ckb run -C new_dir
silySuper commented 2 months ago

This log is from starting the ckb process after shutting down the Neuron client, on the changed port 8115.

logs.zip

silySuper commented 2 months ago

Can we find a faster way to solve this? It has blocked me for about two weeks.

Do you know the absolute path of the ckb 0.114.0 binary file? Could you copy the previous data/db file, initialize ckb in a new directory, and try again?

  1. Initialize ckb configurations in a new_dir
    ./ckb init -C new_dir --chain testnet
  2. copy the previous data/db to new_dir/data/
    cp -R previous_dir/data/db new_dir/data/
  3. start the ckb
    ./ckb run -C new_dir

OK, I will try this.

eval-exec commented 2 months ago

Don't delete anything in the previous_dir just yet, we still need to investigate the cause of the "sync stuck" issue.

If you're able to sync smoothly after initializing the ckb configuration file in a new directory and using the copied data/db from before, it might be the case that some other temporary file in previous_dir is causing an issue.

Then we can investigate further in the previous_dir to see what exactly the problem was. First, back up the entire previous_dir: cp -R previous_dir previous_dir_backup

  1. Try moving data/network/peer_store/addr_manager.db to a different location, then start ckb with ckb run and see what happens.
  2. Try moving data/network/peer_store/ban_list.db to a different location, then start ckb with ckb run and see what happens.
  3. Try moving data/tx_pool to a different location, then start ckb with ckb run and see what happens.
  4. Try moving data/indexer to a different location, then start ckb with ckb run and see what happens.
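
Each of the steps above is the same operation on a different path, so it could be scripted roughly like this. This is a sketch only: `set_aside` is a hypothetical helper, the holding directory name is arbitrary, and the ckb process should be stopped before moving anything.

```shell
# Move one file or directory into a holding directory, keeping its name,
# so it can be restored later if it turns out not to be the culprit.
set_aside() {
    src="$1"
    holding="$2"
    if [ ! -e "$src" ]; then
        echo "not found: $src" >&2
        return 1
    fi
    mkdir -p "$holding" && mv "$src" "$holding/"
}
```

For example, `set_aside previous_dir/data/tx_pool previous_dir_set_aside`, then start ckb with `ckb run` and check whether sync resumes before trying the next path.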
silySuper commented 2 months ago

The ckb node works fine now, but Neuron cannot sync.

Screenshot 2024-05-16 10 42 42
eval-exec commented 2 months ago

but Neuron cannot sync

What does this mean? What message does Neuron display? Have you tried again with the --indexer argument added?

./ckb run -C new_dir --indexer
silySuper commented 2 months ago
Screenshot 2024-05-16 10 56 10

Ok now

eval-exec commented 2 months ago

Screenshot 2024-05-16 10 56 10 Ok now

It appears that the Neuron client isn't connecting to the ckb process you started in new_dir. In your new_dir, the latest tip block should be higher than 13,245,789, but the Neuron client is showing "Block Synced is 24,584."
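
The RPC responses in the debug log earlier in this thread report block numbers in hex (for example "0xc9ed7d"), while Neuron shows decimal heights. A small conversion helper makes the two easier to compare. `hex_to_dec` is a hypothetical name, and the sketch relies on the shell's `printf` accepting `0x`-prefixed integers.

```shell
# Convert a 0x-prefixed hex block number to decimal.
hex_to_dec() {
    printf '%d\n' "$1"
}
```

For example, `hex_to_dec 0xc9ed7d` prints 13233533, the tip height the node reported in the 2024-05-15 log.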

silySuper commented 2 months ago

Yes, but my port is the same as in ckb.toml. I start the node with:

./ckb run -C /Volumes/My\ Passport/ckb_v0.114.0_aarch64-apple-darwin-portable/testnetwork --indexer

silySuper commented 2 months ago
Screenshot 2024-05-16 11 38 11

This is the latest node log.

eval-exec commented 2 months ago

Screenshot 2024-05-16 10 56 10 Ok now

What's Neuron's sync progress now?

silySuper commented 2 months ago
Screenshot 2024-05-16 11 41 55
eval-exec commented 2 months ago
Screenshot 2024-05-16 11 41 55

Could you run the command ps -eF | grep ckb to check if there are two ckb nodes running on your local machine? I suspect the ckb node that Neuron is connecting to is not the one you're running in /Volumes/My\ Passport/ckb_v0.114.0_aarch64-apple-darwin-portable/testnetwork.
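
Note that the `-F` flag comes from the Linux procps `ps` and may not be accepted by the BSD `ps` on macOS. A more portable sketch is to list all processes with POSIX options and filter for running ckb nodes. `filter_ckb_nodes` is a hypothetical helper, and the pattern assumes the node was started with a command line containing `ckb run`, as in this thread.

```shell
# Read "pid command" lines on stdin and keep those that look like a running
# ckb node, i.e. the command line contains "ckb run".
filter_ckb_nodes() {
    grep -E '(^|[ /])ckb run( |$)'
}
```

For example: `ps -A -o pid,args | filter_ckb_nodes`. More than one line of output would suggest two nodes are running.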

silySuper commented 2 months ago
Screenshot 2024-05-16 11 44 40
eval-exec commented 2 months ago

It appears the sync progress that Neuron displays pertains to the Indexer.

Previously, you only copied data/db from previous_dir to /Volumes/My\ Passport/ckb_v0.114.0_aarch64-apple-darwin-portable/testnetwork, which doesn't include the Indexer's data.

First, stop ckb process.

Now, you can move the Indexer data that has already synced to 10.63% in /Volumes/My\ Passport/ckb_v0.114.0_aarch64-apple-darwin-portable/testnetwork to another location:

mv  /Volumes/My\ Passport/ckb_v0.114.0_aarch64-apple-darwin-portable/testnetwork/data/indexer  backup_indexer

Then, move data/indexer from previous_dir to /Volumes/My\ Passport/ckb_v0.114.0_aarch64-apple-darwin-portable/testnetwork/data/indexer:

mv previous_dir/data/indexer /Volumes/My\ Passport/ckb_v0.114.0_aarch64-apple-darwin-portable/testnetwork/data/

Then start the ckb node and then check the sync progress in Neuron. It should now show over 90%.

silySuper commented 2 months ago
Screenshot 2024-05-16 12 04 37

OK now

eval-exec commented 2 months ago

Can we find a faster way to solve this? It has blocked me for about two weeks.

That's great, you can continue working on this testnet node.

If you have time, shall we continue to investigate the root cause of the sync being stuck? https://github.com/nervosnetwork/ckb/issues/4462#issuecomment-2113895635.

silySuper commented 2 months ago

OK, no problem.

silySuper commented 2 months ago

After running for a while, it shows: zsh: Input/output error: ./ckb

Screenshot 2024-05-16 14 25 01
silySuper commented 2 months ago

The problem is caused by the hard disk not responding. I found that the hard disk appeared empty (null); after I disconnected and reconnected it, it was OK.
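
A quick pre-flight check before starting the node might catch this earlier. This is a rough sketch: `data_dir_ok` is a hypothetical helper, and it only tests that the directory exists and can be listed, which will not detect every kind of disk fault.

```shell
# Succeed only if the directory exists and its contents can be listed.
# A volume that has dropped off is expected to fail one of these checks.
data_dir_ok() {
    [ -d "$1" ] && ls "$1" >/dev/null 2>&1
}
```

For example: `data_dir_ok "/Volumes/My Passport/ckb_v0.114.0_aarch64-apple-darwin-portable/testnetwork" && echo "disk reachable"`.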

15168316096 commented 2 months ago

We previously verified starting a full node on a Mac with a solid-state drive and did not encounter any related problems. Could you share the local environment in which you run the Neuron wallet, such as the Neuron version, the macOS version, and the SSD details? Alternatively, we could set up a call to check your local environment directly.

eval-exec commented 1 month ago

The problem is caused by the hard disk not responding. I found that the hard disk appeared empty (null); after I disconnected and reconnected it, it was OK.

Hello. Has this issue not recurred since you reconnected the hard drive? Are you able to reproduce this problem in your original environment again?