rsksmart / rskj

RSKj is a Java implementation of the Rootstock protocol.
https://rootstock.io
GNU Lesser General Public License v3.0

New testnet node not starting initial block sync #2183

Open raucao opened 8 months ago

raucao commented 8 months ago

I set up two new VMs containing one testnet and one mainnet node on a new host. The mainnet node started initial sync normally after a short while. However, the testnet node has not started syncing a single block after 2 weeks of trying.

The logs will occasionally report bootstrap peers having been found, but there are no errors or warnings reporting any issues with sync.

2023-11-19-15:59:19.414 DEBUG [c.r.n.d.PeerExplorer]  New Peer found or id changed: ip[bootstrap02.testnet.rsk.co] port[50505] id [NodeID{d6f2321604dcb136d2bb14eae2ebdacffc175d5bc2ae56a8a1e053e62edd38a072c9c616bc3c7bf73455f503aa3a2589cf643798f31053d8ff42aba4c1c1394b}]
2023-11-19-15:59:19.476 DEBUG [c.r.n.d.PeerExplorer]  New Peer found or id changed: ip[bootstrap06.testnet.rsk.co] port[50505] id [NodeID{78ba2785b14576495a8299952dab590af00c43cdedf1a398a162b98479ceec2267bf579ee8dd0ccdc27ac7dc416c1f681ee7955ac0e1b42b0b00cc6663177de9}]
2023-11-19-15:59:19.477 DEBUG [c.r.n.d.PeerExplorer]  New Peer found or id changed: ip[bootstrap01.testnet.rsk.co] port[50505] id [NodeID{0b9ccb3896217ad9208cdd40c6a867a88aa0024f26ee04f78eec30fb6030a5d79639adb687667d70df3c79a4c4a8e99be766ca4386d18298af0a00af1dae9da4}]
2023-11-19-15:59:19.556 DEBUG [c.r.n.d.PeerExplorer]  New Peer found or id changed: ip[bootstrap03.testnet.rsk.co] port[50505] id [NodeID{42c45daab94c3b38daa5b323274cc80ae46741e420ea91f96be48c3711b861904458a8d50531f25f355a407c156a93adc22f714befc0919f094cc82d375b8c74}]
2023-11-19-15:59:19.559 DEBUG [c.r.n.d.PeerExplorer]  New Peer found or id changed: ip[bootstrap05.testnet.rsk.co] port[50505] id [NodeID{af7db902f8b1713b2d6b9e57d28928d519bfd67882d91fc3ff891bbafbd0fbfd986e8724fda8d0187c4459c00605e8f203f320d8cbf43f4e484129daeb756d4e}]
2023-11-19-15:59:19.572 DEBUG [c.r.n.d.PeerExplorer]  New Peer found or id changed: ip[bootstrap04.testnet.rsk.co] port[50505] id [NodeID{fd403b30e37f1bee50b03d7a71a1af82f2af9ee95fd49add15803eec94947f51f6235ec3fdc0c3aa18359933729c2a1514f9ae18ea07fc8f5219578ce87b6bd4}]
2023-11-19-15:59:20.284 DEBUG [messageProcess]  Queued Messages: 0
2023-11-19-15:59:21.285 DEBUG [messageProcess]  Queued Messages: 0
2023-11-19-15:59:21.555 DEBUG [discover]  6 Nodes retrieved from the PE.
2023-11-19-15:59:22.285 DEBUG [messageProcess]  Queued Messages: 0
2023-11-19-15:59:23.285 DEBUG [messageProcess]  Queued Messages: 0
2023-11-19-15:59:24.285 DEBUG [messageProcess]  Queued Messages: 0
2023-11-19-15:59:24.555 DEBUG [discover]  6 Nodes retrieved from the PE.
2023-11-19-15:59:25.285 DEBUG [messageProcess]  Queued Messages: 0

How can I debug why sync is not starting?

Vovchyk commented 8 months ago

From what I can see in the logs, your node is struggling to discover network peers other than the bootstrap ones (a.k.a. bootstrap-only nodes). On testnet, the bootstrap nodes provide only one service: discovery of other peers. That is why your node could not fetch blocks and start syncing; it needs to be connected to at least one "full" node.
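One way to confirm this from the logs is to list every peer reported by `PeerExplorer` and filter out the bootstrap hosts. The following is a minimal sketch, not part of RSKj: the class and method names are hypothetical, and it assumes that bootstrap hosts follow the `bootstrapNN.testnet.rsk.co` naming seen in the log excerpt above and that the DEBUG line format matches that excerpt.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of RSKj.
public class DiscoveredPeers {
    // Captures the host inside "ip[...]" on PeerExplorer "New Peer found" lines.
    private static final Pattern PEER_LINE = Pattern.compile(
            "PeerExplorer\\]\\s+New Peer found or id changed: ip\\[([^\\]]+)\\]");

    // Assumption: bootstrap hosts are named like bootstrap02.testnet.rsk.co.
    private static final Pattern BOOTSTRAP_HOST =
            Pattern.compile("bootstrap\\d+\\.testnet\\.rsk\\.co");

    // Returns the distinct discovered hosts that are not bootstrap nodes.
    static Set<String> nonBootstrapPeers(List<String> logLines) {
        Set<String> result = new LinkedHashSet<>();
        for (String line : logLines) {
            Matcher m = PEER_LINE.matcher(line);
            if (m.find()) {
                String host = m.group(1);
                if (!BOOTSTRAP_HOST.matcher(host).matches()) {
                    result.add(host);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Usage: java DiscoveredPeers /path/to/rsk.log
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        Set<String> others = nonBootstrapPeers(lines);
        if (others.isEmpty()) {
            System.out.println("Only bootstrap peers were discovered.");
        } else {
            System.out.println("Non-bootstrap peers: " + others);
        }
    }
}
```

If the result is empty after the node has been running for a while, discovery has only ever returned bootstrap nodes, which matches the symptom described in this issue.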

The reason your node wasn't able to discover full nodes might be that the so-called "buckets" in the boot nodes' internal tables that correspond to your nodeId were already filled up with other peers. This is usually rare, but it does happen. You can find more info on how the discovery protocol works, for instance, here.
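To illustrate the "full bucket" situation, here is a simplified Kademlia-style routing table. It is a sketch under stated assumptions, not RSKj's actual discovery code: the class name, the tiny integer IDs, and the bucket size of 2 are all made up for the example, and a full bucket simply rejects newcomers (real implementations typically ping older entries before deciding).

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified routing table; not RSKj's implementation.
public class BucketSketch {
    private final BigInteger localId;
    private final int bucketSize;
    private final Map<Integer, List<BigInteger>> buckets = new HashMap<>();

    BucketSketch(BigInteger localId, int bucketSize) {
        this.localId = localId;
        this.bucketSize = bucketSize;
    }

    // Bucket index = position of the highest bit in which the two IDs differ
    // (XOR distance). Returns -1 when the IDs are identical.
    static int bucketIndex(BigInteger a, BigInteger b) {
        return a.xor(b).bitLength() - 1;
    }

    // Tries to store a peer; returns false if its bucket is already full.
    boolean offer(BigInteger peerId) {
        int idx = bucketIndex(localId, peerId);
        if (idx < 0) {
            return false; // our own ID is never stored
        }
        List<BigInteger> bucket = buckets.computeIfAbsent(idx, i -> new ArrayList<>());
        if (bucket.contains(peerId)) {
            return true; // already known
        }
        if (bucket.size() >= bucketSize) {
            return false; // simplification: a full bucket rejects newcomers
        }
        bucket.add(peerId);
        return true;
    }

    public static void main(String[] args) {
        BucketSketch bootNode = new BucketSketch(BigInteger.ZERO, 2);
        // IDs 8, 9 and 10 all first differ from 0 at bit 3, so they share a bucket.
        System.out.println(bootNode.offer(BigInteger.valueOf(8)));  // true
        System.out.println(bootNode.offer(BigInteger.valueOf(9)));  // true
        System.out.println(bootNode.offer(BigInteger.valueOf(10))); // false
        // ID 1 differs at bit 0 and lands in a different, empty bucket.
        System.out.println(bootNode.offer(BigInteger.valueOf(1)));  // true
    }
}
```

In this toy model, the peer with ID 10 is the unlucky newcomer: the bucket matching its distance from the boot node is already full, so the boot node does not store it.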

We are now working on improving the UX of the initial bootstrapping process on testnet, e.g. we recently increased the number of boot nodes in this PR, but that hasn't been released yet.

What you can do for now is the following:

raucao commented 8 months ago

Thank you for the explanation.

I have tried all of the suggested mitigations, but without luck so far. I had already restarted the node multiple times before, because I remembered how that seemed to fix sync not starting in the past.

I do still have another testnet node running on a different machine, and also added that to the bootstrap list. I do get a couple more IP addresses that aren't the normal bootstrap nodes now, but still no sync. Is there a way that I can tell my own nodes to immediately sync with each other perhaps? They are both on the same private network, so maybe I could even add a rule for prioritized networks or something?

Vovchyk commented 7 months ago

Yes, you can specify nodes that you trust or that you want to connect to in your config file. This should also help with bootstrapping and starting a sync. Check out the config sections.

raucao commented 7 months ago

Great! Adding the new node to the existing node's trusted list, and connecting by default to it from the new node solved my issue. Thank you!

The reported original issue still exists when you do not have a trusted node to connect to, but just want to start initial sync via the normal discovery process. So I think this should be kept open until it's been confirmed to work reliably for new users.