rsksmart / rskj

RSKj is a Java implementation of the Rootstock protocol.
https://rootstock.io
GNU Lesser General Public License v3.0
669 stars 265 forks source link

Fails connecting to peers #1726

Open ko0f opened 2 years ago

ko0f commented 2 years ago

Version 3.2.0 running on docker, fails to connect to any peer, including bootstrap nodes.

docker build -t rsk:latest -f Dockerfile . --no-cache
docker run -it -p 4444:4444 -p 4445:4445 -p 5050:5050 -v /rsk:/rsk rsk:latest

startup log -

2022-03-03-07:10:37.186 INFO [config]  Config ( yes ): user properties from -Drsk.conf.file file '/etc/rsk/node.conf'
2022-03-03-07:10:37.189 INFO [config]  Config ( yes ): default properties from installer '/etc/rsk/node.conf'
2022-03-03-07:10:37.206 ERROR [config]  Unexpected config value SimpleConfigObject({"enabled":true}) for key path `prune`. See expected.conf for the expected settings
2022-03-03-07:10:37.957 INFO [general]  *** Loaded up to block [4125457] totalDifficulty [5658280748780346201436008175] with stateRoot [254391daa6cafbd2c4b1b1d11b46fc3671ae4de9ef20616dd85f8ee46f69d9d4]
2022-03-03-07:10:37.963 INFO [general]  Public IP wasn't set or resolved, using https://checkip.amazonaws.com to identify it...
2022-03-03-07:10:38.810 INFO [general]  Identified public IP: /rem.oved.my.ip
2022-03-03-07:10:41.429 WARN [cors]  CORS header set to '*'
2022-03-03-07:10:41.683 INFO [fullnoderunner]  Starting RSK
2022-03-03-07:10:41.683 INFO [fullnoderunner]  Running rsk-mainnet.json,  core version: 3.2.0-IRIS
2022-03-03-07:10:41.683 INFO [fullnoderunner]  git.hash: [a4e57af]
2022-03-03-07:10:41.683 INFO [fullnoderunner]  build.branch: HEAD
2022-03-03-07:10:41.689 INFO [c.r.r.n.Web3HttpServer]  RPC HTTP enabled
2022-03-03-07:10:41.726 INFO [c.r.r.n.Web3WebSocketServer]  RPC WebSocket enabled
2022-03-03-07:10:41.728 INFO [c.r.n.d.UDPServer]  Discovery UDPListener started
2022-03-03-07:10:41.731 INFO [fullnoderunner]  Capability eth version: [V62]
2022-03-03-07:10:41.731 INFO [fullnoderunner]  done

but then it fails to connect to any peer, just keeps connecting and disconnecting right away -

2022-03-03-07:12:48.703 INFO [sync]  Peer bb702df16de16b55b8c812b8f05b647577d74da9d198f5cd5c9165170aba360feb4b53b54163ed1a16c9f59ff6cbf0ddf122a7cd9e701aed57953bbe72b7fd59: added to pool
2022-03-03-07:12:48.703 INFO [sync]  Peer 1d829f2e82914b31826f60589503b6727a9683207f61b8ce423bee7a77837c555bf8e7010503b393cc3141e770c3183efe74eb896e7bc46e8fdcfc3c02f43c47: added to pool
2022-03-03-07:12:48.703 INFO [sync]  Peer 6236222071ff9b4c3ddd4ce1230ad725b11fbcdb17d607f34843f8cb2e48b5fff6c27c92fc932c187ae6746a962f7103b8a57b04709104c37c318ba19c8093ae: added to pool
2022-03-03-07:12:48.703 INFO [sync]  Peer b93f4119888867d300eaaabdcfed2cfe952e09aa82c4be5ca9707a9195fec6153d5ec515b5351828d62051c3a0975afe4c68a10b3eb38f37b8989aba8a81b086: added to pool
2022-03-03-07:12:48.703 INFO [sync]  Peer 8b8f07ba446f9a5961f307977dc4f3b95368dbbcbca1dbdd5b115fc195eb92cdf49b98cb55da00253d654b84e8f3edde18e5280b54f1ccae20f84381a3bd7368: added to pool
2022-03-03-07:12:48.703 INFO [sync]  Peer 434f8932835719d4525b2bdaf8adb6b42b1865f36a98008070dc0fdc6859b199f7cc2145197943b58cec370cb1a9ac9ef191b1f9c3318f08642daaaddca94586: added to pool
2022-03-03-07:12:48.752 INFO [sync]  Peer 434f8932835719d4525b2bdaf8adb6b42b1865f36a98008070dc0fdc6859b199f7cc2145197943b58cec370cb1a9ac9ef191b1f9c3318f08642daaaddca94586: disconnected
2022-03-03-07:12:48.913 INFO [sync]  Peer 1d829f2e82914b31826f60589503b6727a9683207f61b8ce423bee7a77837c555bf8e7010503b393cc3141e770c3183efe74eb896e7bc46e8fdcfc3c02f43c47: disconnected
2022-03-03-07:12:49.076 INFO [sync]  Peer b93f4119888867d300eaaabdcfed2cfe952e09aa82c4be5ca9707a9195fec6153d5ec515b5351828d62051c3a0975afe4c68a10b3eb38f37b8989aba8a81b086: disconnected
2022-03-03-07:12:49.401 INFO [sync]  Peer 6236222071ff9b4c3ddd4ce1230ad725b11fbcdb17d607f34843f8cb2e48b5fff6c27c92fc932c187ae6746a962f7103b8a57b04709104c37c318ba19c8093ae: disconnected
2022-03-03-07:12:49.524 INFO [sync]  Peer bb702df16de16b55b8c812b8f05b647577d74da9d198f5cd5c9165170aba360feb4b53b54163ed1a16c9f59ff6cbf0ddf122a7cd9e701aed57953bbe72b7fd59: disconnected
2022-03-03-07:12:49.544 INFO [sync]  Peer 8b8f07ba446f9a5961f307977dc4f3b95368dbbcbca1dbdd5b115fc195eb92cdf49b98cb55da00253d654b84e8f3edde18e5280b54f1ccae20f84381a3bd7368: disconnected
Vovchyk commented 2 years ago

hi @ko0f, could you check this docker file? I did it on my end and it works.

ko0f commented 2 years ago

Yes @Vovchyk, that's the one I use.

Internet works, I have another RSK node with older version which works on the same network.

What possible reasons for instant disconnect are there?

Vovchyk commented 2 years ago

@ko0f, I can think of some issue with your network / firewall or docker configuration. First of all, I'd try to run rskj node outside docker to see if it's related to your network. Btw, you can use net_peerCount / net_peerList JSON_RPC methods to see how many peers have been connected, e.g. curl -X POST 'http://localhost:4444' -H 'Content-Type: application/json' --data '{ "jsonrpc":"2.0", "method":"net_peerCount", "params":[], "id":1 }'. web3_clientVersion and net_version are also handy api methods.

ko0f commented 2 years ago

Peer count {"jsonrpc":"2.0","id":67,"result":"0x0"} Client version RskJ/3.2.0/Linux/Java1.8/IRIS-a4e57af Net version 0

Other docker containers seem to work well. I've also opened a shell for RSK docker docker run -it -v "/rsk:/rsk" rsk:latest /bin/bash and pinged 1.1.1.1, got a pong.

ko0f commented 2 years ago

Here's a sample of the log -

2022-03-10-12:43:39.634 INFO [sync]  Connecting to: bootstrap09.rsk.co:5050
2022-03-10-12:43:39.651 INFO [sync]  Connecting to: bootstrap10.rsk.co:5050
2022-03-10-12:43:39.672 INFO [sync]  Connecting to: bootstrap03.rsk.co:5050
2022-03-10-12:43:39.688 INFO [sync]  Connecting to: bootstrap15.rsk.co:5050
2022-03-10-12:43:40.064 DEBUG [messageProcess]  Queued Messages: 0
2022-03-10-12:43:40.151 DEBUG [peerScoring]  Recorded SUCCESSFUL_HANDSHAKE. NodeID{4e906070623d61dc9d1c5c249f092541516168894db8d2fe277d2a6fe08b65642dfa7c756cbae95543b867eb475496616ab314652c8b0eb64474878a765491e8}, Address: 147.182.152.57
2022-03-10-12:43:40.172 DEBUG [messageProcess]  Queued Messages: 0
2022-03-10-12:43:40.172 DEBUG [syncprocessor]  Receiving syncState from node NodeID{4e906070623d61dc9d1c5c249f092541516168894db8d2fe277d2a6fe08b65642dfa7c756cbae95543b867eb475496616ab314652c8b0eb64474878a765491e8} block 4145354 549f552b968de392a6e8feb0d7c20608c85b80514fcdb8a202160d366140026e
2022-03-10-12:43:40.172 DEBUG [messageProcess]  Message[STATUS_MESSAGE] processed after [0.000061] seconds.
2022-03-10-12:43:40.424 INFO [sync]  Peer 4e906070623d61dc9d1c5c249f092541516168894db8d2fe277d2a6fe08b65642dfa7c756cbae95543b867eb475496616ab314652c8b0eb64474878a765491e8: added to pool
2022-03-10-12:43:41.041 INFO [sync]  Peer 4e906070623d61dc9d1c5c249f092541516168894db8d2fe277d2a6fe08b65642dfa7c756cbae95543b867eb475496616ab314652c8b0eb64474878a765491e8: disconnected
2022-03-10-12:43:41.172 DEBUG [messageProcess]  Queued Messages: 0
2022-03-10-12:43:42.173 DEBUG [messageProcess]  Queued Messages: 0
2022-03-10-12:43:42.702 DEBUG [discover]  14 Nodes retrieved from the PE.
2022-03-10-12:43:42.702 INFO [sync]  Connecting to: bootstrap03.rsk.co:5050
2022-03-10-12:43:42.702 INFO [sync]  Connecting to: bootstrap14.rsk.co:5050
2022-03-10-12:43:42.714 INFO [sync]  Connecting to: bootstrap04.rsk.co:5050
2022-03-10-12:43:42.726 INFO [sync]  Connecting to: bootstrap07.rsk.co:5050
2022-03-10-12:43:43.173 DEBUG [messageProcess]  Queued Messages: 0
2022-03-10-12:43:44.174 DEBUG [messageProcess]  Queued Messages: 0
2022-03-10-12:43:45.174 DEBUG [messageProcess]  Queued Messages: 0
2022-03-10-12:43:45.743 DEBUG [discover]  14 Nodes retrieved from the PE.
2022-03-10-12:43:45.743 INFO [sync]  Connecting to: bootstrap05.rsk.co:5050
2022-03-10-12:43:45.754 INFO [sync]  Connecting to: bootstrap01.rsk.co:5050
2022-03-10-12:43:46.174 DEBUG [messageProcess]  Queued Messages: 0
2022-03-10-12:43:46.632 DEBUG [peerScoring]  Recorded SUCCESSFUL_HANDSHAKE. NodeID{eb08c63444cdb9b71608bee19adc7fa119ec68bf04b96a5718757bc5818a49dc448b8f7e9eff5b7bb60cbc940bd2a72b1e9d0be7567e12eaa83a068ed669c910}, Address: 13.70.177.106
2022-03-10-12:43:46.652 DEBUG [messageProcess]  Queued Messages: 0
2022-03-10-12:43:46.652 DEBUG [syncprocessor]  Receiving syncState from node NodeID{eb08c63444cdb9b71608bee19adc7fa119ec68bf04b96a5718757bc5818a49dc448b8f7e9eff5b7bb60cbc940bd2a72b1e9d0be7567e12eaa83a068ed669c910} block 4145355 79f01ebe551382a88fceca9bf5e3eb5546d8786c7fe2cb76625277997a991d21
2022-03-10-12:43:46.652 DEBUG [messageProcess]  Message[STATUS_MESSAGE] processed after [0.000123] seconds.
2022-03-10-12:43:47.653 DEBUG [messageProcess]  Queued Messages: 0
2022-03-10-12:43:48.653 DEBUG [messageProcess]  Queued Messages: 0
2022-03-10-12:43:48.768 DEBUG [discover]  14 Nodes retrieved from the PE.
2022-03-10-12:43:48.768 INFO [sync]  Connecting to: bootstrap13.rsk.co:5050
2022-03-10-12:43:49.425 DEBUG [peerScoring]  Recorded SUCCESSFUL_HANDSHAKE. NodeID{9f27594562d2a11642d8737821a21771abd87e7b2587c6cb4fb778823153aa6e586fdee2b3b31fb999185185259a0b6d150c95db0b3369314fab8f88aeb6b874}, Address: 13.52.239.16
2022-03-10-12:43:49.445 DEBUG [messageProcess]  Queued Messages: 0
2022-03-10-12:43:49.445 DEBUG [syncprocessor]  Receiving syncState from node NodeID{9f27594562d2a11642d8737821a21771abd87e7b2587c6cb4fb778823153aa6e586fdee2b3b31fb999185185259a0b6d150c95db0b3369314fab8f88aeb6b874} block 4145355 79f01ebe551382a88fceca9bf5e3eb5546d8786c7fe2cb76625277997a991d21
2022-03-10-12:43:49.446 DEBUG [messageProcess]  Message[STATUS_MESSAGE] processed after [0.000059] seconds.
2022-03-10-12:43:50.446 DEBUG [messageProcess]  Queued Messages: 0
2022-03-10-12:43:51.446 DEBUG [messageProcess]  Queued Messages: 0
2022-03-10-12:43:51.788 DEBUG [discover]  14 Nodes retrieved from the PE.
2022-03-10-12:43:51.788 INFO [sync]  Connecting to: bootstrap02.rsk.co:5050
2022-03-10-12:43:51.806 INFO [sync]  Connecting to: bootstrap08.rsk.co:5050
2022-03-10-12:43:51.873 INFO [sync]  Connecting to: bootstrap12.rsk.co:5050
2022-03-10-12:43:52.141 DEBUG [peerScoring]  Recorded SUCCESSFUL_HANDSHAKE. NodeID{509af6b66c0c20cded9fe73024225731ba961e85cc241878a84f03c1c2e0614f0cbf5eea104e140d936a59db84aa5f95cffdd1ce018d4e8fdde22f514452a247}, Address: 34.252.232.189
2022-03-10-12:43:52.162 DEBUG [messageProcess]  Queued Messages: 0
2022-03-10-12:43:52.162 DEBUG [syncprocessor]  Receiving syncState from node NodeID{509af6b66c0c20cded9fe73024225731ba961e85cc241878a84f03c1c2e0614f0cbf5eea104e140d936a59db84aa5f95cffdd1ce018d4e8fdde22f514452a247} block 4145355 79f01ebe551382a88fceca9bf5e3eb5546d8786c7fe2cb76625277997a991d21
2022-03-10-12:43:52.162 DEBUG [messageProcess]  Message[STATUS_MESSAGE] processed after [0.000063] seconds.
2022-03-10-12:43:52.426 INFO [sync]  Peer 509af6b66c0c20cded9fe73024225731ba961e85cc241878a84f03c1c2e0614f0cbf5eea104e140d936a59db84aa5f95cffdd1ce018d4e8fdde22f514452a247: added to pool
2022-03-10-12:43:52.536 DEBUG [peerScoring]  Recorded SUCCESSFUL_HANDSHAKE. NodeID{34e3a8552f8af05ce564acfab0b1754774f9ed4c1e0413f179b94026a450f9af17b9a92b1fe8ba26ea51f59787bf88bf6592cf6d48e624da13392da17af94978}, Address: 13.56.115.33
2022-03-10-12:43:52.564 INFO [sync]  Peer 509af6b66c0c20cded9fe73024225731ba961e85cc241878a84f03c1c2e0614f0cbf5eea104e140d936a59db84aa5f95cffdd1ce018d4e8fdde22f514452a247: disconnected
2022-03-10-12:43:52.571 DEBUG [peerScoring]  Recorded SUCCESSFUL_HANDSHAKE. NodeID{8b8f07ba446f9a5961f307977dc4f3b95368dbbcbca1dbdd5b115fc195eb92cdf49b98cb55da00253d654b84e8f3edde18e5280b54f1ccae20f84381a3bd7368}, Address: 191.232.162.215
2022-03-10-12:43:52.590 DEBUG [messageProcess]  Queued Messages: 0
2022-03-10-12:43:52.590 DEBUG [syncprocessor]  Receiving syncState from node NodeID{8b8f07ba446f9a5961f307977dc4f3b95368dbbcbca1dbdd5b115fc195eb92cdf49b98cb55da00253d654b84e8f3edde18e5280b54f1ccae20f84381a3bd7368} block 4145355 79f01ebe551382a88fceca9bf5e3eb5546d8786c7fe2cb76625277997a991d21
2022-03-10-12:43:52.590 DEBUG [messageProcess]  Message[STATUS_MESSAGE] processed after [0.000061] seconds.
2022-03-10-12:43:52.746 DEBUG [messageProcess]  Queued Messages: 0
2022-03-10-12:43:52.746 DEBUG [syncprocessor]  Receiving syncState from node NodeID{34e3a8552f8af05ce564acfab0b1754774f9ed4c1e0413f179b94026a450f9af17b9a92b1fe8ba26ea51f59787bf88bf6592cf6d48e624da13392da17af94978} block 4145355 79f01ebe551382a88fceca9bf5e3eb5546d8786c7fe2cb76625277997a991d21
2022-03-10-12:43:52.746 DEBUG [messageProcess]  Message[STATUS_MESSAGE] processed after [0.000127] seconds.
2022-03-10-12:43:53.426 INFO [sync]  Peer 34e3a8552f8af05ce564acfab0b1754774f9ed4c1e0413f179b94026a450f9af17b9a92b1fe8ba26ea51f59787bf88bf6592cf6d48e624da13392da17af94978: added to pool
2022-03-10-12:43:53.485 INFO [sync]  Peer 34e3a8552f8af05ce564acfab0b1754774f9ed4c1e0413f179b94026a450f9af17b9a92b1fe8ba26ea51f59787bf88bf6592cf6d48e624da13392da17af94978: disconnected
2022-03-10-12:43:53.746 DEBUG [messageProcess]  Queued Messages: 0
2022-03-10-12:43:54.747 DEBUG [messageProcess]  Queued Messages: 0
2022-03-10-12:43:54.882 DEBUG [discover]  14 Nodes retrieved from the PE.
2022-03-10-12:43:54.882 INFO [sync]  Connecting to: bootstrap08.rsk.co:5050
2022-03-10-12:43:54.883 INFO [sync]  Connecting to: bootstrap12.rsk.co:5050
2022-03-10-12:43:55.056 DEBUG [peerScoring]  Recorded FAILED_HANDSHAKE. NodeID{509af6b66c0c20cded9fe73024225731ba961e85cc241878a84f03c1c2e0614f0cbf5eea104e140d936a59db84aa5f95cffdd1ce018d4e8fdde22f514452a247}, Address: 34.252.232.189
2022-03-10-12:43:55.747 DEBUG [messageProcess]  Queued Messages: 0
2022-03-10-12:43:56.747 DEBUG [messageProcess]  Queued Messages: 0
2022-03-10-12:43:57.748 DEBUG [messageProcess]  Queued Messages: 0
2022-03-10-12:43:57.883 DEBUG [discover]  14 Nodes retrieved from the PE.
2022-03-10-12:43:58.748 DEBUG [messageProcess]  Queued Messages: 0
ko0f commented 2 years ago

This may have been an unidentified connectivity issue, it stopped happening.

ko0f commented 2 years ago

This happens again now with one of my nodes - I have two, one is working fine, the other fails to find peers for hours now. Nothing in configuration has changed, I've just restarted the server.

Last time it managed to find peers after very long time, even so - it's an issue if it sometimes takes too long to connect to peers.

$ curl -X POST -H "Content-Type: application/json" --data '{"jsonrpc":"2.0","method":"net_peerCount","params":[],"id":67}' http://localhost:4444
{"jsonrpc":"2.0","id":67,"result":"0x0"}
Vovchyk commented 2 years ago

@ko0f , recently we indeed have had some temporary issues with peers connectivity. Please let us know if you're still facing it.

ko0f commented 2 years ago

Seems so, as long as the node is up, all is well. Once restarted it takes a very long time to find peers and some times (like just happened yesterday) it doesn't find any peers and keeps attempting bootstrap nodes, all while the other node I have is up and synced.

I've restarted the node yesterday and now, after a day it managed to find 2 peers and got synced. Just need to avoid restarts. :\

Vovchyk commented 2 years ago

@ko0f thanks for reporting this. We will check this out. This is definitely unexpected behaviour, there should not be such delay in finding peers.

Btw, you're still running the version RskJ/3.2.0/Linux/Java1.8/IRIS-a4e57af of the node in a docker container in mainnet?

You mentioned that you run two nodes and the issue happens only to one of them? Are they running on a single machine in docker? Do you use a default bridge network mode?

Is it true that you are not facing this issue if the node is running outside docker?

Could you please also share details of the environment you use (eg. hardware spec, OS, java version)?

ko0f commented 2 years ago

@Vovchyk this happens on both machines, once rskj is restarted.

Dockerfile -

#
# RSK Node Dockerfile
#

# Pull base image.
FROM ubuntu:latest

MAINTAINER RSK Release <support@rsk.co>

ARG DEBIAN_FRONTEND=noninteractive

RUN apt-get update
RUN apt-get install -y --no-install-recommends openjdk-8-jre supervisor systemd software-properties-common

# Install RSK.
RUN groupadd --gid 888 rsk && useradd rsk -d /var/lib/rsk -s /sbin/nologin --uid=888 --gid=888

RUN \
  add-apt-repository -y ppa:rsksmart/rskj && \
  apt-get update && \
  (echo rskj shared/accepted-rsk-license-v1-1 select true | /usr/bin/debconf-set-selections )&& \
  apt-get install -y --no-install-recommends rskj && \
  apt-get clean && \
  rm -rf /var/lib/apt/lists/*

# Supervisod CONF
COPY supervisord.conf /etc/supervisor/conf.d/supervisord.conf
COPY node.conf /etc/rsk/node.conf
COPY logback.xml /etc/rsk/logback.xml
COPY startrsk /usr/bin/startrsk

RUN chmod +x /usr/bin/startrsk

## MAINNET
EXPOSE 4444
EXPOSE 4445
EXPOSE 5050

# Define default command.
#CMD ["/usr/bin/supervisord"]
CMD ["/usr/bin/startrsk"]

node.conf peer section, which I added during my attempts to fix this -

peer {

    discovery = {
        enabled = true
        ip.list = ["10.0.0.2:5050","bootstrap01.rsk.co:5050","bootstrap02.rsk.co:5050","bootstrap03.rsk.co:5050","bootstrap04.rsk.co:5050","bootstrap05.rsk.co:5050","bootstrap06.rsk.co:5050","bootstrap07.rsk.co:5050","bootstrap08.rsk.co:5050","bootstrap09.rsk.co:5050","bootstrap10.rsk.co:5050","bootstrap11.rsk.co:5050","bootstrap12.rsk.co:5050","bootstrap13.rsk.co:5050","bootstrap14.rsk.co:5050","bootstrap15.rsk.co:5050","bootstrap16.rsk.co:5050","18.166.58.153:5050","8.214.48.216:5050","213.94.62.19:5050","64.225.55.114:5050","52.53.166.254:5050","204.236.246.36:5050"]
       bind.ip = 0.0.0.0
    }
   active = [
    ]

    trusted = [
    ]

    listen.port = 5050
   connection.timeout = 5
   channel.read.timeout = 30
   maxActivePeers = 30
}

Will update to 3.3.0 now and check again.

ko0f commented 2 years ago

It's been two hours and still, with 3.3.0 net_peerCount returns {"jsonrpc":"2.0","id":67,"result":"0x0"}

Vovchyk commented 2 years ago

thanks @ko0f for the update. We've been investigating this.