Problem is that after restarting a node it much longer than expected take time to connect.
node1 is always running, node2 is one which is reconnecting
After node2 is disconnecting, connLoop constantly retries this connection up until it either will be READY again or after 30*3s will pass.
What happens if node2 is getting started again - node2 is dialing node1, grpc dial is successful, credentials check as well, so we store this connection to peers and start connLoop goroutine. Meanwhile node1 is trying to create its client back to node2 in async way, but it fails, since it ALREADY HAS connection to it in peers map and its broken (its in transient failure state and did not yet recover). So node1 is closing new connection from node2. Then somewhere around same time node2 after dialing node1 tries to Ping it using closed connection so ping fails and now both node1 and node2 have connections to each other which are broken and being constantly retried. After 30 tries one of the nodes creates new connection, but, if other node already has broken connection to it nothing will happen and this loop will continue. Situation resolves, but non deterministic and only when both nodes are getting fresh connections at the same time.
I agree with proposed by @rkeene solution, which is in this PR - do not close incoming connection if we're not able to dial it back. This way incoming connection remains and node which tried to connect to us will have working connection and will be able to gather information from us even if we do not benefit from this communication. Meanwhile if we already have connection to this node but it in failure state (like in example above) it will be recovered by means of grpc very shortly. This also resolves cases if user's computers do not have required ports open for example.
Problem is that after restarting a node it much longer than expected take time to connect.
node1 is always running, node2 is one which is reconnecting After node2 is disconnecting,
connLoop
constantly retries this connection up until it either will be READY again or after 30*3s will pass.What happens if node2 is getting started again - node2 is dialing node1, grpc dial is successful, credentials check as well, so we store this connection to peers and start
connLoop
goroutine. Meanwhile node1 is trying to create its client back to node2 in async way, but it fails, since it ALREADY HAS connection to it inpeers
map and its broken (its in transient failure state and did not yet recover). So node1 is closing new connection from node2. Then somewhere around same time node2 after dialing node1 tries toPing
it using closed connection so ping fails and now both node1 and node2 have connections to each other which are broken and being constantly retried. After 30 tries one of the nodes creates new connection, but, if other node already has broken connection to it nothing will happen and this loop will continue. Situation resolves, but non deterministic and only when both nodes are getting fresh connections at the same time.I agree with proposed by @rkeene solution, which is in this PR - do not close incoming connection if we're not able to dial it back. This way incoming connection remains and node which tried to connect to us will have working connection and will be able to gather information from us even if we do not benefit from this communication. Meanwhile if we already have connection to this node but it in failure state (like in example above) it will be recovered by means of grpc very shortly. This also resolves cases if user's computers do not have required ports open for example.