hyperledger / besu

An enterprise-grade Java-based, Apache 2.0 licensed Ethereum client https://wiki.hyperledger.org/display/besu
https://www.hyperledger.org/projects/besu
Apache License 2.0

Discovery is not taking NATManager portMapping into account #6573

Open daanporon opened 8 months ago

daanporon commented 8 months ago

Description

This bug report is related to the limitations described here, but I wanted to write down my findings; I'm not sure whether these can be resolved, or whether I am missing something.

We are encountering an issue while setting up Besu networks on different Kubernetes clusters with the aim of enabling discovery between nodes across clusters.

We set up one node on cluster A and a second node on cluster B. When starting the node on cluster B, we configured the enode of the node on cluster A as a bootnode. Both nodes have a LoadBalancer configured with a UDP discovery port that differs from the TCP rlpx port, as described in the limitations. In addition, we configured the KubernetesNATManager, which discovers the port mapping from the LoadBalancer. All of this seems to work correctly: admin_nodeInfo returns the right enode URL, with the discport that was found by the KubernetesNATManager.

The problems start when the discovery mechanism kicks in. The PING/PONG exchange does not take the port forwarding from the KubernetesNATManager into account: you can see the node on cluster B sending a PING with the "wrong" udpPort. I put wrong in quotes because it is the udpPort configured on the Besu service itself, just not the one exposed by the LoadBalancer.

<<< Sending PING  packet to peer 0x5cd80bca79f839868378322848d25ea3... (enode://5cd80bca79f839868378322848d25ea3ae38ab599fa26fcab5b350e57d4e548edf05c4057cb7183a4cd75987cc819259e4b2f37d270dd6ddb76b4b75322a25da@main6n1-ebb4p.settlemint.com:30303?discport=40404): Packet{type=PING, data=PingPacketData{from=Endpoint{host='Optional[18.159.175.159]', udpPort=30303, getTcpPort=30303}, to=Endpoint{host='Optional[3.20.129.233]', udpPort=40404, getTcpPort=30303}, expiration=1707901552, enrSeq=2}, hash=0x9810626e986b844cc2106d610e39ae301a9f9d7665e3006ac7934dd3514a3d4d, signature=Signature{r=72617991622119169117905822478728547836489852853852711286764448393432623770874, s=49677976159185566525663197219905632715322234062224321742987821943653454714732, recId=1}, publicKey=0x9440beacdc22fa80e18f1ca9093c79d7ff520d99b8223e15aab7279a2949794d55bbaa3077172b9f5c8c44eb08394d33b763591b3a6ef48d3cbf70bed31bd333}

The node on cluster A receives this PING, and you can see that it derives a very wrong enode URL for the sender:

>>> Received PING  packet from peer 0x9440beacdc22fa80e18f1ca9093c79d7... (enode://9440beacdc22fa80e18f1ca9093c79d7ff520d99b8223e15aab7279a2949794d55bbaa3077172b9f5c8c44eb08394d33b763591b3a6ef48d3cbf70bed31bd333@3.77.49.232:30303?discport=32283): Packet{type=PING, data=PingPacketData{from=Endpoint{host='Optional[18.159.175.159]', udpPort=30303, getTcpPort=30303}, to=Endpoint{host='Optional[3.20.129.233]', udpPort=40404, getTcpPort=30303}, expiration=1707901552, enrSeq=2}, hash=0x9810626e986b844cc2106d610e39ae301a9f9d7665e3006ac7934dd3514a3d4d, signature=Signature{r=72617991622119169117905822478728547836489852853852711286764448393432623770874, s=49677976159185566525663197219905632715322234062224321742987821943653454714732, recId=1}, publicKey=0x9440beacdc22fa80e18f1ca9093c79d7ff520d99b8223e15aab7279a2949794d55bbaa3077172b9f5c8c44eb08394d33b763591b3a6ef48d3cbf70bed31bd333}

The enode returned from admin_nodeInfo is enode://9440beacdc22fa80e18f1ca9093c79d7ff520d99b8223e15aab7279a2949794d55bbaa3077172b9f5c8c44eb08394d33b763591b3a6ef48d3cbf70bed31bd333@18.159.175.159:30303?discport=40404, so the IP address is wrong, but the discport is wrong as well. The IP address issue is solved by PR #6225; I tested this and it seems to work. The discport, I guess, is the source port of the connection that was set up for the UDP protocol by the node on cluster B. The logs above are still from a test I did on the latest 24.1.2 branch, which is why the IP address is wrong.
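For context, an enode URL only carries a ?discport= query parameter when the discovery (UDP) port differs from the rlpx (TCP) port, which is exactly the setup here. A minimal, self-contained sketch of that formatting rule (not Besu's actual implementation):

```java
// Minimal illustration of enode URL formatting: the ?discport= query
// parameter is only appended when the UDP (discovery) port differs from
// the TCP (rlpx) port.
public class EnodeUri {
    static String build(String nodeId, String host, int tcpPort, int udpPort) {
        String base = "enode://" + nodeId + "@" + host + ":" + tcpPort;
        return udpPort == tcpPort ? base : base + "?discport=" + udpPort;
    }

    public static void main(String[] args) {
        // The shape of the enode the reporter expects peers to learn
        // (node id truncated for readability):
        System.out.println(build("9440beac...", "18.159.175.159", 30303, 40404));
        // prints enode://9440beac...@18.159.175.159:30303?discport=40404
    }
}
```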

In the end, discovery between these two nodes succeeds, at least before PR #6225, because the node on cluster A can talk to the node on cluster B and vice versa. But things go wrong if I set up another node on cluster C with the node on cluster A as a bootnode. It will receive the neighbours of the node on cluster A, which include the node on cluster B, but with the enode URI that was built during the discovery phase. And the node on cluster C cannot connect to the node on cluster B using this enode URI, since it doesn't go through the LoadBalancer.

After PR #6225, the PING/PONG between the node on cluster A and the node on cluster B doesn't succeed either, because the enode is built using the IP address from the PingPacketData, which is the IP address of the LoadBalancer, but with a wrong discport.
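The behaviour I would expect can be sketched as follows (a hypothetical helper, not Besu's actual classes): when the NAT manager has discovered an external port mapping, discovery should advertise that mapped UDP port instead of the locally bound one.

```java
import java.util.Optional;

// Hypothetical sketch (NOT Besu's actual API) of the port selection the
// discovery layer should perform when building the PING's "from" endpoint:
// prefer the NAT-mapped external UDP port when the NAT manager knows one,
// and fall back to the locally bound port otherwise.
public class AdvertisedEndpoint {
    static int effectiveUdpPort(int locallyBoundPort, Optional<Integer> natMappedPort) {
        return natMappedPort.orElse(locallyBoundPort);
    }

    public static void main(String[] args) {
        // With the KubernetesNATManager mapping in place, the advertised
        // udpPort would be 40404 instead of the locally bound 30303.
        System.out.println(effectiveUdpPort(30303, Optional.of(40404)));
    }
}
```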

Acceptance Criteria

Steps to Reproduce (Bug)

  1. Set up a node on cluster A
  2. Set up a node on cluster B
    • with a LoadBalancer that exposes a UDP port for discovery that differs from the TCP port
    • configure the KubernetesNATManager to look at that LoadBalancer
    • configure the enode of the node on cluster A as a bootnode
    • make sure p2p and discovery are enabled
  3. With TRACE logs enabled, you will see the discovery communication between the two nodes.
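For step 2, the LoadBalancer could look roughly like this (a sketch; the service name, selector, and port numbers are placeholders, and the exact port names the KubernetesNATManager matches on should be checked against the Besu documentation):

```yaml
# Hypothetical LoadBalancer Service for the node on cluster B, exposing
# rlpx over TCP 30303 and discovery over a *different* external UDP port
# (40404), both forwarding to the node's local p2p port 30303.
apiVersion: v1
kind: Service
metadata:
  name: besu-node
spec:
  type: LoadBalancer
  selector:
    app: besu-node
  ports:
    - name: rlpx
      protocol: TCP
      port: 30303
      targetPort: 30303
    - name: discovery
      protocol: UDP
      port: 40404
      targetPort: 30303
```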

Related issues:

daanporon commented 8 months ago

I created a proposal to fix this: https://github.com/hyperledger/besu/pull/6578