Load balancer example tcp connection to application server is not working - Githubissues

networkservicemesh / examples

Network Service Mesh examples repo

Apache License 2.0

Load balancer example tcp connection to application server is not working #90

Open mardim91 opened 4 years ago

mardim91 commented 4 years ago

The nc connection to the application server, made with the "nc 10.2.2.0 5001" command, does not work.

Running tcpdump on the nsm0 interface of the application server, I observe that the source IP of the encapsulated packet is not 10.70.0.0 but some random IP. Something appears to go wrong in the load balancer plugin when it handles TCP connections. ICMP works fine, and its source IP is 10.70.0.0.

Steps to reproduce:

1. Deploy the load-balancer example.
2. Log in to the application server pod and run "tcpdump -i nsm0".
3. Log in to the load balancer pod and run "nc 10.2.2.0 5001".
4. Check the source IPs of the encapsulated packets.
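
For the last step, the inner source address can be pulled out of the tcpdump text mechanically rather than read by eye. The following is only a minimal sketch: it assumes the one-line format tcpdump prints for GRE-encapsulated IPv4, and the names INNER and inner_endpoints are hypothetical helpers, not part of the example repo.

```python
import re

# Matches the inner IPv4 endpoints that tcpdump prints after the GRE header,
# e.g. "... GREv0, length 64: IP 9.74.2.1.37217 > 10.2.2.2.5001: ..."
# (assumed format; adjust if your tcpdump version prints it differently).
INNER = re.compile(
    r"GREv0, length \d+: IP "
    r"(?P<src>\d+\.\d+\.\d+\.\d+)\.(?P<sport>\d+) > "
    r"(?P<dst>\d+\.\d+\.\d+\.\d+)\.(?P<dport>\d+):"
)


def inner_endpoints(line):
    """Return (src, sport, dst, dport) of the encapsulated packet, or None."""
    m = INNER.search(line)
    if m is None:
        return None  # not a GRE-encapsulated IPv4 line
    return m["src"], int(m["sport"]), m["dst"], int(m["dport"])


sample = (
    "08:45:15.856313 IP 10.60.1.1 > 10.60.1.3: GREv0, length 64: "
    "IP 9.74.2.1.37217 > 10.2.2.2.5001: Flags [S], seq 3667427917, "
    "win 64240, length 0"
)
print(inner_endpoints(sample))
```

Feeding each captured line through inner_endpoints and comparing the returned source against the expected address makes a corrupted source easy to spot in a long capture.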

nickolaev commented 4 years ago

Could that be a VPP problem?

edwarnicke commented 4 years ago

@uablrek thoughts?

uablrek commented 4 years ago

Access works when initiated from outside the cluster, i.e. when the k8s-node forwards the traffic. When traffic is initiated from the k8s-node itself, it seems to fail. I can't see how Linux can mess this up, so IMHO the fault must be in NSM (vpp?).

uablrek commented 4 years ago

tcpdump inside an application-server POD

When traffic is initiated from the k8s-node, the src is trashed as described:

$ tcpdump -lni nsm0
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on nsm0, link-type EN10MB (Ethernet), capture size 262144 bytes
08:45:15.856313 IP 10.60.1.1 > 10.60.1.3: GREv0, length 64: IP 9.74.2.1.37217 > 10.2.2.2.5001: Flags [S], seq 3667427917, win 64240, options [mss 1460,sackOK,TS val 1802552007 ecr 0,nop,wscale 7], length 0
08:45:16.868429 IP 10.60.1.1 > 10.60.1.3: GREv0, length 64: IP 5.84.2.1.37217 > 10.2.2.2.5001: Flags [S], seq 3667427917, win 64240, options [mss 1460,sackOK,TS val 1802553021 ecr 0,nop,wscale 7], length 0

But when traffic is initiated from outside the cluster, it works:

$ tcpdump -lni nsm0
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on nsm0, link-type EN10MB (Ethernet), capture size 262144 bytes
08:48:16.305146 IP 10.60.1.1 > 10.60.1.3: GREv0, length 64: IP 192.168.1.201.34401 > 10.2.2.2.5001: Flags [S], seq 1134753072, win 64240, options [mss 1460,sackOK,TS val 3857117458 ecr 0,nop,wscale 6], length 0
08:48:16.305339 IP 10.2.2.2.5001 > 192.168.1.201.34401: Flags [S.], seq 2880819978, ack 1134753073, win 65160, options [mss 1460,sackOK,TS val 3266929533 ecr 3857117458,nop,wscale 7], length 0
08:48:16.329451 IP 10.60.1.1 > 10.60.1.3: GREv0, length 56: IP 192.168.1.201.34401 > 10.2.2.2.5001: Flags [.], ack 1, win 1004, options [nop,nop,TS val 3857117477 ecr 3266929533], length 0
08:48:16.332074 IP 10.2.2.2.5001 > 192.168.1.201.34401: Flags [P.], seq 1:37, ack 1, win 510, options [nop,nop,TS val 3266929559 ecr 3857117477], length 36
08:48:16.332461 IP 10.2.2.2.5001 > 192.168.1.201.34401: Flags [F.], seq 37, ack 1, win 510, options [nop,nop,TS val 3266929560 ecr 3857117477], length 0
08:48:16.349096 IP 10.60.1.1 > 10.60.1.3: GREv0, length 56: IP 192.168.1.201.34401 > 10.2.2.2.5001: Flags [.], ack 37, win 1004, options [nop,nop,TS val 3857117501 ecr 3266929559], length 0
08:48:16.389033 IP 10.60.1.1 > 10.60.1.3: GREv0, length 56: IP 192.168.1.201.34401 > 10.2.2.2.5001: Flags [.], ack 38, win 1004, options [nop,nop,TS val 3857117545 ecr 3266929560], length 0
08:48:17.553320 IP 10.60.1.1 > 10.60.1.3: GREv0, length 56: IP 192.168.1.201.34401 > 10.2.2.2.5001: Flags [F.], seq 1, ack 38, win 1004, options [nop,nop,TS val 3857118706 ecr 3266929560], length 0
08:48:17.553450 IP 10.2.2.2.5001 > 192.168.1.201.34401: Flags [.], ack 2, win 51

uablrek commented 4 years ago

Note that it is the first 16 bits of the src address that are overwritten with garbage. The last 16 bits are OK.
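
One way to check this against the capture is to XOR the two source addresses seen in the failing trace (9.74.2.1 and 5.84.2.1) and look at which half of the 32-bit value changes. This is only a sketch; differing_bits is a hypothetical helper name, and it shows which bits vary between the two packets, not what the correct address should have been.

```python
import ipaddress


def differing_bits(a, b):
    """XOR two IPv4 addresses; return (upper 16 bits, lower 16 bits) of the result."""
    x = int(ipaddress.IPv4Address(a)) ^ int(ipaddress.IPv4Address(b))
    return x >> 16, x & 0xFFFF


# The two inner source addresses from the failing tcpdump trace.
high, low = differing_bits("9.74.2.1", "5.84.2.1")

# A non-zero upper half with a zero lower half means only the first
# 16 bits changed between the original SYN and its retransmission.
print(f"upper 16 bits differ by {high:#06x}, lower 16 bits differ by {low:#06x}")
```

A zero lower half together with a non-zero upper half is consistent with only the first two octets being overwritten from one packet to the next.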

mardim91 commented 4 years ago

From an investigation I did a while ago, I was able to isolate the problem, and my conclusion is that it must be in the load balancer vpp plugin. Everything looks fine until the traffic reaches the tunnel that the load balancer creates inside vpp towards the application server. That is where the traffic gets corrupted. So my best bet is that the bug is in the vpp load balancer plugin, in the way it sets up the tunnel.

uablrek commented 4 years ago

BTW, this problem did not exist when the example was submitted.

uablrek commented 4 years ago

16-bit and very random, a misplaced CRC?

edwarnicke commented 4 years ago

@uablrek Might be good to poke vpp-dev