networkservicemesh / sdk-vpp

Apache License 2.0

Kernel(Memif) + Wireguard is unstable on Calico #527

Open glazychev-art opened 2 years ago

glazychev-art commented 2 years ago

Description

Note: This issue was caught when using Calico VPP.

Sometimes ping doesn't work when we use kernel (or memif) + wireguard interfaces. Running the Kernel2Wireguard2Kernel example is enough to reproduce it.

Additional info:

I've added traces and found that the problem is most often here:

NSC ----> FWD1 ----> FWD2 ----> NSE
NSC -xxx- FWD1 <---- FWD2 <---- NSE

FWD1 information. This trace shows the return path, i.e. a packet arriving from FWD2:

```
vpp# show trace
Packet 3

02:08:36:559914: af-packet-input
  af_packet: hw_if_index 1 next-index 4
    tpacket2_hdr:
      status 0x1 len 158 snaplen 158 mac 66 net 80
      sec 0x62319535 nsec 0x159e4b2 vlan 0 vlan_tpid 0
02:08:36:559925: ethernet-input
  IP4: 02:42:ac:12:00:03 -> 02:42:ac:12:00:04
02:08:36:559931: ip4-input
  UDP: 172.18.0.3 -> 172.18.0.4
    tos 0x00, ttl 63, length 144, checksum 0x2332 dscp CS0 ecn NON_ECN
    fragment id 0x0000
  UDP: 51820 -> 51820
    length 124, checksum 0x0000
02:08:36:559936: cnat-input-ip4
  session not found
  in:host-eth0 out:DELETED
02:08:36:559944: ip4-lookup
  fib 0 dpo-idx 20 flow hash: 0x00000000
  UDP: 172.18.0.3 -> 172.18.0.4
    tos 0x00, ttl 63, length 144, checksum 0x2332 dscp CS0 ecn NON_ECN
    fragment id 0x0000
  UDP: 51820 -> 51820
    length 124, checksum 0x0000
02:08:36:559948: ip4-receive
  UDP: 172.18.0.3 -> 172.18.0.4
    tos 0x00, ttl 63, length 144, checksum 0x2332 dscp CS0 ecn NON_ECN
    fragment id 0x0000
  UDP: 51820 -> 51820
    length 124, checksum 0x0000
02:08:36:559951: ip4-udp-lookup
  UDP: src-port 51820 dst-port 51820
02:08:36:559953: wg4-input
  Wireguard input:
    Type: Data
    Peer: 0
    Length: 84
    Keepalive: false
02:08:36:560288: ip4-input-no-checksum
  ICMP: 172.16.1.100 -> 172.16.1.101
    tos 0x00, ttl 63, length 84, checksum 0x8351 dscp CS0 ecn NON_ECN
    fragment id 0x9d6e
  ICMP echo_reply checksum 0x3e2a id 62
02:08:36:560295: l3xc-input-ip4
  l3xc-index:0 lb-index:48
02:08:36:560300: ip4-rewrite
  tx_sw_if_index 12 dpo-idx 48 : ipv4 via 0.0.0.0 tun5: mtu:8920 next:12 flags:[] flow hash: 0x00000000
  00000000: 450000549d6e00003e018451ac100164ac10016500003e2a003e00113f6a821c
  00000020: 00000000000000000000000000000000000000000000000000000000
02:08:36:560302: interface-12-output-deleted
  tun5
  00000000: 450000549d6e00003e018451ac100164ac10016500003e2a003e00113f6a821c
  00000020: 0000000000000000000000000000000000000000000000000000000000000000
  00000040: 0000000000000000000000000000000000000000552b95de843e0a103afc478b
  00000060: 63032fb80cd7034820f49dff7d10b1144dd7a61d261a9260
02:08:36:560304: error-drop
  rx:wg0
02:08:36:560305: drop
  interface-12-output-deleted: interface is deleted
```
```
vpp# show l3xc
l3xc:[0]: wg0
  path-list:[179] locks:1 flags:shared,no-uRPF, uRPF-list: None
    path:[162] pl-index:179 ip4 weight=1 pref=0 attached-nexthop: oper-flags:resolved,
      172.16.1.101 tun5 (p2p)
    [@0]: ipv4 via 0.0.0.0 tun5: mtu:8920 next:12 flags:[]
  [@4]: ipv4 via 0.0.0.0 tun5: mtu:8920 next:12 flags:[]
l3xc:[1]: wg0
  path-list:[168] locks:1 flags:shared,no-uRPF, uRPF-list: None
    path:[161] pl-index:168 ip6 weight=1 pref=0 attached: oper-flags:resolved,
      tun5
  [@3]: ipv6 via :: tun5: mtu:8920 next:15 flags:[]
l3xc:[2]: tun5
  path-list:[167] locks:1 flags:shared,no-uRPF, uRPF-list: None
    path:[159] pl-index:167 ip4 weight=1 pref=0 attached-nexthop: oper-flags:resolved,
      172.16.1.100 wg0
    [@0]: ipv4 [features] via 172.16.1.100 wg0: mtu:8920 next:8 flags:[features ]
      00000000: 4500000000000000401122c2ac120004ac120003ca6cca6c0000000000000000
      00000020: 000000000000000000000000
      stacked-on entry:55:
        [@2]: ipv4 via 172.18.0.3 host-eth0: mtu:1500 next:5 flags:[features ] 0242ac1200030242ac1200040800
  [@3]: ipv4 [features] via 172.16.1.100 wg0: mtu:8920 next:8 flags:[features ]
      00000000: 4500000000000000401122c2ac120004ac120003ca6cca6c0000000000000000
      00000020: 000000000000000000000000
      stacked-on entry:55:
        [@2]: ipv4 via 172.18.0.3 host-eth0: mtu:1500 next:5 flags:[features ] 0242ac1200030242ac1200040800
l3xc:[3]: tun5
  path-list:[180] locks:1 flags:shared,no-uRPF, uRPF-list: None
    path:[208] pl-index:180 ip6 weight=1 pref=0 attached: oper-flags:resolved,
      wg0
  [@1]: dpo-drop ip6
```
```
vpp# show int
Name        Idx  State  MTU (L3/IP4/IP6/MPLS)  Counter          Count
host-eth0     1  up     1500/0/0/0             rx packets     1387688
                                               rx bytes    1729621362
                                               tx packets      727018
                                               tx bytes     169224852
                                               drops             2435
                                               punt               198
                                               ip4            1384504
                                               ip6               3154
ipip0         3  up     9000/0/0/0             rx packets       24012
                                               rx bytes       7451017
                                               tx packets       25636
                                               tx bytes       9032818
                                               ip4              24012
ipip1         6  up     9000/0/0/0             rx packets       38450
                                               rx bytes       7956542
                                               tx packets       45448
                                               tx bytes       7614480
                                               drops               44
                                               ip4              38450
local0        0  down   0/0/0/0                drops                1
loop0         4  down   9000/0/0/0
loop1        17  down   9000/0/0/0
loop2         7  down   9000/0/0/0
loop3         9  down   9000/0/0/0
tap0          2  up     9216/0/0/0             rx packets      638394
                                               rx bytes     152464785
                                               tx packets     1349584
                                               tx bytes    1718452054
                                               drops               19
                                               ip4             635362
                                               ip6               3013
tun1          5  up     9216/0/0/0             rx packets       25086
                                               rx bytes      10375976
                                               tx packets       26789
                                               tx bytes       3804974
                                               drops               12
                                               ip4              25074
                                               ip6                 12
tun2         18  up     9216/0/0/0             rx packets        1046
                                               rx bytes        230263
                                               tx packets        1092
                                               tx bytes        121961
                                               drops                8
                                               ip4               1038
                                               ip6                  8
tun3         13  up     9216/0/0/0             rx packets        2493
                                               rx bytes        353878
                                               tx packets        2347
                                               tx bytes        464337
                                               drops                8
                                               ip4               2485
                                               ip6                  8
tun4         10  up     9216/0/0/0             rx packets          13
                                               rx bytes           768
                                               drops                9
                                               ip4                  4
                                               ip6                  9
tun5         12  up     8920/8920/8920/8920    rx packets         128
                                               rx bytes         10464
                                               drops                8
                                               ip4                120
                                               ip6                  8
wg0          11  up     8920/8920/8920/8920    tx packets         120
                                               tx bytes         17280
                                               drops              120
                                               ip4                120
```
```
vpp# show node ip4-rewrite
node ip4-rewrite, type internal, state active, index 601
  node function variants:
    Name      Priority  Active  Description
    icl             -1          Intel Ice Lake
    skx             -1          Intel Skylake (server) / Cascade Lake
    hsw             50  yes     Intel Haswell
    default          0          default

  next nodes:
    next-index  node-index  Node                          Vectors
             0         593  ip4-drop                            0
             1         609  ip4-icmp-error                      0
             2         536  ip4-frag                            0
             3         405  gso-ip4                             0
             4         257  cnat-output-ip4               2192208
             5         681  host-eth0-output                    0
             6         683  tap0-output                         0
             7         276  acl-plugin-out-ip4-fa         1440390
             8         381  tunnel-output                       0
             9         687  tun1-output                         0
            10         691  interface-8-output-deleted          0
            11         695  loop3-output                    11654
            12         699  interface-12-output-deleted     19727
            13         703  interface-14-output-deleted         0
            14           9  wg4-output-tun                      0
            15         705  interface-15-output-deleted        32
            16         664  interface-output                    0
            17         709  interface-16-output-deleted         0
            18         717  interface-20-output-deleted         0
            19         693  tun3-output                         0
            20         713  tun2-output                         0

  known previous nodes:
    srv6-as-localsid (47)         srv6-ad-flow-localsid (52)    srv6-ad-localsid (56)
    lisp-tunnel-output (148)      l3xc-input-ip4 (173)          cnat-input-ip4 (259)
    lookup-ip4-src (367)          lookup-ip4-dst-itf (368)      lookup-ip4-dst (369)
    tunnel-output-no-count (380)  tunnel-output (381)           adj-midchain-tx (382)
    sr-localsid-un-perf (415)     sr-localsid-un (416)          sr-localsid (417)
    sr-localsid-d (418)           tcp4-output (462)             ip4-frag (536)
    ip4-punt-redirect (594)       ip4-load-balance (605)        ip4-lookup (606)
    ip4-classify (614)            vxlan4-encap (623)
```

My guess is that many interfaces are created and deleted during the tests (by both NSM and Calico), and at some point the list of interfaces ends up in an inconsistent state (perhaps due to reallocation).
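One way to check this guess against the dumps above is to compare the two outputs mechanically. In `show int`, index 12 belongs to `tun5`, which is up, yet `ip4-rewrite` still lists `interface-12-output-deleted` as a next node with 19727 vectors, and the trace is dropped in exactly that node. The Python sketch below flags every such mismatch. It is only a diagnostic aid, not part of VPP or NSM: the function names (`live_interfaces`, `stale_next_nodes`) are hypothetical, the regular expressions assume the column layout seen in this issue, and the sample strings are short excerpts of the output above.

```python
import re

# Row of the "next nodes" table in `show node <name>`:
# next-index, node-index, node name, vector count.
NEXT_NODE_RE = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\S+)\s+(\d+)\s*$")
# First row of an interface in `show int`: name, index, state.
# Counter continuation rows start with whitespace and are skipped.
INTERFACE_RE = re.compile(r"^(\S+)\s+(\d+)\s+(up|down)\b")
# Name VPP shows for the output node of a deleted interface.
DELETED_RE = re.compile(r"^interface-(\d+)-output-deleted$")


def live_interfaces(show_int):
    """Map sw_if_index -> interface name for every interface in `show int`."""
    result = {}
    for line in show_int.splitlines():
        m = INTERFACE_RE.match(line)
        if m:
            result[int(m.group(2))] = m.group(1)
    return result


def stale_next_nodes(show_node, show_int):
    """Return (index, live interface, stale node name, vectors) for every
    'interface-N-output-deleted' next node whose index N is currently in use."""
    live = live_interfaces(show_int)
    stale = []
    for line in show_node.splitlines():
        m = NEXT_NODE_RE.match(line)
        if not m:
            continue
        d = DELETED_RE.match(m.group(3))
        if d and int(d.group(1)) in live:
            idx = int(d.group(1))
            stale.append((idx, live[idx], m.group(3), int(m.group(4))))
    return stale


# Excerpts of the output posted in this issue (assumed layout).
SHOW_INT = """\
Name        Idx  State  MTU (L3/IP4/IP6/MPLS)  Counter          Count
host-eth0     1  up     1500/0/0/0             rx packets     1387688
tun4         10  up     9216/0/0/0             rx packets          13
tun5         12  up     8920/8920/8920/8920    rx packets         128
                                               rx bytes         10464
wg0          11  up     8920/8920/8920/8920    tx packets         120
"""

SHOW_NODE = """\
  next nodes:
    next-index  node-index  Node                          Vectors
            10         691  interface-8-output-deleted          0
            11         695  loop3-output                    11654
            12         699  interface-12-output-deleted     19727
            15         705  interface-15-output-deleted        32
"""

if __name__ == "__main__":
    for idx, name, node, vectors in stale_next_nodes(SHOW_NODE, SHOW_INT):
        print(f"sw_if_index {idx} is live as {name}, "
              f"but ip4-rewrite still points at {node} ({vectors} vectors)")
```

On these excerpts only index 12 is reported: indices 8 and 15 also have `-output-deleted` nodes, but no live interface currently uses them, so they are not flagged.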

glazychev-art commented 2 years ago

@edwarnicke Do you have any thoughts?

glazychev-art commented 2 years ago

Similar symptoms: https://github.com/networkservicemesh/integration-k8s-kind/actions/runs/2060488106

glazychev-art commented 2 years ago

https://github.com/networkservicemesh/integration-k8s-packet/actions/runs/2081993976

glazychev-art commented 2 years ago

I've tried to reproduce it locally on bare VPP, but without success. To check this issue, it would be very helpful to use the latest VPP revision: it contains several patches for wireguard, ip, and vnet that could affect this problem. For that, we need to wait for an update in Calico-VPP: https://github.com/projectcalico/vpp-dataplane/blob/master/vpplink/binapi/vpp_clone_current.sh#L87. According to our latest communication with the Calico team, they have a VPP upgrade in mind, but given the amount of negotiation involved, they'll probably delay it as long as nothing major breaks.