open-switch / opx-nas-l2

https://openswitch.net

Traffic drop on OPX VM #21

Open vnam1 opened 6 years ago

vnam1 commented 6 years ago

I have an OPX VM with a bridge ‘br10’ configured:

```
br10      Link encap:Ethernet  HWaddr 52:54:00:13:99:28
          inet addr:172.16.2.1  Bcast:0.0.0.0  Mask:255.255.255.0
          inet6 addr: fe80::5054:ff:fe13:9928/64 Scope:Link
          UP BROADCAST RUNNING PROMISC MULTICAST  MTU:1500  Metric:1
          RX packets:3570 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2631 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:297904 (290.9 KiB)  TX bytes:177878 (173.7 KiB)
```

Untagged port in the bridge: e101-003-0

A server (Ubuntu VM) is connected to the bridge port e101-003-0 on OPX

Server interface config:

```
eth1      Link encap:Ethernet  HWaddr 52:54:00:ca:ef:7c
          inet addr:172.16.2.5  Bcast:172.16.2.255  Mask:255.255.255.0
          inet6 addr: fe80::5054:ff:feca:ef7c/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:3233 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8613 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:358302 (358.3 KB)  TX bytes:871479 (871.4 KB)
```

When I run ping tests from the server to the bridge interface address (src: 172.16.2.5, dest: 172.16.2.1), there is packet loss (note the missing sequence numbers below):

```
root@localhost:~# ping 172.16.2.1
PING 172.16.2.1 (172.16.2.1) 56(84) bytes of data.
64 bytes from 172.16.2.1: icmp_seq=1 ttl=64 time=2.13 ms
64 bytes from 172.16.2.1: icmp_seq=2 ttl=64 time=0.572 ms
64 bytes from 172.16.2.1: icmp_seq=3 ttl=64 time=0.921 ms
64 bytes from 172.16.2.1: icmp_seq=4 ttl=64 time=0.742 ms
64 bytes from 172.16.2.1: icmp_seq=5 ttl=64 time=0.680 ms
64 bytes from 172.16.2.1: icmp_seq=6 ttl=64 time=0.751 ms
64 bytes from 172.16.2.1: icmp_seq=7 ttl=64 time=0.803 ms
64 bytes from 172.16.2.1: icmp_seq=8 ttl=64 time=0.694 ms
64 bytes from 172.16.2.1: icmp_seq=9 ttl=64 time=0.770 ms
64 bytes from 172.16.2.1: icmp_seq=10 ttl=64 time=0.821 ms
64 bytes from 172.16.2.1: icmp_seq=48 ttl=64 time=2005 ms
64 bytes from 172.16.2.1: icmp_seq=47 ttl=64 time=3005 ms
64 bytes from 172.16.2.1: icmp_seq=49 ttl=64 time=1005 ms
64 bytes from 172.16.2.1: icmp_seq=50 ttl=64 time=5.34 ms
64 bytes from 172.16.2.1: icmp_seq=51 ttl=64 time=0.773 ms
64 bytes from 172.16.2.1: icmp_seq=52 ttl=64 time=0.888 ms
64 bytes from 172.16.2.1: icmp_seq=53 ttl=64 time=0.980 ms
64 bytes from 172.16.2.1: icmp_seq=54 ttl=64 time=0.698 ms
64 bytes from 172.16.2.1: icmp_seq=55 ttl=64 time=0.730 ms
64 bytes from 172.16.2.1: icmp_seq=56 ttl=64 time=0.902 ms
64 bytes from 172.16.2.1: icmp_seq=57 ttl=64 time=0.845 ms
64 bytes from 172.16.2.1: icmp_seq=58 ttl=64 time=0.795 ms
64 bytes from 172.16.2.1: icmp_seq=59 ttl=64 time=0.814 ms
64 bytes from 172.16.2.1: icmp_seq=99 ttl=64 time=8.95 ms
64 bytes from 172.16.2.1: icmp_seq=101 ttl=64 time=2.35 ms
64 bytes from 172.16.2.1: icmp_seq=102 ttl=64 time=0.860 ms
64 bytes from 172.16.2.1: icmp_seq=103 ttl=64 time=0.668 ms
```

I see the following syslogs continuously:

```
Jun 25 18:46:16 opx23_vm opx_nas_daemon[2832]: [ev_log_t_SAI_PORT:Switch Id: 0], Port 0x1010000000000 is not a valid logical port
Jun 25 18:46:16 opx23_vm opx_nas_daemon[2832]: [ev_log_t_SAI_PORT:Switch Id: 0], Attr get for port id 0x1010000000000's attr index 0 attr id 8 failed with err -19
Jun 25 18:46:18 opx23_vm opx_nas_daemon[2832]: [ev_log_t_SAI_FDB:Switch Id: 0], FDB Entry not found for MAC: ff:ff:ff:ff:ff:ff vlan/bridge: 0xa
Jun 25 18:46:18 opx23_vm opx_nas_daemon[2832]: [ev_log_t_SAI_NEIGHBOR:Switch Id: 0], Failed to lookup port for Neighbor MAC from L2 FDB, Err: -16.
Jun 25 18:46:18 opx23_vm opx_nas_daemon[2832]: [ROUTE:HAL-RT-NDI], Failed to add : host: 172.16.2.5 mac_addr: 00:00:00:00:00:00, state: 3, port: 48 status:0x1 NPU status:0 unit:0 rif:0x600000000000a action: Drop Err -1845428240
Jun 25 18:46:21 opx23_vm opx_nas_daemon[2832]: [INTERFACE:NAS-COM-INT-GET], Get request handler not present
Jun 25 18:46:21 opx23_vm opx_nas_daemon[2832]: [INTERFACE:NAS-COM-INT-GET], Get request handler not present
Jun 25 18:46:21 opx23_vm opx_nas_daemon[2832]: [INTERFACE:NAS-COM-INT-GET], Get request handler not present
Jun 25 18:46:21 opx23_vm opx_nas_daemon[2832]: [INTERFACE:NAS-COM-INT-GET], Get request handler not present
Jun 25 18:46:21 opx23_vm opx_nas_daemon[2832]: [ev_log_t_SAI_PORT:Switch Id: 0], Port 0x1010000000000 is not a valid logical port
Jun 25 18:46:21 opx23_vm opx_nas_daemon[2832]: [ev_log_t_SAI_PORT:Switch Id: 0], Attr get for port id 0x1010000000000's attr index 0 attr id 8 failed with err -19
Jun 25 18:46:26 opx23_vm opx_nas_daemon[2832]: [INTERFACE:NAS-COM-INT-GET], Get request handler not present
Jun 25 18:46:26 opx23_vm opx_nas_daemon[2832]: [INTERFACE:NAS-COM-INT-GET], Get request handler not present
Jun 25 18:46:26 opx23_vm opx_nas_daemon[2832]: [INTERFACE:NAS-COM-INT-GET], Get request handler not present
Jun 25 18:46:26 opx23_vm opx_nas_daemon[2832]: [INTERFACE:NAS-COM-INT-GET], Get request handler not present
Jun 25 18:46:26 opx23_vm opx_nas_daemon[2832]: [ev_log_t_SAI_PORT:Switch Id: 0], Port 0x1010000000000 is not a valid logical port
Jun 25 18:46:26 opx23_vm opx_nas_daemon[2832]: [ev_log_t_SAI_PORT:Switch Id: 0], Attr get for port id 0x1010000000000's attr index 0 attr id 8 failed with err -19
Jun 25 18:46:31 opx23_vm opx_nas_daemon[2832]: [ev_log_t_SAI_FDB:Switch Id: 0], FDB Entry not found for MAC: ff:ff:ff:ff:ff:ff vlan/bridge: 0xa
Jun 25 18:46:31 opx23_vm opx_nas_daemon[2832]: [ev_log_t_SAI_NEIGHBOR:Switch Id: 0], Failed to lookup port for Neighbor MAC from L2 FDB, Err: -16.
Jun 25 18:46:31 opx23_vm opx_nas_daemon[2832]: [ROUTE:HAL-RT-NDI], Failed to add : host: 172.16.2.5 mac_addr: 00:00:00:00:00:00, state: 3, port: 48 status:0x1 NPU status:0 unit:0 rif:0x600000000000a action: Drop Err -1845428240
```

On the OPX side, the ARP entry for the server interface does not seem to have been populated:

```
root@leaf-2:/var/log# ip neigh
…
172.16.2.5 dev br10  INCOMPLETE
…
```
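For reference, the stale entry can be inspected and cleared with standard iproute2 commands (a sketch; flushing only forces a fresh ARP attempt and does not fix the underlying drop):

```shell
# Inspect neighbor entries on the bridge interface
ip neigh show dev br10

# Remove the INCOMPLETE entry so the kernel re-attempts ARP resolution
ip neigh flush dev br10

# Trigger fresh resolution from the OPX side
ping -c 3 172.16.2.5
```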

GarrickHe commented 6 years ago

@vnam1 I'll take a look. Can you please tell me which version of OPX you're running? Also, please show the output of `opx-show-system-status`.

Thank you, Garrick

vnam1 commented 6 years ago

```
root@leaf-2:/var/log# opx-show-version
OS_NAME="OPX"
OS_VERSION=2.3.0
PLATFORM="S6000-VM"
ARCHITECTURE="x86_64"
INTERNAL_BUILD_ID="OpenSwitch blueprint for Dell 1.0.0"
BUILD_VERSION=0
BUILD_DATE=2018-03-27T22:44:29+0000
INSTALL_DATE=2018-03-30T20:30:49+0000
SYSTEM_UPTIME= 3 hours, 44 minutes
SYSTEM_STATE= degraded
UPGRADED_PACKAGES=no
ALTERED_PACKAGES=no
```

Please ignore vboxadd.service below, since we use the KVM hypervisor.

```
root@leaf-2:/var/log# opx-show-system-status
System State: degraded
Failed Services
    vboxadd-service.service
    vboxadd.service
No Modified Package
root@leaf-2:/var/log#
```

GarrickHe commented 6 years ago

@vnam1 I have a few questions.

  1. Is this issue consistently reproducible on OPX 2.3.0?
  2. You used the cps_config_vlan.py script to create br10, correct? `cps_config_vlan.py --add --id 10 --vlantype 1 --port e101-003-0`
  3. Is this what your topology looks like: [OPX]-----[vswitch]-------[Ubuntu]? If not, how is OPX connected to Ubuntu?

I don't have KVM so I'm trying to reproduce this on ESXi.

vnam1 commented 6 years ago

@GarrickHe Yes, the config/connectivity you mentioned is correct. This is always reproducible.

jeff-yin commented 6 years ago

Related to this? https://github.com/open-switch/opx-sai-vm/issues/15

mikelazar commented 6 years ago

This should have nothing to do with the configuration. I doubt that the logs are relevant either.

It may have something to do with the connectivity from the server to the OPX VM. Is it possible, for instance, that you have two identical IP addresses on the same L2 segment (as the OPX VM and the server)? What is your setup in KVM? How do you interconnect the VM to the server?

Does the problem happen consistently?

Please keep in mind that effectively this is communication between two Linux stacks, on the OPX VM and the Linux server, so there's no obvious reason for lost packets. BTW, if you remove the bridge and manually assign the same IP address to e101-003-0, do you have the same problem?
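The bypass test suggested above could look like the following on the OPX VM (a sketch using standard brctl/iproute2 tools; the addresses and interface names are taken from this thread, and on OPX the bridge may instead need to be removed with the CPS tooling that created it):

```shell
# Take the member port out of the kernel bridge and bring the bridge down
ip link set br10 down
brctl delif br10 e101-003-0

# Assign the bridge's address directly to the front-panel port
ip addr add 172.16.2.1/24 dev e101-003-0
ip link set e101-003-0 up

# Then repeat the ping from the server (172.16.2.5); if the loss disappears,
# the problem lies in the bridge/NAS path rather than in the link itself
```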

I can think of other issues - e.g. running out of CPU power - but I can't tell without more info about your system.

vnam1 commented 6 years ago

Thanks for the pointers. I just want to rule out any issues with my setup. I verified that there are no duplicate IPs on the network.

When I run tcpdump on the OPX bridge interface br10, this is the pattern I see:

```
23:26:29.578075 52:54:00:22:4e:56 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 172.16.2.5 tell 172.16.2.1, length 28
23:26:29.591968 52:54:00:4b:54:33 > 52:54:00:22:4e:56, ethertype IPv4 (0x0800), length 98: 172.16.2.5 > 172.16.2.1: ICMP echo request, id 5853, seq 17, length 64
23:26:30.600000 52:54:00:4b:54:33 > 52:54:00:22:4e:56, ethertype IPv4 (0x0800), length 98: 172.16.2.5 > 172.16.2.1: ICMP echo request, id 5853, seq 18, length 64
23:26:31.582120 52:54:00:22:4e:56 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 172.16.2.5 tell 172.16.2.1, length 28
23:26:31.607822 52:54:00:4b:54:33 > 52:54:00:22:4e:56, ethertype IPv4 (0x0800), length 98: 172.16.2.5 > 172.16.2.1: ICMP echo request, id 5853, seq 19, length 64
23:26:32.615730 52:54:00:4b:54:33 > 52:54:00:22:4e:56, ethertype IPv4 (0x0800), length 98: 172.16.2.5 > 172.16.2.1: ICMP echo request, id 5853, seq 20, length 64
23:26:33.586050 52:54:00:22:4e:56 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 172.16.2.5 tell 172.16.2.1, length 28
23:26:33.623908 52:54:00:4b:54:33 > 52:54:00:22:4e:56, ethertype IPv4 (0x0800), length 98: 172.16.2.5 > 172.16.2.1: ICMP echo request, id 5853, seq 21, length 64
23:26:34.632033 52:54:00:4b:54:33 > 52:54:00:22:4e:56, ethertype IPv4 (0x0800), length 98: 172.16.2.5 > 172.16.2.1: ICMP echo request, id 5853, seq 22, length 64
23:26:35.590112 52:54:00:22:4e:56 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 172.16.2.5 tell 172.16.2.1, length 28
23:26:35.639938 52:54:00:4b:54:33 > 52:54:00:22:4e:56, ethertype IPv4 (0x0800), length 98: 172.16.2.5 > 172.16.2.1: ICMP echo request, id 5853, seq 23, length 64
23:26:36.647880 52:54:00:4b:54:33 > 52:54:00:22:4e:56, ethertype IPv4 (0x0800), length 98: 172.16.2.5 > 172.16.2.1: ICMP echo request, id 5853, seq 24, length 64
23:26:37.594130 52:54:00:22:4e:56 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 172.16.2.5 tell 172.16.2.1, length 28
23:26:37.655908 52:54:00:4b:54:33 > 52:54:00:22:4e:56, ethertype IPv4 (0x0800), length 98: 172.16.2.5 > 172.16.2.1: ICMP echo request, id 5853, seq 25, length 64
23:26:38.663997 52:54:00:4b:54:33 > 52:54:00:22:4e:56, ethertype IPv4 (0x0800), length 98: 172.16.2.5 > 172.16.2.1: ICMP echo request, id 5853, seq 26, length 64
23:26:39.671921 52:54:00:4b:54:33 > 52:54:00:22:4e:56, ethertype IPv4 (0x0800), length 98: 172.16.2.5 > 172.16.2.1: ICMP echo request, id 5853, seq 27, length 64
23:26:40.679859 52:54:00:4b:54:33 > 52:54:00:22:4e:56, ethertype IPv4 (0x0800), length 98: 172.16.2.5 > 172.16.2.1: ICMP echo request, id 5853, seq 28, length 64
23:26:41.674113 52:54:00:22:4e:56 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 172.16.2.5 tell 172.16.2.1, length 28
23:26:41.688014 52:54:00:4b:54:33 > 52:54:00:22:4e:56, ethertype IPv4 (0x0800), length 98: 172.16.2.5 > 172.16.2.1: ICMP echo request, id 5853, seq 29, length 64
23:26:42.695774 52:54:00:4b:54:33 > 52:54:00:22:4e:56, ethertype IPv4 (0x0800), length 98: 172.16.2.5 > 172.16.2.1: ICMP echo request, id 5853, seq 30, length 64
23:26:43.678059 52:54:00:22:4e:56 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 172.16.2.5 tell 172.16.2.1, length 28
23:26:43.703718 52:54:00:4b:54:33 > 52:54:00:22:4e:56, ethertype IPv4 (0x0800), length 98: 172.16.2.5 > 172.16.2.1: ICMP echo request, id 5853, seq 31, length 64
23:26:44.711817 52:54:00:4b:54:33 > 52:54:00:22:4e:56, ethertype IPv4 (0x0800), length 98: 172.16.2.5 > 172.16.2.1: ICMP echo request, id 5853, seq 32, length 64
23:26:45.682109 52:54:00:22:4e:56 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 172.16.2.5 tell 172.16.2.1, length 28
23:26:45.713243 52:54:00:4b:54:33 > 52:54:00:22:4e:56, ethertype IPv4 (0x0800), length 98: 172.16.2.5 > 172.16.2.1: ICMP echo request, id 5853, seq 33, length 64
23:26:46.719978 52:54:00:4b:54:33 > 52:54:00:22:4e:56, ethertype IPv4 (0x0800), length 98: 172.16.2.5 > 172.16.2.1: ICMP echo request, id 5853, seq 34, length 64
23:26:47.686112 52:54:00:22:4e:56 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 172.16.2.5 tell 172.16.2.1, length 28
```

It tries to resolve ARP for the server address 172.16.2.5, but somehow this ARP request is not seen on the bridge member port e101-003-0 that is connected to the server.

tcpdump on e101-003-0:

```
root@leaf-2:/home/admin# tcpdump -nNei e101-003-0
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on e101-003-0, link-type EN10MB (Ethernet), capture size 262144 bytes
23:26:42.695774 52:54:00:4b:54:33 > 52:54:00:22:4e:56, ethertype IPv4 (0x0800), length 98: 172.16.2.5 > 172.16.2.1: ICMP echo request, id 5853, seq 30, length 64
23:26:43.592091 52:54:00:22:4d:f5 > 01:80:c2:00:00:0e, ethertype LLDP (0x88cc), length 282: LLDP, length 268: leaf-2
23:26:43.703718 52:54:00:4b:54:33 > 52:54:00:22:4e:56, ethertype IPv4 (0x0800), length 98: 172.16.2.5 > 172.16.2.1: ICMP echo request, id 5853, seq 31, length 64
23:26:44.711817 52:54:00:4b:54:33 > 52:54:00:22:4e:56, ethertype IPv4 (0x0800), length 98: 172.16.2.5 > 172.16.2.1: ICMP echo request, id 5853, seq 32, length 64
23:26:45.713243 52:54:00:4b:54:33 > 52:54:00:22:4e:56, ethertype IPv4 (0x0800), length 98: 172.16.2.5 > 172.16.2.1: ICMP echo request, id 5853, seq 33, length 64
23:26:46.719978 52:54:00:4b:54:33 > 52:54:00:22:4e:56, ethertype IPv4 (0x0800), length 98: 172.16.2.5 > 172.16.2.1: ICMP echo request, id 5853, seq 34, length 64
23:26:47.727820 52:54:00:4b:54:33 > 52:54:00:22:4e:56, ethertype IPv4 (0x0800), length 98: 172.16.2.5 > 172.16.2.1: ICMP echo request, id 5853, seq 35, length 64
23:26:48.736608 52:54:00:4b:54:33 > 52:54:00:22:4e:56, ethertype IPv4 (0x0800), length 98: 172.16.2.5 > 172.16.2.1: ICMP echo request, id 5853, seq 36, length 64
23:26:49.743652 52:54:00:4b:54:33 > 52:54:00:22:4e:56, ethertype IPv4 (0x0800), length 98: 172.16.2.5 > 172.16.2.1: ICMP echo request, id 5853, seq 37, length 64
23:26:50.751874 52:54:00:4b:54:33 > 52:54:00:22:4e:56, ethertype IPv4 (0x0800), length 98: 172.16.2.5 > 172.16.2.1: ICMP echo request, id 5853, seq 38, length 64
23:26:51.759867 52:54:00:4b:54:33 > 52:54:00:22:4e:56, ethertype IPv4 (0x0800), length 98: 172.16.2.5 > 172.16.2.1: ICMP echo request, id 5853, seq 39, length 64
23:26:52.384835 52:54:00:4b:54:33 > 01:80:c2:00:00:0e, ethertype LLDP (0x88cc), length 217: LLDP, length 203: localhost
23:26:52.767959 52:54:00:4b:54:33 > 52:54:00:22:4e:56, ethertype IPv4 (0x0800), length 98: 172.16.2.5 > 172.16.2.1: ICMP echo request, id 5853, seq 40, length 64
```
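One way to see what the kernel bridge has actually learned, and whether the member port is in the forwarding state, is to inspect the bridge FDB (a sketch using standard Linux bridge tools; the sysfs path is an assumption about the kernel bridge driver layout):

```shell
# MAC addresses learned by the bridge and the port each one maps to
brctl showmacs br10

# Same view via iproute2
bridge fdb show br br10

# STP state of the member port: 3 means forwarding
cat /sys/class/net/br10/brif/e101-003-0/state
```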

vnam1 commented 6 years ago

```
root@leaf-2:/home/admin# brctl show br10
bridge name     bridge id               STP enabled     interfaces
br10            8000.525400224e56       no              e101-003-0
```

vnam1 commented 6 years ago

In KVM, the OPX VM interfaces and the server interfaces have corresponding vnet interfaces on the host, and these are connected to each other via a bridge instance.
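That host-side interconnect could be sketched as follows (vnet0/vnet1 and br-lab are hypothetical names; in practice libvirt creates the tap devices and attaches them to the configured bridge automatically):

```shell
# On the KVM host: one Linux bridge ties the two VMs' taps together
brctl addbr br-lab
brctl addif br-lab vnet0   # tap backing OPX e101-003-0
brctl addif br-lab vnet1   # tap backing the server's eth1
ip link set br-lab up
```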

vnam1 commented 6 years ago

Yes, this problem is seen every time.

vnam1 commented 6 years ago

ARP requests coming from the server to OPX are seen correctly on both e101-003-0 and br10:

On e101-003-0:

```
23:26:54.750065 52:54:00:4b:54:33 > 52:54:00:22:4e:56, ethertype ARP (0x0806), length 60: Request who-has 172.16.2.1 tell 172.16.2.5, length 46
23:26:54.750127 52:54:00:22:4e:56 > 52:54:00:4b:54:33, ethertype IPv4 (0x0800), length 98: 172.16.2.1 > 172.16.2.5: ICMP echo reply, id 5853, seq 39, length 64
23:26:54.750163 52:54:00:22:4e:56 > 52:54:00:4b:54:33, ethertype IPv4 (0x0800), length 98: 172.16.2.1 > 172.16.2.5: ICMP echo reply, id 5853, seq 40, length 64
23:26:54.750169 52:54:00:22:4e:56 > 52:54:00:4b:54:33, ethertype IPv4 (0x0800), length 98: 172.16.2.1 > 172.16.2.5: ICMP echo reply, id 5853, seq 41, length 64
23:26:54.750204 52:54:00:22:4e:56 > 52:54:00:4b:54:33, ethertype ARP (0x0806), length 42: Reply 172.16.2.1 is-at 52:54:00:22:4e:56, length 28
```

On br10:

```
23:26:54.750065 52:54:00:4b:54:33 > 52:54:00:22:4e:56, ethertype ARP (0x0806), length 60: Request who-has 172.16.2.1 tell 172.16.2.5, length 46
23:26:54.750115 52:54:00:22:4e:56 > 52:54:00:4b:54:33, ethertype IPv4 (0x0800), length 98: 172.16.2.1 > 172.16.2.5: ICMP echo reply, id 5853, seq 39, length 64
23:26:54.750158 52:54:00:22:4e:56 > 52:54:00:4b:54:33, ethertype IPv4 (0x0800), length 98: 172.16.2.1 > 172.16.2.5: ICMP echo reply, id 5853, seq 40, length 64
23:26:54.750166 52:54:00:22:4e:56 > 52:54:00:4b:54:33, ethertype IPv4 (0x0800), length 98: 172.16.2.1 > 172.16.2.5: ICMP echo reply, id 5853, seq 41, length 64
23:26:54.750200 52:54:00:22:4e:56 > 52:54:00:4b:54:33, ethertype ARP (0x0806), length 42: Reply 172.16.2.1 is-at 52:54:00:22:4e:56, length 28
```

jeff-yin commented 6 years ago

@GarrickHe please advise if you've been able to repro on ESXi

GarrickHe commented 6 years ago

The issue is not seen on ESXi with OPX 2.3.1. I will have to retest on KVM and with OPX 3.0.

I will retest after the fix for the issue below is finalized: https://github.com/open-switch/opx-nas-l3/issues/24