iqiyi / dpvs

DPVS is a high performance Layer-4 load balancer based on DPDK.
Other
3k stars 723 forks source link

dpvs 启动后CPU使用过高 #55

Closed zeably closed 6 years ago

zeably commented 6 years ago

DPVS 启动后CPU 8核心显示100%,运行5分钟左右,系统宕机,无法访问

Nov 21 21:48:11 localhost dpvs: dpvs_running: Remove a zombie pid file /var/run/dpvs.pid Nov 21 21:48:15 localhost dpvs[2991]: PMD: bnxt_rte_pmd_init() called for (null) Nov 21 21:48:15 localhost dpvs[2991]: EAL: PCI device 0000:01:00.0 on NUMA socket 0 Nov 21 21:48:15 localhost dpvs[2991]: EAL: probe driver: 8086:1521 rte_igb_pmd Nov 21 21:48:15 localhost dpvs[2991]: EAL: PCI device 0000:01:00.1 on NUMA socket 0 Nov 21 21:48:15 localhost dpvs[2991]: EAL: probe driver: 8086:1521 rte_igb_pmd Nov 21 21:48:15 localhost dpvs[2991]: EAL: PCI device 0000:01:00.2 on NUMA socket 0 Nov 21 21:48:15 localhost dpvs[2991]: EAL: probe driver: 8086:1521 rte_igb_pmd Nov 21 21:48:15 localhost dpvs[2991]: EAL: PCI device 0000:01:00.3 on NUMA socket 0 Nov 21 21:48:15 localhost dpvs[2991]: EAL: probe driver: 8086:1521 rte_igb_pmd Nov 21 21:48:15 localhost dpvs[2991]: CFG_FILE: Opening configuration file '/etc/dpvs.conf'. Nov 21 21:48:15 localhost dpvs[2991]: CFG_FILE: log_level = WARNING Nov 21 21:48:15 localhost dpvs[2991]: NETIF: dpdk0:rx_queue_number = 8 Nov 21 21:48:15 localhost dpvs[2991]: NETIF: worker cpu1:dpdk0 rx_queue_id += 0 Nov 21 21:48:15 localhost dpvs[2991]: NETIF: worker cpu1:dpdk0 tx_queue_id += 0 Nov 21 21:48:15 localhost dpvs[2991]: NETIF: worker cpu2:dpdk0 rx_queue_id += 1 Nov 21 21:48:15 localhost dpvs[2991]: NETIF: worker cpu2:dpdk0 tx_queue_id += 1 Nov 21 21:48:15 localhost dpvs[2991]: NETIF: worker cpu3:dpdk0 rx_queue_id += 2 Nov 21 21:48:15 localhost dpvs[2991]: NETIF: worker cpu3:dpdk0 tx_queue_id += 2 Nov 21 21:48:15 localhost dpvs[2991]: NETIF: worker cpu4:dpdk0 rx_queue_id += 3 Nov 21 21:48:15 localhost dpvs[2991]: NETIF: worker cpu4:dpdk0 tx_queue_id += 3 Nov 21 21:48:15 localhost dpvs[2991]: NETIF: worker cpu5:dpdk0 rx_queue_id += 4 Nov 21 21:48:15 localhost dpvs[2991]: NETIF: worker cpu5:dpdk0 tx_queue_id += 4 Nov 21 21:48:15 localhost dpvs[2991]: NETIF: worker cpu6:dpdk0 rx_queue_id += 5 Nov 21 21:48:15 localhost dpvs[2991]: NETIF: worker cpu6:dpdk0 tx_queue_id += 5 Nov 21 21:48:15 localhost dpvs[2991]: NETIF: worker cpu7:dpdk0 rx_queue_id += 6 Nov 21 21:48:15 localhost dpvs[2991]: NETIF: worker cpu7:dpdk0 tx_queue_id += 6 Nov 21 21:48:15 localhost dpvs[2991]: NETIF: worker cpu8:dpdk0 rx_queue_id += 7 Nov 21 21:48:15 localhost dpvs[2991]: NETIF: worker cpu8:dpdk0 tx_queue_id += 7 Nov 21 21:48:16 localhost kernel: KNI: Creating kni...

zeably commented 6 years ago

Nov 21 21:48:15 localhost dpvs[2991]: NETIF: worker cpu8:dpdk0 tx_queue_id += 7 Nov 21 21:48:16 localhost kernel: KNI: Creating kni... Nov 21 21:48:16 localhost kernel: igb_uio 0000:01:00.1: Intel(R) Gigabit Ethernet Network Connection Nov 21 21:48:16 localhost kernel: igb_uio 0000:01:00.1: 0000:01:00.1: (PCIe:5.0GT/s:Width x4) Nov 21 21:48:16 localhost kernel: igb_uio 0000:01:00.1: 0000:01:00.1: MAC: Nov 21 21:48:16 localhost kernel: 24:6e:96:3c:84:35 Nov 21 21:48:16 localhost kernel: igb_uio 0000:01:00.1: 0000:01:00.1: PBA No: G10565-019 Nov 21 21:48:16 localhost kernel: igb_uio 0000:01:00.1: LRO is disabled Nov 21 21:48:16 localhost kernel: igb_uio 0000:01:00.1: Using legacy interrupts. 0 rx queue(s), 0 tx queue(s) Nov 21 21:48:16 localhost NetworkManager[1364]: (dpdk0.kni): failed to find device 7 'dpdk0.kni' with udev Nov 21 21:48:16 localhost NetworkManager[1364]: (dpdk0.kni): new Ethernet device (carrier: OFF, driver: 'igb', ifindex: 7) Nov 21 21:48:16 localhost NetworkManager[1364]: (dpdk0.kni): device state change: unmanaged -> unavailable (reason 'managed') [10 20 2] Nov 21 21:48:16 localhost kernel: IPv6: ADDRCONF(NETDEV_UP): dpdk0.kni: link is not ready Nov 21 21:48:19 localhost NetworkManager[1364]: [1511272099.706242] [platform/nm-linux-platform.c:2878] do_change_link(): platform-linux: do-change-link: failure changing link 7: Unspecific failure (1)

beacer commented 6 years ago

DPDK application always use 100% cpu, because of it's polling model, it's not an issue. Pls disable NetworkManager when running DPVS, or it will config kni undesired, and affect the network.

zeably commented 6 years ago

The problem is still

[root@localhost ~]# systemctl stop NetworkManager.service [root@localhost ~]# systemctl disabled NetworkManager.service
Unknown operation 'disabled'. [root@localhost ~]# systemctl disable NetworkManager.service
Removed symlink /etc/systemd/system/multi-user.target.wants/NetworkManager.service. Removed symlink /etc/systemd/system/dbus-org.freedesktop.NetworkManager.service. Removed symlink /etc/systemd/system/dbus-org.freedesktop.nm-dispatcher.service.

top - 14:18:09 up 15:58, 3 users, load average: 2.60, 0.61, 0.24 Tasks: 700 total, 2 running, 698 sleeping, 0 stopped, 0 zombie %Cpu0 : 18.8 us, 81.2 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu1 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu2 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu3 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu4 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu5 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu6 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu7 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu8 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st

beacer commented 6 years ago

Any DPDK app consumes 100% cpu, I mean cpu 100% is not an issue. If your machine still get down after disable NetworkManager pls show the logs.

zeably commented 6 years ago

Nov 22 14:17:24 localhost kernel: igb_uio: module verification failed: signature and/or required key missing - tainting kernel Nov 22 14:17:24 localhost kernel: igb_uio: Use MSIX interrupt by default Nov 22 14:17:29 localhost kernel: igb 0000:01:00.1: removed PHC on em2 Nov 22 14:17:29 localhost kernel: igb_uio 0000:01:00.1: uio device registered with irq 51 Nov 22 14:17:30 localhost kernel: usb 1-1.4: USB disconnect, device number 67 Nov 22 14:17:31 localhost kernel: usb 1-1.4: new low-speed USB device number 68 using ehci-pci Nov 22 14:17:31 localhost kernel: usb 1-1.4: New USB device found, idVendor=046d, idProduct=c077 Nov 22 14:17:31 localhost kernel: usb 1-1.4: New USB device strings: Mfr=1, Product=2, SerialNumber=0 Nov 22 14:17:31 localhost kernel: usb 1-1.4: Product: USB Optical Mouse Nov 22 14:17:31 localhost kernel: usb 1-1.4: Manufacturer: Logitech Nov 22 14:17:31 localhost kernel: input: Logitech USB Optical Mouse as /devices/pci0000:00/0000:00:1a.0/usb1/1-1/1-1.4/1-1.4:1.0/input/input936 Nov 22 14:17:31 localhost kernel: hid-generic 0003:046D:C077.03A7: input,hidraw0: USB HID v1.11 Mouse [Logitech USB Optical Mouse] on usb-0000:00:1a.0-1.4/input0 Nov 22 14:17:43 localhost dpvs[11507]: PMD: bnxt_rte_pmd_init() called for (null) Nov 22 14:17:43 localhost dpvs[11507]: EAL: PCI device 0000:01:00.0 on NUMA socket 0 Nov 22 14:17:43 localhost dpvs[11507]: EAL: probe driver: 8086:1521 rte_igb_pmd Nov 22 14:17:43 localhost dpvs[11507]: EAL: PCI device 0000:01:00.1 on NUMA socket 0 Nov 22 14:17:43 localhost dpvs[11507]: EAL: probe driver: 8086:1521 rte_igb_pmd Nov 22 14:17:43 localhost dpvs[11507]: EAL: PCI device 0000:01:00.2 on NUMA socket 0 Nov 22 14:17:43 localhost dpvs[11507]: EAL: probe driver: 8086:1521 rte_igb_pmd Nov 22 14:17:43 localhost dpvs[11507]: EAL: PCI device 0000:01:00.3 on NUMA socket 0 Nov 22 14:17:43 localhost dpvs[11507]: EAL: probe driver: 8086:1521 rte_igb_pmd Nov 22 14:17:43 localhost dpvs[11507]: CFG_FILE: Opening configuration file '/etc/dpvs.conf'. Nov 22 14:17:43 localhost dpvs[11507]: CFG_FILE: log_level = DEBUG Nov 22 14:17:43 localhost dpvs[11507]: CFG_FILE: log_file = /var/log/dpvs.log Nov 22 14:17:44 localhost kernel: KNI: Creating kni... Nov 22 14:17:44 localhost kernel: igb_uio 0000:01:00.1: Intel(R) Gigabit Ethernet Network Connection Nov 22 14:17:44 localhost kernel: igb_uio 0000:01:00.1: 0000:01:00.1: (PCIe:5.0GT/s:Width x4) Nov 22 14:17:44 localhost kernel: igb_uio 0000:01:00.1: 0000:01:00.1: MAC: Nov 22 14:17:44 localhost kernel: 24:6e:96:3c:84:35 Nov 22 14:17:44 localhost kernel: igb_uio 0000:01:00.1: 0000:01:00.1: PBA No: G10565-019 Nov 22 14:17:44 localhost kernel: igb_uio 0000:01:00.1: LRO is disabled Nov 22 14:17:44 localhost kernel: igb_uio 0000:01:00.1: Using legacy interrupts. 0 rx queue(s), 0 tx queue(s) Nov 22 14:24:36 localhost rsyslogd: [origin software="rsyslogd" swVersion="7.4.7" x-pid="1363" x-info="http://www.rsyslog.com"] start Nov 22 14:24:30 localhost journal: Runtime journal is using 8.0M (max allowed 4.0G, trying to leave 4.0G free of 62.8G available → current limit 4.0G). Nov 22 14:24:30 localhost kernel: Initializing cgroup subsys cpuset Nov 22 14:24:30 localhost kernel: Initializing cgroup subsys cpu Nov 22 14:24:30 localhost kernel: Initializing cgroup subsys cpuacct Nov 22 14:24:30 localhost kernel: Linux version 3.10.0-327.el7.x86_64 (builder@kbuilder.dev.centos.org) (gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC) ) #1 SMP Thu Nov 19 22:10:57 UTC 201 5 Nov 22 14:24:30 localhost kernel: Command line: BOOT_IMAGE=/vmlinuz-3.10.0-327.el7.x86_64 root=/dev/mapper/centos-root ro rd.lvm.lv=centos/root rd.lvm.lv=centos/swap rhgb quiet LANG=zh_CN.UTF- 8

beacer commented 6 years ago

Didn't see any useful info from the log.

zeably commented 6 years ago

2017.11.23 re-install OS Linux Distribution: CentOS 7.2 Kernel: 3.10.0-327.el7.x86_64 CPU: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz NIC: Intel Corporation I350 Gigabit Network Connection (rev 01) Memory: 128G GCC: gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16)

NetworkManager.service stop selinux disabled

[root@dpvs dpdk-stable-16.07.2]# ./tools/dpdk-devbind.py --st

Network devices using DPDK-compatible driver

0000:01:00.0 'I350 Gigabit Network Connection' drv=igb_uio unused=

Network devices using kernel driver

0000:01:00.1 'I350 Gigabit Network Connection' if=em2 drv=igb unused=igb_uio 0000:01:00.2 'I350 Gigabit Network Connection' if=em3 drv=igb unused=igb_uio Active 0000:01:00.3 'I350 Gigabit Network Connection' if=em4 drv=igb unused=igb_uio

[root@dpvs bin]# ./dpvs EAL: Detected 32 lcore(s) EAL: No free hugepages reported in hugepages-1048576kB EAL: Probing VFIO support... PMD: bnxt_rte_pmd_init() called for (null) EAL: PCI device 0000:01:00.0 on NUMA socket 0 EAL: probe driver: 8086:1521 rte_igb_pmd EAL: PCI device 0000:01:00.1 on NUMA socket 0 EAL: probe driver: 8086:1521 rte_igb_pmd EAL: PCI device 0000:01:00.2 on NUMA socket 0 EAL: probe driver: 8086:1521 rte_igb_pmd EAL: PCI device 0000:01:00.3 on NUMA socket 0 EAL: probe driver: 8086:1521 rte_igb_pmd CFG_FILE: Opening configuration file '/etc/dpvs.conf'. CFG_FILE: log_level = WARNING

tail -f /var/log/messages

Nov 23 13:06:00 dpvs kernel: igb 0000:01:00.0: removed PHC on em1 Nov 23 13:06:00 dpvs kernel: igb_uio 0000:01:00.0: uio device registered with irq 43 Nov 23 13:06:14 dpvs dpvs[1860]: PMD: bnxt_rte_pmd_init() called for (null) Nov 23 13:06:14 dpvs dpvs[1860]: EAL: PCI device 0000:01:00.0 on NUMA socket 0 Nov 23 13:06:14 dpvs dpvs[1860]: EAL: probe driver: 8086:1521 rte_igb_pmd Nov 23 13:06:14 dpvs dpvs[1860]: EAL: PCI device 0000:01:00.1 on NUMA socket 0 Nov 23 13:06:14 dpvs dpvs[1860]: EAL: probe driver: 8086:1521 rte_igb_pmd Nov 23 13:06:14 dpvs dpvs[1860]: EAL: PCI device 0000:01:00.2 on NUMA socket 0 Nov 23 13:06:14 dpvs dpvs[1860]: EAL: probe driver: 8086:1521 rte_igb_pmd Nov 23 13:06:14 dpvs dpvs[1860]: EAL: PCI device 0000:01:00.3 on NUMA socket 0 Nov 23 13:06:14 dpvs dpvs[1860]: EAL: probe driver: 8086:1521 rte_igb_pmd Nov 23 13:06:14 dpvs dpvs[1860]: CFG_FILE: Opening configuration file '/etc/dpvs.conf'. Nov 23 13:06:14 dpvs dpvs[1860]: CFG_FILE: log_level = WARNING Nov 23 13:06:15 dpvs kernel: KNI: Creating kni... Nov 23 13:06:15 dpvs kernel: igb_uio 0000:01:00.0: Intel(R) Gigabit Ethernet Network Connection Nov 23 13:06:15 dpvs kernel: igb_uio 0000:01:00.0: 0000:01:00.0: (PCIe:5.0GT/s:Width x4) Nov 23 13:06:15 dpvs kernel: igb_uio 0000:01:00.0: 0000:01:00.0: MAC: Nov 23 13:06:15 dpvs kernel: 24:6e:96:3c:84:34 Nov 23 13:06:15 dpvs kernel: igb_uio 0000:01:00.0: 0000:01:00.0: PBA No: G10565-019 Nov 23 13:06:15 dpvs kernel: igb_uio 0000:01:00.0: LRO is disabled Nov 23 13:06:15 dpvs kernel: igb_uio 0000:01:00.0: Using legacy interrupts. 0 rx queue(s), 0 tx queue(s) Nov 23 13:06:23 dpvs systemd-logind: New session 2 of user root. Nov 23 13:06:23 dpvs systemd: Started Session 2 of user root. Nov 23 13:06:23 dpvs systemd: Starting Session 2 of user root.

problem1

top - 13:07:46 up 3 min, 3 users, load average: 7.14, 2.36, 0.84 Tasks: 705 total, 2 running, 703 sleeping, 0 stopped, 0 zombie %Cpu0 : 16.2 us, 83.8 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu1 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu2 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu3 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu4 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu5 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu6 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu7 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu8 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st

problem2 1 minute after the system downtime

beacer commented 6 years ago

How about dmesg or /var/log/dmesg. From the log you provided, I cannot found any cue about why system went down (Problem 2). For problem 1, As I just said, CPU 100% is the normal phenomenon, not an issue for DPDK application. If (1+8) cores are configured for DPVS. How many CPU core have you on the machine, pls left at least one core for Kernel. And I suggest to run DPDK examples, like l3fwd, to see if there's any issue.

zeably commented 6 years ago

VIP=192.168.100.100 LIP=192.168.100.200 RS=192.168.100.2 ./dpip addr add ${VIP}/24 dev dpdk0 ./ipvsadm -A -t ${VIP}:80 -s rr ./ipvsadm -a -t ${VIP}:80 -r ${RS} -b ./ipvsadm --add-laddr -z ${LIP} -t 192.168.100.100:80 -F dpdk0

目前已运行30分钟未见异常,持续关注中