multipath-tcp / mptcp

⚠️⚠️⚠️ Deprecated 🚫 Out-of-tree Linux Kernel implementation of MultiPath TCP. 👉 Use https://github.com/multipath-tcp/mptcp_net-next repo instead ⚠️⚠️⚠️

Kernel BUG #259

Closed virus-atl closed 6 years ago

virus-atl commented 6 years ago

Our servers occasionally hit a kernel BUG. Tested kernels: 4.9.60.mptcp and 4.14.24.mptcp.

[1018387.875125] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0]
[1018387.876377] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G L 4.9.60.mptcp #5
[1018387.877572] Hardware name: Supermicro SYS-5018R-M/X10SRi-F, BIOS 2.0b 05/02/2017
[1018387.878758] task: ffffffff8180e540 task.stack: ffffffff81800000
[1018387.879981] RSP: 0018:ffff88047f203a90 EFLAGS: 00000246
[1018387.881191] RAX: ffff880468d44900 RBX: ffff88045f968400 RCX: 0000000000000091
[1018387.882426] RDX: ffff880468d44900 RSI: 0000000000000000 RDI: ffff88045f968ac8
[1018387.883657] RBP: ffff88045f968530 R08: ffffc90002222800 R09: 0000000000000000
[1018387.884897] R10: 0000000062300000 R11: ffff88047f203584 R12: ffff88045f968ac8
[1018387.886144] R13: ffff88043f85bac0 R14: ffff88046d313780 R15: ffff88045fbd1500
[1018387.887395] FS: 0000000000000000(0000) GS:ffff88047f200000(0000) knlGS:0000000000000000
[1018387.888730] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1018387.890007] CR2: 00007f2936d35000 CR3: 0000000001807000 CR4: 00000000003406f0
[1018387.891301] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1018387.892588] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1018387.893864] Stack:
[1018387.895138] Call Trace:
[1018387.896398] [] ? skb_rbtree_purge+0x11/0x50
[1018387.897665] [] ? tcp_v4_destroy_sock+0x160/0x2b0
[1018387.898909] [] ? inet_csk_destroy_sock+0x47/0x160
[1018387.900150] [] ? mptcp_check_req_child+0x9c/0x340
[1018387.901372] [] ? tcp_check_req+0x59e/0x5e0
[1018387.902583] [] ? igmp_mcf_seq_next+0x21/0xd0
[1018387.903787] [] ? tcp_v4_inbound_md5_hash+0x68/0x18c
[1018387.904988] [] ? tcp_v4_rcv+0x78a/0xce0
[1018387.906183] [] ? nf_nat_ipv4_fn+0x61/0x210 [nf_nat_ipv4]
[1018387.907386] [] ? ipv4_confirm+0x7b/0xf0 [nf_conntrack_ipv4]
[1018387.908590] [] ? ip_local_deliver_finish+0x97/0x1e0
[1018387.909780] [] ? ip_local_deliver+0x5b/0xd0
[1018387.910944] [] ? ip_rcv_finish+0x390/0x390
[1018387.912099] [] ? ip_rcv+0x264/0x390
[1018387.913244] [] ? find_busiest_group+0x12/0x4a0
[1018387.914404] [] ? inet_del_offload+0x40/0x40
[1018387.915609] [] ? __netif_receive_skb_core+0x2ae/0xa20
[1018387.916768] [] ? inet_gro_receive+0x1fb/0x290
[1018387.917924] [] ? netif_receive_skb_internal+0x1f/0x80
[1018387.919087] [] ? napi_gro_receive+0xbc/0xe0
[1018387.920252] [] ? igb_poll+0x6e0/0xe70 [igb]
[1018387.921412] [] ? irq_exit+0x3c/0xa0
[1018387.922560] [] ? do_IRQ+0x4f/0xd0
[1018387.923710] [] ? net_rx_action+0x221/0x360
[1018387.924860] [] ? __do_softirq+0x106/0x292
[1018387.926011] [] ? irq_exit+0x98/0xa0
[1018387.927158] [] ? do_IRQ+0x4f/0xd0
[1018387.928294] [] ? common_interrupt+0x82/0x82
[1018387.929434] [] ? cpuidle_enter_state+0x113/0x260
[1018387.930574] [] ? cpuidle_enter_state+0xee/0x260
[1018387.931707] [] ? cpu_startup_entry+0x16e/0x250
[1018387.932815] [] ? start_kernel+0x474/0x47c
[1018387.933905] [] ? early_idt_handler_array+0x120/0x120
[1018387.934983] [] ? x86_64_start_kernel+0x145/0x154

cpaasch commented 6 years ago

Thanks for the report - I see what's going wrong. I will submit a bug-fix soon.

virus-atl commented 6 years ago

We tested all kernels. 4.14.x and 4.9.x hit random kernel BUGs (under MPTCP load, typical uptime is 0-3 days). 4.4.100 is stable (currently 15 days of uptime under MPTCP load, no BUGs).

cpaasch commented 6 years ago

Are all of these BUGs of the same type as the one that you reported?

virus-atl commented 6 years ago

No, different.

[407819.286867] BUG: Bad page state in process zabbix_agentd pfn:46c479
[407819.288361] page:ffffea000f7afa78 count:-2 mapcount:0 mapping: (null) index:0x0
[407819.289875] flags: 0x2ffff8000000000()
[407819.291350] raw: 02ffff8000000000 0000000000000000 0000000000000000 fffffffeffffffff
[407819.292877] raw: dead000000000100 dead000000000200 0000000000000000
[407819.294388] page dumped because: nonzero _count
[407819.295893] Modules linked in: xt_tcpudp xt_conntrack iptable_filter netconsole configfs xt_nat openvswitch nf_conntrack_ipv6 nf_nat_ipv6 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_defrag_ipv6 nf_nat nf_conntrack libcrc32c crc32c_generic bonding tun intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp iTCO_wdt kvm_intel iTCO_vendor_support kvm ast irqbypass ttm crct10dif_pclmul crc32_pclmul drm_kms_helper ghash_clmulni_intel pcbc drm snd_pcm snd_timer snd aesni_intel soundcore joydev aes_x86_64 mei_me crypto_simd lpc_ich glue_helper evdev cryptd pcspkr sg mei mfd_core ioatdma shpchp wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad button tcp_veno sunrpc 8021q garp mrp stp llc ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 hid_generic usbhid hid
[407819.307988] sd_mod xhci_pci ehci_pci xhci_hcd ehci_hcd ahci crc32c_intel libahci libata i2c_i801 usbcore scsi_mod igb i2c_algo_bit dca ptp pps_core [last unloaded: netconsole]
[407819.311804] CPU: 2 PID: 24112 Comm: zabbix_agentd Not tainted 4.14.24.mptcp #9
[407819.313732] Hardware name: Supermicro SYS-5018R-M/X10SRi-F, BIOS 2.0b 05/02/2017
[407819.315680] Call Trace:
[407819.317613] dump_stack+0x5c/0x84
[407819.319559] bad_page+0xcb/0x130
[407819.321481] get_page_from_freelist+0xb5c/0xbf0
[407819.323431] ? get_page_from_freelist+0x5/0xbf0
[407819.325322] alloc_pages_nodemask+0xf3/0x200
[407819.327219] pte_alloc_one+0x13/0x40
[407819.329139] pte_alloc+0x1b/0x120
[407819.331008] copy_page_range+0x90a/0xbe0
[407819.332891] ? stack_trace_call+0x2f/0x40
[407819.334757] ? 0xffffffffa0006067
[407819.336605] copy_process.part.39+0xcba/0x1a40
[407819.338466] _do_fork+0xbd/0x360
[407819.340306] ? _do_fork+0x5/0x360
[407819.342106] do_syscall_64+0x68/0x120
[407819.343976] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[407819.345800] RIP: 0033:0x7f85e587838b
[407819.347599] RSP: 002b:00007ffd798a8890 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
[407819.349439] RAX: ffffffffffffffda RBX: 00007ffd798a8890 RCX: 00007f85e587838b
[407819.351285] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
[407819.353143] RBP: 00007ffd798a8920 R08: 00007f85e6bfb740 R09: 00007f85e6bfb740
[407819.354995] R10: 00007f85e6bfba10 R11: 0000000000000246 R12: 0000000000000000
[407819.356859] R13: 0000000000000020 R14: 0000000000000000 R15: 00007ffd798a88b0
[407819.358721] Disabling lock debugging due to kernel taint

virus-atl commented 6 years ago

[66565.169424] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[66565.191279] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.9.87 #1
[66565.193003] Hardware name: Supermicro SYS-5018R-M/X10SRi-F, BIOS 3.0a 02/08/2018
[66565.194747] task: ffff88046ea24f80 task.stack: ffffc90001958000
[66565.177862] Modules linked in: xt_tcpudp xt_conntrack iptable_filter netconsole configfs xt_nat openvswitch nf_conntrack_ipv6 nf_nat_ipv6 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_defrag_ipv6 nf_nat nf_conntrack libcrc32c crc32c_generic bonding tun intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp iTCO_wdt kvm_intel iTCO_vendor_support kvm ast irqbypass ttm crct10dif_pclmul crc32_pclmul drm_kms_helper ghash_clmulni_intel pcbc drm snd_pcm snd_timer snd aesni_intel soundcore joydev aes_x86_64 mei_me crypto_simd lpc_ich glue_helper evdev cryptd pcspkr sg mei mfd_core ioatdma shpchp wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad button tcp_veno sunrpc 8021q garp mrp stp llc ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 hid_generic usbhid hid sd_mod xhci_pci ehci_pci xhci_hcd ehci_hcd ahci crc32c_intel libahci libata i2c_i801 usbcore scsi_mod igb i2c_algo_bit dca ptp pps_core [last unloaded: netconsole]
[66565.225730] Call Trace:
[66565.227591] [] ? tcp_data_queue_ofo+0xf6/0x6c0
[66565.229471] [] ? mptcp_data_ready+0x11ac/0x1760
[66565.231362] [] ? mptcp_data_ready+0x5/0x1760
[66565.233239] [] ? tcp_data_queue+0x3af/0x5e0
[66565.235109] [] ? tcp_rcv_established+0x17b/0x610
[66565.236976] [] ? tcp_rcv_established+0x5/0x610
[66565.238829] [] ? tcp_v4_do_rcv+0x107/0x260
[66565.240674] [] ? tcp_v4_rcv+0x981/0xd50
[66565.242507] [] ? tcp_v4_early_demux+0x150/0x150
[66565.244341] [] ? tcp_v4_rcv+0x5/0xd50
[66565.246156] [] ? ip_local_deliver_finish+0x9f/0x1d0
[66565.247979] [] ? ip_local_deliver+0x5b/0xd0
[66565.249785] [] ? ip_rcv_finish+0x3a0/0x3a0
[66565.251584] [] ? ip_rcv+0x263/0x380
[66565.253352] [] ? inet_del_offload+0x40/0x40
[66565.255106] [] ? netif_receive_skb_core+0x50d/0xa10
[66565.256870] [] ? 0xffffffffa0007067
[66565.258626] [] ? tcp_gro_receive+0x300/0x300
[66565.260396] [] ? recalibrate_cpu_khz+0x10/0x10
[66565.262171] [] ? __netif_receive_skb_core+0x5/0xa10
[66565.263911] [] ? netif_receive_skb_internal+0x1f/0x80
[66565.265610] [] ? napi_gro_flush+0x5/0x70
[66565.267262] [] ? napi_gro_flush+0x50/0x70
[66565.268860] [] ? napi_complete_done+0x5b/0xb0
[66565.270417] [] ? igb_poll+0x8b8/0xeb0 [igb]
[66565.271941] [] ? net_rx_action+0x222/0x350
[66565.273409] [] ? do_softirq+0x10a/0x29e
[66565.274830] [] ? irq_exit+0xae/0xb0
[66565.276218] [] ? do_IRQ+0x4f/0xd0
[66565.277588] [] ? common_interrupt+0x96/0x96
[66565.278969] [] ? cpuidle_enter_state+0xa2/0x2d0
[66565.280327] [] ? cpuidle_enter_state+0x90/0x2d0
[66565.281683] [] ? cpu_startup_entry+0x144/0x230
[66565.283026] [] ? start_secondary+0x15b/0x180

matttbe commented 6 years ago

Hi @virus-atl

This should be fixed with 4f98a712e717 (mptcp_v0.93) and 847d641a8cdd (mptcp_v0.94).

Please re-open this bug report if it is not the case!

Cheers, Matt