platinasystems / go

Other
9 stars 68 forks source link

cpu hangs after repeated goes stop/start #46

Open fszyang opened 7 years ago

fszyang commented 7 years ago

After 2+ hours of repeated goes stop/start loops, cpu hangs Need power cycle to recover

tested on commit 73003a9d242d24005d1c6c222221447678a333eb

console shows:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 IP: [\] ixgbevf_alloc_rx_buffers+0x88/0x1f0 [ixgbevf] PGD 0 Oops: 0000 [#1] SMP Modules linked in: xt_nat ipt_MASQUERADE nf_nat_masquerade_ipv4 ixgbevf iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc overlay iptable_raw nls_utf8 nls_cp437 vfat fat kvm_intel kvm uio_pci_dma i2c_i801 i2c_smbus autofs4 dm_mod ixgbe mdio CPU: 7 PID: 8609 Comm: kworker/u16:2 Tainted: G W 4.9.8-platina-mk1-amd64 #2 Hardware name: Intel Camelback Mountain Platina DC/Camelback Mountain Platina DC, BIOS 6a724f1-dirty 03/14/2017 Workqueue: ixgbevf ixgbevf_service_task [ixgbevf] task: ffff88046b0d0000 task.stack: ffffc90006f5c000 RIP: 0010:[\] [\] ixgbevf_alloc_rx_buffers+0x88/0x1f0 [ixgbevf] RSP: 0018:ffffc90006f5fd28 EFLAGS: 00010287 RAX: 0000000000000200 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 00000000000001ff RDI: ffff88044e1f9000 RBP: ffffc90006f5fd60 R08: 0000000000000000 R09: 0000000000100000 R10: 00000001006cecce R11: 0000000000000001 R12: 0000000000000000 R13: 00000000fffffe00 R14: 00000000000001ff R15: ffff88044e1f9000 FS: 0000000000000000(0000) GS:ffff88047fdc0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000008 CR3: 000000007ea07000 CR4: 00000000003406e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Stack: 0000000000000000 ffff880460919b28 ffff880460919840 0000000000000009 ffff880460919b28 ffff88044e1f9000 0000000000001028 ffffc90006f5fdc0 ffffffffa060b1d3 ffff880400001028 0000000000000000 ffff880460919ab0 Call Trace: [\] ixgbevf_configure+0x5a3/0x710 [ixgbevf] [\] ixgbevf_reinit_locked+0x38/0x70 [ixgbevf] [\] ixgbevf_service_task+0x1a6/0x330 [ixgbevf] [\] process_one_work+0x12a/0x330 [\] worker_thread+0x61/0x490 [\] ? process_one_work+0x330/0x330 [\] kthread+0xd7/0xf0 [\] ? kthread_park+0x60/0x60 [\] ret_from_fork+0x22/0x30 Code: 49 83 c4 10 48 83 c3 18 49 89 44 24 f0 41 83 c5 01 0f 84 14 01 00 00 66 41 83 ee 01 49 c7 44 24 08 00 00 00 00 0f 84 c8 00 00 00 \<48> 83 7b 08 00 75 c5 31 c9 31 f6 48 c7 c2 00 0d 87 81 bf 20 63 RIP [\] ixgbevf_alloc_rx_buffers+0x88/0x1f0 [ixgbevf] RSP \ CR2: 0000000000000008 ---[ end trace 41feb073b5b2881a ]--- BUG: unable to handle kernel NULL pointer dereference at 0000000000000009 IP: [\] wake_up_common+0x26/0x80 PGD 0 Oops: 0000 [#2] SMP Modules linked in: xt_nat ipt_MASQUERADE nf_nat_masquerade_ipv4 ixgbevf iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc overlay iptable_raw nls_utf8 nls_cp437 vfat fat kvm_intel kvm uio_pci_dma i2c_i801 i2c_smbus autofs4 dm_mod ixgbe mdio CPU: 7 PID: 8609 Comm: kworker/u16:2 Tainted: G D W 4.9.8-platina-mk1-amd64 #2 Hardware name: Intel Camelback Mountain Platina DC/Camelback Mountain Platina DC, BIOS 6a724f1-dirty 03/14/2017 task: ffff88046b0d0000 task.stack: ffffc90006f5c000 RIP: 0010:[\] [\] wake_up_common+0x26/0x80 RSP: 0018:ffffc90006f5fe50 EFLAGS: 00010086 RAX: 0000000000000082 RBX: ffffc90006f5ff18 RCX: 0000000000000000 RDX: 0000000000000009 RSI: 0000000000000003 RDI: ffffc90006f5ff18 RBP: ffffc90006f5fe88 R08: 0000000000000000 R09: 0000000000000000 R10: 000000000007e952 R11: 000000000007e952 R12: ffffc90006f5ff20 R13: 0000000000000082 R14: 0000000000000000 R15: 0000000000000003 FS: 0000000000000000(0000) GS:ffff88047fdc0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000028 CR3: 000000007ea07000 CR4: 00000000003406e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Stack: 0000000100000046 0000000000000000 ffffc90006f5ff18 ffffc90006f5ff10 0000000000000082 0000000000000000 0000000000000000 ffffc90006f5fe98 ffffffff810b0e6e ffffc90006f5fec0 ffffffff810b17e2 0000000000000000 Call Trace: [\] wake_up_locked+0xe/0x10 [\] complete+0x32/0x50 [\] mm_release+0x9e/0x100 [\] do_exit+0x284/0xab0 [\] rewind_stack_do_exit+0x17/0x19 Code: 00 00 00 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 41 89 f7 53 4c 8d 67 08 41 89 ce 48 83 ec 10 89 55 cc 48 8b 57 08 4c 89 45 d0 \<48> 8b 32 49 39 d4 48 8d 42 e8 4c 8d 6e e8 75 05 eb 38 49 89 d5 RIP [\] wake_up_common+0x26/0x80 RSP \ CR2: 0000000000000009 ---[ end trace 41feb073b5b2881b ]--- Fixing recursive fault but reboot is needed! BUG: unable to handle kernel paging request at ffffffffffffffd8 IP: [\] kthread_data+0xb/0x20 PGD 7ea08067 PUD 7ea0a067 PMD 0 Oops: 0000 [#3] SMP Modules linked in: xt_nat ipt_MASQUERADE nf_nat_masquerade_ipv4 ixgbevf iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc overlay iptable_raw nls_utf8 nls_cp437 vfat fat kvm_intel kvm uio_pci_dma i2c_i801 i2c_smbus autofs4 dm_mod ixgbe mdio CPU: 7 PID: 8609 Comm: kworker/u16:2 Tainted: G D W 4.9.8-platina-mk1-amd64 #2 Hardware name: Intel Camelback Mountain Platina DC/Camelback Mountain Platina DC, BIOS 6a724f1-dirty 03/14/2017 task: ffff88046b0d0000 task.stack: ffffc90006f5c000 RIP: 0010:[\] [\] kthread_data+0xb/0x20 RSP: 0018:ffffc90006f5fe78 EFLAGS: 00010002 RAX: 0000000000000000 RBX: 0000000000015300 RCX: 0000000000000007 RDX: ffff88046d805000 RSI: ffff88046b0d0000 RDI: ffff88046b0d0000 RBP: ffffc90006f5fe78 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000001 R11: 00000000000001da R12: ffff88047fdd5300 R13: ffff88046b0d0000 R14: ffff88046b0d04b0 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88047fdc0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000028 CR3: 000000007ea07000 CR4: 00000000003406e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Stack: ffffc90006f5fe88 ffffffff81092a79 ffffc90006f5fec8 ffffffff814c328e ffffc90006f5fef0 ffff88046b0d0000 0000000000000009 0000000000000046 ffff88046b0d0000 0000000000000000 ffffc90006f5fee0 ffffffff814c3571 Call Trace: [\] wq_worker_sleeping+0x9/0x80 [\] schedule+0x2ce/0x580 [\] schedule+0x31/0x80 [\] do_exit+0x884/0xab0 [\] rewind_stack_do_exit+0x17/0x19 Code: be 7d 04 00 00 48 c7 c7 b8 75 6d 81 e8 af 5c fe ff eb ca 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 48 8b 87 50 04 00 00 55 48 89 e5 \<48> 8b 40 d8 5d c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 RIP [\] kthread_data+0xb/0x20 RSP \ CR2: ffffffffffffffd8 ---[ end trace 41feb073b5b2881c ]--- Fixing recursive fault but reboot is needed! INFO: rcu_sched detected stalls on CPUs/tasks: 7-...: (2 GPs behind) idle=b43/140000000000000/0 softirq=7775986/7775986 fqs=794 (detected by 4, t=5255 jiffies, g=985507, c=985506, q=549) Task dump for CPU 7: kworker/u16:2 D 0 8609 2 0x00000008 ffffc90006f5fc58 ffffffff8107d522 ffffc90006f5fc80 ffffc90006f5fc70 ffffffff812653ba ffffc90006f5fc80 ffffffff812653ea ffffc90006f5fca8 ffffffff812f6842 ffffffff8199e200 000000000000000a ffffffff81968c2d Call Trace: [\] ? oops_exit+0x22/0x30 [\] ? delay+0xa/0x10 [\] ? const_udelay+0x2a/0x30 [\] wait_for_xmitr+0x32/0x90 [\] ? delay+0xa/0x10 [\] ? const_udelay+0x2a/0x30 [\] ? wait_for_xmitr+0x32/0x90 [\] ? serial8250_console_write+0x204/0x250 [\] ? bad_area_nosemaphore+0xf/0x20 [\] ? univ8250_console_write+0x19/0x20 [\] ? irq_work_queue+0xf/0x70 [\] ? wake_up_klogd+0x2f/0x40 [\] ? console_unlock+0x51f/0x550 [\] ? vprintk_emit+0x273/0x430 [\] ? _raw_spin_lock+0x1b/0x20 [\] ? schedule+0x6b/0x580 [\] ? schedule+0x31/0x80 [\] ? do_exit+0x884/0xab0 [\] ? rewind_stack_do_exit+0x17/0x19 rcu_sched kthread starved for 3604 jiffies! g985507 c985506 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1 rcu_sched S 0 7 2 0x00000000 0000000000000000 ffff88046182a580 ffff88046d8f4800 ffff88047fc55300 ffff88046b1e3c00 ffffc9000003bdb8 ffffffff814c314e 0000000000000282 ffff88046d8f4800 ffff88047fc4eb80 ffff88047fc4eb80 ffffc9000003bdf0 Call Trace: [\] ? schedule+0x18e/0x580 [\] schedule+0x31/0x80 [\] schedule_timeout+0xf9/0x180 [\] ? del_timer_sync+0x50/0x50 [\] rcu_gp_kthread+0x3be/0x7d0 [\] ? force_qs_rnp+0x180/0x180 [\] kthread+0xd7/0xf0 [\] ? kthread_park+0x60/0x60 [\] ? kthread_park+0x60/0x60 [\] ret_from_fork+0x22/0x30 INFO: rcu_sched detected stalls on CPUs/tasks: 7-...: (2 GPs behind) idle=b43/140000000000000/0 softirq=7775986/7775986 fqs=5824 (detected by 0, t=21007 jiffies, g=985507, c=985506, q=568) Task dump for CPU 7: kworker/u16:2 D 0 8609 2 0x00000008 ffffc90006f5fc58 ffffffff8107d522 ffffc90006f5fc80 ffffc90006f5fc70 ffffffff812653ba ffffc90006f5fc80 ffffffff812653ea ffffc90006f5fca8 ffffffff812f6842 ffffffff8199e200 000000000000000a ffffffff81968c2d Call Trace: [\] ? oops_exit+0x22/0x30 [\] ? delay+0xa/0x10 [\] ? const_udelay+0x2a/0x30 [\] wait_for_xmitr+0x32/0x90 [\] ? delay+0xa/0x10 [\] ? const_udelay+0x2a/0x30 [\] ? wait_for_xmitr+0x32/0x90 [\] ? serial8250_console_write+0x204/0x250 [\] ? bad_area_nosemaphore+0xf/0x20 [\] ? univ8250_console_write+0x19/0x20 [\] ? irq_work_queue+0xf/0x70 [\] ? wake_up_klogd+0x2f/0x40 [\] ? console_unlock+0x51f/0x550 [\] ? vprintk_emit+0x273/0x430 [\] ? _raw_spin_lock+0x1b/0x20 [\] ? schedule+0x6b/0x580 [\] ? schedule+0x31/0x80 [\] ? do_exit+0x884/0xab0 [\] ? rewind_stack_do_exit+0x17/0x19 rcu_sched kthread starved for 5402 jiffies! g985507 c985506 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1 rcu_sched S 0 7 2 0x00000000 0000000000000000 ffff88046182a580 ffff88046d8f4800 ffff88047fd15300 ffff88046d96a400 ffffc9000003bdb8 ffffffff814c314e 0000000000000282 ffff88046d8f4800 ffff88047fd0eb80 ffff88047fd0eb80 ffffc9000003bdf0 Call Trace: [\] ? schedule+0x18e/0x580 [\] schedule+0x31/0x80 [\] schedule_timeout+0xf9/0x180 [\] ? del_timer_sync+0x50/0x50 [\] rcu_gp_kthread+0x3be/0x7d0 [\] ? force_qs_rnp+0x180/0x180 [\] kthread+0xd7/0xf0 [\] ? kthread_park+0x60/0x60 [\] ? kthread_park+0x60/0x60 [\] ret_from_fork+0x22/0x30 INFO: rcu_sched detected stalls on CPUs/tasks: 7-...: (2 GPs behind) idle=b43/140000000000000/0 softirq=7775986/7775986 fqs=13674 (detected by 0, t=36762 jiffies, g=985507, c=985506, q=570) Task dump for CPU 7: kworker/u16:2 D 0 8609 2 0x00000008 ffffc90006f5fc58 ffffffff8107d522 ffffc90006f5fc80 ffffc90006f5fc70 ffffffff812653ba ffffc90006f5fc80 ffffffff812653ea ffffc90006f5fca8 ffffffff812f6842 ffffffff8199e200 000000000000000a ffffffff81968c2d Call Trace: [\] ? oops_exit+0x22/0x30 [\] ? delay+0xa/0x10 [\] ? const_udelay+0x2a/0x30 [\] wait_for_xmitr+0x32/0x90 [\] ? delay+0xa/0x10 [\] ? const_udelay+0x2a/0x30 [\] ? wait_for_xmitr+0x32/0x90 [\] ? serial8250_console_write+0x204/0x250 [\] ? bad_area_nosemaphore+0xf/0x20 [\] ? univ8250_console_write+0x19/0x20 [\] ? irq_work_queue+0xf/0x70 [\] ? wake_up_klogd+0x2f/0x40 [\] ? console_unlock+0x51f/0x550 [\] ? vprintk_emit+0x273/0x430 [\] ? _raw_spin_lock+0x1b/0x20 [\] ? schedule+0x6b/0x580 [\] ? schedule+0x31/0x80 [\] ? do_exit+0x884/0xab0 [\] ? rewind_stack_do_exit+0x17/0x19 INFO: rcu_sched detected stalls on CPUs/tasks: 7-...: (2 GPs behind) idle=b43/140000000000000/0 softirq=7775986/7775986 fqs=21535 (detected by 0, t=52517 jiffies, g=985507, c=985506, q=574) Task dump for CPU 7: kworker/u16:2 D 0 8609 2 0x00000008 ffffc90006f5fc58 ffffffff8107d522 ffffc90006f5fc80 ffffc90006f5fc70 ffffffff812653ba ffffc90006f5fc80 ffffffff812653ea ffffc90006f5fca8 ffffffff812f6842 ffffffff8199e200 000000000000000a ffffffff81968c2d Call Trace: [\] ? oops_exit+0x22/0x30 [\] ? delay+0xa/0x10 [\] ? const_udelay+0x2a/0x30 [\] wait_for_xmitr+0x32/0x90 [\] ? delay+0xa/0x10 [\] ? const_udelay+0x2a/0x30 [\] ? wait_for_xmitr+0x32/0x90 [\] ? serial8250_console_write+0x204/0x250 [\] ? bad_area_nosemaphore+0xf/0x20 [\] ? univ8250_console_write+0x19/0x20 [\] ? irq_work_queue+0xf/0x70 [\] ? wake_up_klogd+0x2f/0x40 [\] ? console_unlock+0x51f/0x550 [\] ? vprintk_emit+0x273/0x430 [\] ? _raw_spin_lock+0x1b/0x20 [\] ? schedule+0x6b/0x580 [\] ? schedule+0x31/0x80 [\] ? do_exit+0x884/0xab0 [\] ? rewind_stack_do_exit+0x17/0x19 INFO: rcu_sched detected stalls on CPUs/tasks: 7-...: (2 GPs behind) idle=b43/140000000000000/0 softirq=7775986/7775986 fqs=29396 (detected by 0, t=68272 jiffies, g=985507, c=985506, q=576) Task dump for CPU 7: kworker/u16:2 D 0 8609 2 0x00000008 ffffc90006f5fc58 ffffffff8107d522 ffffc90006f5fc80 ffffc90006f5fc70 ffffffff812653ba ffffc90006f5fc80 ffffffff812653ea ffffc90006f5fca8 ffffffff812f6842 ffffffff8199e200 000000000000000a ffffffff81968c2d Call Trace: [\] ? oops_exit+0x22/0x30 [\] ? delay+0xa/0x10 [\] ? const_udelay+0x2a/0x30 [\] wait_for_xmitr+0x32/0x90 [\] ? delay+0xa/0x10 [\] ? const_udelay+0x2a/0x30 [\] ? wait_for_xmitr+0x32/0x90 [\] ? serial8250_console_write+0x204/0x250 [\] ? bad_area_nosemaphore+0xf/0x20 [\] ? univ8250_console_write+0x19/0x20 [\] ? irq_work_queue+0xf/0x70 [\] ? wake_up_klogd+0x2f/0x40 [\] ? console_unlock+0x51f/0x550 [\] ? vprintk_emit+0x273/0x430 [\] ? _raw_spin_lock+0x1b/0x20 [\] ? schedule+0x6b/0x580 [\] ? schedule+0x31/0x80 [\] ? do_exit+0x884/0xab0 [\] ? rewind_stack_do_exit+0x17/0x19 INFO: rcu_sched detected stalls on CPUs/tasks: 7-...: (2 GPs behind) idle=b43/140000000000000/0 softirq=7775986/7775986 fqs=37257 (detected by 0, t=84027 jiffies, g=985507, c=985506, q=580) Task dump for CPU 7: kworker/u16:2 D 0 8609 2 0x00000008 ffffc90006f5fc58 ffffffff8107d522 ffffc90006f5fc80 ffffc90006f5fc70 ffffffff812653ba ffffc90006f5fc80 ffffffff812653ea ffffc90006f5fca8 ffffffff812f6842 ffffffff8199e200 000000000000000a ffffffff81968c2d Call Trace: [\] ? oops_exit+0x22/0x30 [\] ? delay+0xa/0x10 [\] ? __const_udelay+0x2a/0x30 [\] wait_for_xmitr+0x32/0x90 [\] ? delay+0xa/0x10 [\] ? const_udelay+0x2a/0x30 [\] ? wait_for_xmitr+0x32/0x90 [\] ? serial8250_console_write+0x204/0x250 [\] ? bad_area_nosemaphore+0xf/0x20 [\] ? univ8250_console_write+0x19/0x20 [\] ? irq_work_queue+0xf/0x70 [\] ? wake_up_klogd+0x2f/0x40 [\] ? console_unlock+0x51f/0x550 [\] ? vprintk_emit+0x273/0x430 [\] ? _raw_spin_lock+0x1b/0x20 [\] ? schedule+0x6b/0x580 [\] ? schedule+0x31/0x80 [\] ? do_exit+0x884/0xab0 [\] ? rewind_stack_do_exit+0x17/0x19 INFO: rcu_sched detected stalls on CPUs/tasks: 7-...: (2 GPs behind) idle=b43/140000000000000/0 softirq=7775986/7775986 fqs=45118 (detected by 0, t=99782 jiffies, g=985507, c=985506, q=582) Task dump for CPU 7: kworker/u16:2 D 0 8609 2 0x00000008 ffffc90006f5fc58 ffffffff8107d522 ffffc90006f5fc80 ffffc90006f5fc70 ffffffff812653ba ffffc90006f5fc80 ffffffff812653ea ffffc90006f5fca8 ffffffff812f6842 ffffffff8199e200 000000000000000a ffffffff81968c2d Call Trace: [\] ? oops_exit+0x22/0x30 [\] ? delay+0xa/0x10 [\] ? const_udelay+0x2a/0x30 [\] wait_for_xmitr+0x32/0x90 [\] ? delay+0xa/0x10 [\] ? const_udelay+0x2a/0x30 [\] ? wait_for_xmitr+0x32/0x90 [\] ? serial8250_console_write+0x204/0x250 [\] ? bad_area_nosemaphore+0xf/0x20 [\] ? univ8250_console_write+0x19/0x20 [\] ? irq_work_queue+0xf/0x70 [\] ? wake_up_klogd+0x2f/0x40 [\] ? console_unlock+0x51f/0x550 [\] ? vprintk_emit+0x273/0x430 [\] ? _raw_spin_lock+0x1b/0x20 [\] ? schedule+0x6b/0x580 [\] ? schedule+0x31/0x80 [\] ? do_exit+0x884/0xab0 [\] ? rewind_stack_do_exit+0x17/0x19 INFO: rcu_sched detected stalls on CPUs/tasks: 7-...: (2 GPs behind) idle=b43/140000000000000/0 softirq=7775986/7775986 fqs=52979 (detected by 0, t=115537 jiffies, g=985507, c=985506, q=586) Task dump for CPU 7: kworker/u16:2 D 0 8609 2 0x00000008 ffffc90006f5fc58 ffffffff8107d522 ffffc90006f5fc80 ffffc90006f5fc70 ffffffff812653ba ffffc90006f5fc80 ffffffff812653ea ffffc90006f5fca8 ffffffff812f6842 ffffffff8199e200 000000000000000a ffffffff81968c2d Call Trace: [\] ? oops_exit+0x22/0x30 [\] ? delay+0xa/0x10 [\] ? const_udelay+0x2a/0x30 [\] wait_for_xmitr+0x32/0x90 [\] ? delay+0xa/0x10 [\] ? const_udelay+0x2a/0x30 [\] ? wait_for_xmitr+0x32/0x90 [\] ? serial8250_console_write+0x204/0x250 [\] ? bad_area_nosemaphore+0xf/0x20 [\] ? univ8250_console_write+0x19/0x20 [\] ? irq_work_queue+0xf/0x70 [\] ? wake_up_klogd+0x2f/0x40 [\] ? console_unlock+0x51f/0x550 [\] ? vprintk_emit+0x273/0x430 [\] ? _raw_spin_lock+0x1b/0x20 [\] ? schedule+0x6b/0x580 [\] ? schedule+0x31/0x80 [\] ? do_exit+0x884/0xab0 [\] ? rewind_stack_do_exit+0x17/0x19 INFO: rcu_sched detected stalls on CPUs/tasks: 7-...: (2 GPs behind) idle=b43/140000000000000/0 softirq=7775986/7775986 fqs=60840 (detected by 0, t=131292 jiffies, g=985507, c=985506, q=588) Task dump for CPU 7: kworker/u16:2 D 0 8609 2 0x00000008 ffffc90006f5fc58 ffffffff8107d522 ffffc90006f5fc80 ffffc90006f5fc70 ffffffff812653ba ffffc90006f5fc80 ffffffff812653ea ffffc90006f5fca8 ffffffff812f6842 ffffffff8199e200 000000000000000a ffffffff81968c2d Call Trace: [\] ? oops_exit+0x22/0x30 [\] ? delay+0xa/0x10 [\] ? __const_udelay+0x2a/0x30 [\] wait_for_xmitr+0x32/0x90 [\] ? delay+0xa/0x10 [\] ? const_udelay+0x2a/0x30 [\] ? wait_for_xmitr+0x32/0x90 [\] ? serial8250_console_write+0x204/0x250 [\] ? bad_area_nosemaphore+0xf/0x20 [\] ? univ8250_console_write+0x19/0x20 [\] ? irq_work_queue+0xf/0x70 [\] ? wake_up_klogd+0x2f/0x40 [\] ? console_unlock+0x51f/0x550 [\] ? vprintk_emit+0x273/0x430 [\] ? _raw_spin_lock+0x1b/0x20 [\] ? schedule+0x6b/0x580 [\] ? schedule+0x31/0x80 [\] ? do_exit+0x884/0xab0 [\] ? rewind_stack_do_exit+0x17/0x19 INFO: rcu_sched detected stalls on CPUs/tasks: 7-...: (2 GPs behind) idle=b43/140000000000000/0 softirq=7775986/7775986 fqs=68701 (detected by 0, t=147047 jiffies, g=985507, c=985506, q=588) Task dump for CPU 7: kworker/u16:2 D 0 8609 2 0x00000008 ffffc90006f5fc58 ffffffff8107d522 ffffc90006f5fc80 ffffc90006f5fc70 ffffffff812653ba ffffc90006f5fc80 ffffffff812653ea ffffc90006f5fca8 ffffffff812f6842 ffffffff8199e200 000000000000000a ffffffff81968c2d Call Trace: [\] ? oops_exit+0x22/0x30 [\] ? delay+0xa/0x10 [\] ? const_udelay+0x2a/0x30 [\] wait_for_xmitr+0x32/0x90 [\] ? delay+0xa/0x10 [\] ? const_udelay+0x2a/0x30 [\] ? wait_for_xmitr+0x32/0x90 [\] ? serial8250_console_write+0x204/0x250 [\] ? bad_area_nosemaphore+0xf/0x20 [\] ? univ8250_console_write+0x19/0x20 [\] ? irq_work_queue+0xf/0x70 [\] ? wake_up_klogd+0x2f/0x40 [\] ? console_unlock+0x51f/0x550 [\] ? vprintk_emit+0x273/0x430 [\] ? _raw_spin_lock+0x1b/0x20 [\] ? schedule+0x6b/0x580 [\] ? schedule+0x31/0x80 [\] ? do_exit+0x884/0xab0 [\] ? rewind_stack_do_exit+0x17/0x19 INFO: rcu_sched detected stalls on CPUs/tasks: 7-...: (2 GPs behind) idle=b43/140000000000000/0 softirq=7775986/7775986 fqs=76562 (detected by 0, t=162802 jiffies, g=985507, c=985506, q=588) Task dump for CPU 7: kworker/u16:2 D 0 8609 2 0x00000008 ffffc90006f5fc58 ffffffff8107d522 ffffc90006f5fc80 ffffc90006f5fc70 ffffffff812653ba ffffc90006f5fc80 ffffffff812653ea ffffc90006f5fca8 ffffffff812f6842 ffffffff8199e200 000000000000000a ffffffff81968c2d Call Trace: [\] ? oops_exit+0x22/0x30 [\] ? delay+0xa/0x10 [\] ? const_udelay+0x2a/0x30 [\] wait_for_xmitr+0x32/0x90 [\] ? delay+0xa/0x10 [\] ? const_udelay+0x2a/0x30 [\] ? wait_for_xmitr+0x32/0x90 [\] ? serial8250_console_write+0x204/0x250 [\] ? bad_area_nosemaphore+0xf/0x20 [\] ? univ8250_console_write+0x19/0x20 [\] ? irq_work_queue+0xf/0x70 [\] ? wake_up_klogd+0x2f/0x40 [\] ? console_unlock+0x51f/0x550 [\] ? vprintk_emit+0x273/0x430 [\] ? _raw_spin_lock+0x1b/0x20 [\] ? schedule+0x6b/0x580 [\] ? schedule+0x31/0x80 [\] ? do_exit+0x884/0xab0 [\] ? rewind_stack_do_exit+0x17/0x19 INFO: rcu_sched detected stalls on CPUs/tasks: 7-...: (2 GPs behind) idle=b43/140000000000000/0 softirq=7775986/7775986 fqs=84423 (detected by 0, t=178557 jiffies, g=985507, c=985506, q=588) Task dump for CPU 7: kworker/u16:2 D 0 8609 2 0x00000008 ffffc90006f5fc58 ffffffff8107d522 ffffc90006f5fc80 ffffc90006f5fc70 ffffffff812653ba ffffc90006f5fc80 ffffffff812653ea ffffc90006f5fca8 ffffffff812f6842 ffffffff8199e200 000000000000000a ffffffff81968c2d Call Trace: [\] ? oops_exit+0x22/0x30 [\] ? delay+0xa/0x10 [\] ? __const_udelay+0x2a/0x30 [\] wait_for_xmitr+0x32/0x90 [\] ? delay+0xa/0x10 [\] ? const_udelay+0x2a/0x30 [\] ? wait_for_xmitr+0x32/0x90 [\] ? serial8250_console_write+0x204/0x250 [\] ? bad_area_nosemaphore+0xf/0x20 [\] ? univ8250_console_write+0x19/0x20 [\] ? irq_work_queue+0xf/0x70 [\] ? wake_up_klogd+0x2f/0x40 [\] ? console_unlock+0x51f/0x550 [\] ? vprintk_emit+0x273/0x430 [\] ? _raw_spin_lock+0x1b/0x20 [\] ? schedule+0x6b/0x580 [\] ? schedule+0x31/0x80 [\] ? do_exit+0x884/0xab0 [\] ? rewind_stack_do_exit+0x17/0x19 INFO: rcu_sched detected stalls on CPUs/tasks: 7-...: (2 GPs behind) idle=b43/140000000000000/0 softirq=7775986/7775986 fqs=92284 (detected by 0, t=194312 jiffies, g=985507, c=985506, q=588) Task dump for CPU 7: kworker/u16:2 D 0 8609 2 0x00000008 ffffc90006f5fc58 ffffffff8107d522 ffffc90006f5fc80 ffffc90006f5fc70 ffffffff812653ba ffffc90006f5fc80 ffffffff812653ea ffffc90006f5fca8 ffffffff812f6842 ffffffff8199e200 000000000000000a ffffffff81968c2d Call Trace: [\] ? oops_exit+0x22/0x30 [\] ? delay+0xa/0x10 [\] ? const_udelay+0x2a/0x30 [\] wait_for_xmitr+0x32/0x90 [\] ? delay+0xa/0x10 [\] ? const_udelay+0x2a/0x30 [\] ? wait_for_xmitr+0x32/0x90 [\] ? serial8250_console_write+0x204/0x250 [\] ? bad_area_nosemaphore+0xf/0x20 [\] ? univ8250_console_write+0x19/0x20 [\] ? irq_work_queue+0xf/0x70 [\] ? wake_up_klogd+0x2f/0x40 [\] ? console_unlock+0x51f/0x550 [\] ? vprintk_emit+0x273/0x430 [\] ? _raw_spin_lock+0x1b/0x20 [\] ? schedule+0x6b/0x580 [\] ? schedule+0x31/0x80 [\] ? do_exit+0x884/0xab0 [\] ? rewind_stack_do_exit+0x17/0x19 INFO: rcu_sched detected stalls on CPUs/tasks: 7-...: (2 GPs behind) idle=b43/140000000000000/0 softirq=7775986/7775986 fqs=100145 (detected by 0, t=210067 jiffies, g=985507, c=985506, q=588) Task dump for CPU 7: kworker/u16:2 D 0 8609 2 0x00000008 ffffc90006f5fc58 ffffffff8107d522 ffffc90006f5fc80 ffffc90006f5fc70 ffffffff812653ba ffffc90006f5fc80 ffffffff812653ea ffffc90006f5fca8 ffffffff812f6842 ffffffff8199e200 000000000000000a ffffffff81968c2d Call Trace: [\] ? oops_exit+0x22/0x30 [\] ? delay+0xa/0x10 [\] ? const_udelay+0x2a/0x30 [\] wait_for_xmitr+0x32/0x90 [\] ? delay+0xa/0x10 [\] ? const_udelay+0x2a/0x30 [\] ? wait_for_xmitr+0x32/0x90 [\] ? serial8250_console_write+0x204/0x250 [\] ? bad_area_nosemaphore+0xf/0x20 [\] ? univ8250_console_write+0x19/0x20 [\] ? irq_work_queue+0xf/0x70 [\] ? wake_up_klogd+0x2f/0x40 [\] ? console_unlock+0x51f/0x550 [\] ? vprintk_emit+0x273/0x430 [\] ? _raw_spin_lock+0x1b/0x20 [\] ? schedule+0x6b/0x580 [\] ? schedule+0x31/0x80 [\] ? do_exit+0x884/0xab0 [\] ? rewind_stack_do_exit+0x17/0x19 INFO: rcu_sched detected stalls on CPUs/tasks: 7-...: (2 GPs behind) idle=b43/140000000000000/0 softirq=7775986/7775986 fqs=108006 (detected by 0, t=225822 jiffies, g=985507, c=985506, q=588) Task dump for CPU 7: kworker/u16:2 D 0 8609 2 0x00000008 ffffc90006f5fc58 ffffffff8107d522 ffffc90006f5fc80 ffffc90006f5fc70 ffffffff812653ba ffffc90006f5fc80 ffffffff812653ea ffffc90006f5fca8 ffffffff812f6842 ffffffff8199e200 000000000000000a ffffffff81968c2d Call Trace: [\] ? oops_exit+0x22/0x30 [\] ? delay+0xa/0x10 [\] ? __const_udelay+0x2a/0x30 [\] wait_for_xmitr+0x32/0x90 [\] ? delay+0xa/0x10 [\] ? const_udelay+0x2a/0x30 [\] ? wait_for_xmitr+0x32/0x90 [\] ? serial8250_console_write+0x204/0x250 [\] ? bad_area_nosemaphore+0xf/0x20 [\] ? univ8250_console_write+0x19/0x20 [\] ? irq_work_queue+0xf/0x70 [\] ? wake_up_klogd+0x2f/0x40 [\] ? console_unlock+0x51f/0x550 [\] ? vprintk_emit+0x273/0x430 [\] ? _raw_spin_lock+0x1b/0x20 [\] ? schedule+0x6b/0x580 [\] ? schedule+0x31/0x80 [\] ? do_exit+0x884/0xab0 [\] ? rewind_stack_do_exit+0x17/0x19 INFO: rcu_sched detected stalls on CPUs/tasks: 7-...: (2 GPs behind) idle=b43/140000000000000/0 softirq=7775986/7775986 fqs=115867 (detected by 0, t=241577 jiffies, g=985507, c=985506, q=588) Task dump for CPU 7: kworker/u16:2 D 0 8609 2 0x00000008 ffffc90006f5fc58 ffffffff8107d522 ffffc90006f5fc80 ffffc90006f5fc70 ffffffff812653ba ffffc90006f5fc80 ffffffff812653ea ffffc90006f5fca8 ffffffff812f6842 ffffffff8199e200 000000000000000a ffffffff81968c2d Call Trace: [\] ? oops_exit+0x22/0x30 [\] ? delay+0xa/0x10 [\] ? const_udelay+0x2a/0x30 [\] wait_for_xmitr+0x32/0x90 [\] ? delay+0xa/0x10 [\] ? const_udelay+0x2a/0x30 [\] ? wait_for_xmitr+0x32/0x90 [\] ? serial8250_console_write+0x204/0x250 [\] ? bad_area_nosemaphore+0xf/0x20 [\] ? univ8250_console_write+0x19/0x20 [\] ? irq_work_queue+0xf/0x70 [\] ? wake_up_klogd+0x2f/0x40 [\] ? console_unlock+0x51f/0x550 [\] ? vprintk_emit+0x273/0x430 [\] ? _raw_spin_lock+0x1b/0x20 [\] ? schedule+0x6b/0x580 [\] ? schedule+0x31/0x80 [\] ? do_exit+0x884/0xab0 [\] ? rewind_stack_do_exit+0x17/0x19 INFO: rcu_sched detected stalls on CPUs/tasks: 7-...: (2 GPs behind) idle=b43/140000000000000/0 softirq=7775986/7775986 fqs=123728 (detected by 0, t=257332 jiffies, g=985507, c=985506, q=588) Task dump for CPU 7: kworker/u16:2 D 0 8609 2 0x00000008 ffffc90006f5fc58 ffffffff8107d522 ffffc90006f5fc80 ffffc90006f5fc70 ffffffff812653ba ffffc90006f5fc80 ffffffff812653ea ffffc90006f5fca8 ffffffff812f6842 ffffffff8199e200 000000000000000a ffffffff81968c2d Call Trace: [\] ? oops_exit+0x22/0x30 [\] ? delay+0xa/0x10 [\] ? const_udelay+0x2a/0x30 [\] wait_for_xmitr+0x32/0x90 [\] ? delay+0xa/0x10 [\] ? const_udelay+0x2a/0x30 [\] ? wait_for_xmitr+0x32/0x90 [\] ? serial8250_console_write+0x204/0x250 [\] ? bad_area_nosemaphore+0xf/0x20 [\] ? univ8250_console_write+0x19/0x20 [\] ? irq_work_queue+0xf/0x70 [\] ? wake_up_klogd+0x2f/0x40 [\] ? console_unlock+0x51f/0x550 [\] ? vprintk_emit+0x273/0x430 [\] ? _raw_spin_lock+0x1b/0x20 [\] ? schedule+0x6b/0x580 [\] ? schedule+0x31/0x80 [\] ? do_exit+0x884/0xab0 [\] ? rewind_stack_do_exit+0x17/0x19 INFO: rcu_sched detected stalls on CPUs/tasks: 7-...: (2 GPs behind) idle=b43/140000000000000/0 softirq=7775986/7775986 fqs=131589 (detected by 0, t=273087 jiffies, g=985507, c=985506, q=588) Task dump for CPU 7: kworker/u16:2 D 0 8609 2 0x00000008 ffffc90006f5fc58 ffffffff8107d522 ffffc90006f5fc80 ffffc90006f5fc70 ffffffff812653ba ffffc90006f5fc80 ffffffff812653ea ffffc90006f5fca8 ffffffff812f6842 ffffffff8199e200 000000000000000a ffffffff81968c2d Call Trace: [\] ? oops_exit+0x22/0x30 [\] ? delay+0xa/0x10 [\] ? __const_udelay+0x2a/0x30 [\] wait_for_xmitr+0x32/0x90 [\] ? delay+0xa/0x10 [\] ? const_udelay+0x2a/0x30 [\] ? wait_for_xmitr+0x32/0x90 [\] ? serial8250_console_write+0x204/0x250 [\] ? bad_area_nosemaphore+0xf/0x20 [\] ? univ8250_console_write+0x19/0x20 [\] ? irq_work_queue+0xf/0x70 [\] ? wake_up_klogd+0x2f/0x40 [\] ? console_unlock+0x51f/0x550 [\] ? vprintk_emit+0x273/0x430 [\] ? _raw_spin_lock+0x1b/0x20 [\] ? schedule+0x6b/0x580 [\] ? schedule+0x31/0x80 [\] ? do_exit+0x884/0xab0 [\] ? rewind_stack_do_exit+0x17/0x19

fszyang commented 7 years ago

Still happening on

commit f621bae58cfe467e4cbe5c1dd8d630c3c67dc3e7

fszyang commented 7 years ago

Tried the stop/start test on invader2(same invader as previous) with a longer wait cycle (7 seconds) between stop and start, and it still hung, the CPU but did take longer this time (10 hours).

I see this messages repeated on console quite frequently throughout. Not sure if related to hang.

DMAR: [DMA Write] Request device [04:00.0] fault addr ffc0b000 [fault reason 05] PTE Write access is not set


Only able to capture the last snippet on the console this time when CPU hung: ... [\] complete+0x32/0x50 [\] mm_release+0x9e/0x100 [\] do_exit+0x284/0xab0 [\] rewind_stack_do_exit+0x17/0x19 Code: 00 00 00 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 41 89 f7 53 4c 8d 67 08 41 89 ce 48 83 ec 10 89 55 cc 48 8b 57 08 4c 89 45 d0 <48> 8b 32 49 39 d4 48 8d 42 e8 4c 8d 6e e8 75 05 eb 38 49 89 d5 RIP [] __wake_up_common+0x26/0x80 RSP CR2: 0000000000000009 ---[ end trace e8c82b947cc9e48e ]--- Fixing recursive fault but reboot is needed! BUG: unable to handle kernel paging request at ffffffffffffffd8 IP: [] kthread_data+0xb/0x20 PGD 7ea08067 PUD 7ea0a067 PMD 0 Oops: 0000 [#3] SMP Modules linked in: xt_nat ixgbevf ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc overlay iptable_raw nls_utf8 nls_cp437 vfat fat kvm_intel kvm uio_pci_dma i2c_i801 i2c_smbus autofs4 dm_mod ixgbe mdio CPU: 0 PID: 10675 Comm: kworker/u16:1 Tainted: G D 4.9.8-platina-mk1-amd64 #2 Hardware name: Intel Camelback Mountain Platina DC/Camelback Mountain Platina DC, BIOS 6a724f1-dirty 03/14/2017 task: ffff880465adb000 task.stack: ffffc9000b2b0000 RIP: 0010:[] [] kthread_data+0xb/0x20 RSP: 0018:ffffc9000b2b3e78 EFLAGS: 00010002 RAX: 0000000000000000 RBX: 0000000000015300 RCX: 0000000000000000 RDX: ffff88046d805000 RSI: ffff880465adb000 RDI: ffff880465adb000 RBP: ffffc9000b2b3e78 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000001 R11: 00000000000001d9 R12: ffff88047fc15300 R13: ffff880465adb000 R14: ffff880465adb4b0 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88047fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000028 CR3: 000000007ea07000 CR4: 00000000003406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Stack: ffffc9000b2b3e88 ffffffff81092a79 ffffc9000b2b3ec8 ffffffff814c328e ffffc9000b2b3ef0 ffff880465adb000 0000000000000009 0000000000000046 ffff880465adb000 0000000000000000 ffffc9000b2b3ee0 ffffffff814c3571 Call Trace: [\] wq_worker_sleeping+0x9/0x80 [\] __schedule+0x2ce/0x580 [\] schedule+0x31/0x80 [\] do_exit+0x884/0xab0 [\] rewind_stack_do_exit+0x17/0x19 Code: be 7d 04 00 00 48 c7 c7 b8 75 6d 81 e8 af 5c fe ff eb ca 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 48 8b 87 50 04 00 00 55 48 89 e5 <48> 8b 40 d8 5d c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 RIP [] kthread_data+0xb/0x20 RSP CR2: ffffffffffffffd8 ---[ end trace e8c82b947cc9e48f ]--- Fixing recursive fault but reboot is needed!

fszyang commented 7 years ago

Did the same test on invader7, and it did NOT hang over 13 hours.

Bother invader7 and invader2 have the same rev of goes f621bae58cfe467e4cbe5c1dd8d630c3c67dc3e7

and the same kernel 4.9.8-platina-mk1-amd64

SRIOV enabled

Saw similar messages on console as invader2. Less frequent and seem to happen in burst. DMAR: [DMA Write] Request device [04:00.0] fault addr ffc0e000 [fault reason 05] PTE Write access is not set DMAR: DRHD: handling fault status reg 500

fszyang commented 7 years ago

Tested with newer ixgbevf driver version 4.4 on kernel 4.9.8. Do not see the crash/hang anymore on invader2 which was hanging previously.