open-power-host-os / linux

Linux kernel source tree

Host crash while running host stress tests #20

Closed sathnaga closed 6 years ago

sathnaga commented 7 years ago
Mirrored with LTC bug https://bugzilla.linux.ibm.com/show_bug.cgi?id=160748

While trying to reproduce with host stress, I hit the host crash below during XFS stress tests:

enter ? for help
[link register ] c0000000002543f0 irq_work_run+0x30/0x50
[c000000ffff53cc0] c000000ffff53cf0 (unreliable)
[c000000ffff53cf0] c0000000001b7ca0 flush_smp_call_function_queue+0xf0/0x200
[c000000ffff53d70] c0000000000477ec smp_ipi_demux_relaxed+0x9c/0x110
[c000000ffff53db0] c0000000000903d4 icp_native_ipi_action+0x64/0x80
[c000000ffff53dd0] c000000000179420 __handle_irq_event_percpu+0x90/0x2d0
[c000000ffff53e90] c000000000179698 handle_irq_event_percpu+0x38/0x90
[c000000ffff53ed0] c00000000017fcf4 handle_percpu_irq+0x84/0xd0
[c000000ffff53f00] c000000000177b7c generic_handle_irq+0x4c/0x80
[c000000ffff53f20] c0000000000165d4 __do_irq+0x94/0x200
[c000000ffff53f90] c000000000029fa4 call_do_irq+0x14/0x24
[c0000007f87f3a50] c0000000000167dc do_IRQ+0x9c/0x110
[c0000007f87f3aa0] c000000000008c58 hardware_interrupt_common+0x158/0x160
--- Exception: 501 (Hardware Interrupt) at c0000000008eb664 snooze_loop+0xa4/0x190
[c0000007f87f3d90] c0000007f87f3dc0 (unreliable)
[c0000007f87f3dd0] c0000000008e83a4 cpuidle_enter_state+0xc4/0x3d0
[c0000007f87f3e30] c00000000015f73c call_cpuidle+0x4c/0x80
[c0000007f87f3e50] c00000000015fbe0 do_idle+0x2b0/0x350
[c0000007f87f3ec0] c00000000015fe8c cpu_startup_entry+0x3c/0x50
[c0000007f87f3ef0] c000000000048a74 start_secondary+0x4e4/0x530
[c0000007f87f3f90] c00000000000b16c start_secondary_prolog+0x10/0x14
b:mon>

sathnaga commented 7 years ago

jenkins_job_log.txt

sathnaga commented 7 years ago

It looks like this patch fixes the issue: https://www.spinics.net/lists/linux-fsdevel/msg117031.html

cdeadmin commented 7 years ago

------- Comment (attachment only) From diegodo@br.ibm.com 2017-10-31 13:38:03 EDT-------

------- Comment (attachment only) From diegodo@br.ibm.com 2017-10-31 13:39:35 EDT-------

cdeadmin commented 7 years ago

------- Comment (attachment only) From satheera@in.ibm.com 2017-11-02 06:48:20 EDT-------

cdeadmin commented 7 years ago

------- Comment From satheera@in.ibm.com 2017-11-10 04:17:55 EDT------- Hit another host crash while running with the patched kernel:

ltc-test-ci1 login: [82658.681159] 6-1.test[74093]: unhandled signal 11 at 0000000000002713 nip 00007fffb89a4a38 lr 000000001000061c code 1
[82659.782214] 5-1.test[78324]: unhandled signal 11 at 00000000000186a3 nip 00007fff815a4904 lr 0000000010000620 code 1
[82660.354907] 6-2.test[79590]: unhandled signal 11 at 00007fff7ecf0000 nip 0000000010000bbc lr 0000000010000bb0 code 1
[82660.448273] 6-1.test[80370]: unhandled signal 11 at 00007fffa2a90000 nip 0000000010000a68 lr 0000000010000a50 code 1
[82660.485638] 6-3.test[80670]: unhandled signal 11 at 00007fffaefb0000 nip 0000000010000ac8 lr 0000000010000ab0 code 1
[82664.082546] 12-1.test[34195]: unhandled signal 11 at 0000000000002713 nip 00007fffb5a34b18 lr 000000001000063c code 1
[82665.367560] 6-1.test[56157]: unhandled signal 11 at 0000000000002713 nip 00007fffa2a34aa8 lr 0000000010000620 code 1
[84700.426452] tm-signal-msr-r[54506]: bad frame in rt_sigreturn: 00007fffc1a415d0 nip 00007fffa9d4eff0 lr 00007fffa9f104d8
[84700.448761] tm-signal-stack[54517]: bad frame in setup_rt_frame: 0000000000000000 nip 0000000010000cc4 lr 0000000010000ca8
[84772.550674] Bad kernel stack pointer 7fffccc104c0 at c00000000000bffc
cpu 0x4a: Vector: 700 (Program Check) at [c00000003fc87d40]
pc: c00000000000bffc: fast_exception_return+0xac/0x150
lr: 0000000010001c34
sp: 7fffccc104c0
msr: 9000000102a03031
current = 0xc000000edd1d1a80
paca = 0xc00000000fd90900 softe: 0 irq_happened: 0x01
pid = 56240, comm = tm-signal-conte
Linux version 4.14.0-rc4+ (root@ltc-test-ci1.aus.stglabs.ibm.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-17) (GCC)) #2 SMP Tue Nov 7 07:38:54 EST 2017
WARNING: exception is not recoverable, can't continue
enter ? for help
SP (7fffccc104c0) is in userspace
4a:mon>

cdeadmin commented 6 years ago

------- Comment From satheera@in.ibm.com 2017-12-05 11:31:27 EDT------- Tested on 4.14.0-3.dev.git68b4afb.el7.centos.ppc64le.

cdeadmin commented 6 years ago

------- Comment From diegodo@br.ibm.com 2017-12-05 11:40:01 EDT------- I'll close this bug, since the patch is already in the hostos kernel tree and the bug is no longer reproducible.

Thank you