open-power-host-os / qemu

OpenPOWER Host OS qemu repository

Guest shuts-off with call trace after migrating between Boston and ZZ systems #19

Closed balamuruhans closed 6 years ago

balamuruhans commented 7 years ago
Mirrored with LTC bug https://bugzilla.linux.ibm.com/show_bug.cgi?id=160010

**Description:**

The guest goes into the shut-off state after migrating between Boston and ZZ systems. The migration itself succeeds without any issues, but after migration the guest on the destination hits a call trace and shuts off.

**Call Trace:**

```
[root@localhost ~]# [ 89.329008] Unable to handle kernel paging request for instruction fetch
[ 89.330591] Faulting instruction address: 0xc0000000000b19dc
[ 89.331872] Oops: Kernel access of bad area, sig: 11 [#1]
[ 89.333083] SMP NR_CPUS=1024
[ 89.333086] NUMA
[ 89.333762] pSeries
[ 89.334699] Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables virtio_balloon virtio_blk virtio_net virtio_scsi
[ 89.344729] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.13.0-4.rel.git49564cb.el7.centos.ppc64le #1
[ 89.347429] task: c000000001304780 task.stack: c000000001390000
[ 89.349149] NIP: c0000000000b19dc LR: c0000000000bfdf4 CTR: c00000000fd80000
[ 89.351232] REGS: c00000000fffb970 TRAP: 0400 Not tainted (4.13.0-4.rel.git49564cb.el7.centos.ppc64le)
[ 89.353951] MSR: 8000000040009033 <SF,EE,ME,IR,DR,RI,LE>
[ 89.353989] CR: 48002042 XER: 00000000
[ 89.356620] CFAR: c00000000037f7f8 SOFTE: 0
[ 89.356620] GPR00: 0000000028002042 c00000000fffbbf0 c000000001397a00 0000000000000000
[ 89.356620] GPR04: 0000000000000000 0000000000000000 0000000000000000 000000000000004e
[ 89.356620] GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000059
[ 89.356620] GPR12: c00000000065ffd0 c00000000fd80000 000000000dc5bd20 0000000000000060
[ 89.356620] GPR16: 0000000002cd41d8 fffffffffffffffd 000000000dc5bd20 000000000e453a80
[ 89.356620] GPR20: c000000001299f98 c00000000fffbd00 0000000000000001 0000000000000000
[ 89.356620] GPR24: 0000000000000000 c00000000169087c 0000000000000000 c00000000fffbd00
[ 89.356620] GPR28: 0000000000000010 c000000001690878 c000000001690738 c00000000169087c
[ 89.376064] NIP [c0000000000b19dc] plpar_hcall+0x38/0x58
[ 89.377556] LR [c0000000000bfdf4] hvc_get_chars+0x34/0x90
[ 89.379120] Call Trace:
[ 89.379843] [c00000000fffbbf0] [c0000001f1313c00] 0xc0000001f1313c00 (unreliable)
[ 89.382061] [c00000000fffbc80] [c000000000660108] hvterm_raw_get_chars+0x138/0x1e0
[ 89.384245] [c00000000fffbce0] [c000000000662990] hvc_poll+0x120/0x380
[ 89.386076] [c00000000fffbd80] [c000000000663d64] hvc_handle_interrupt+0x24/0x50
[ 89.388154] [c00000000fffbda0] [c000000000171680] __handle_irq_event_percpu+0x90/0x2d0
[ 89.390466] [c00000000fffbe60] [c0000000001718f8] handle_irq_event_percpu+0x38/0x90
[ 89.392735] [c00000000fffbea0] [c0000000001719b8] handle_irq_event+0x68/0xd0
[ 89.394748] [c00000000fffbed0] [c000000000176e74] handle_fasteoi_irq+0xc4/0x1f0
[ 89.396811] [c00000000fffbf00] [c00000000016fd9c] generic_handle_irq+0x4c/0x80
[ 89.398865] [c00000000fffbf20] [c000000000016624] __do_irq+0x94/0x200
[ 89.400717] [c00000000fffbf90] [c00000000002afe4] call_do_irq+0x14/0x24
[ 89.402592] [c0000000013939e0] [c00000000001682c] do_IRQ+0x9c/0x110
[ 89.404380] [c000000001393a30] [c000000000008c58] hardware_interrupt_common+0x158/0x160
[ 89.406641] --- interrupt: 501 at plpar_hcall_norets+0x1c/0x28
[ 89.406641] LR = check_and_cede_processor+0x34/0x50
[ 89.409814] [c000000001393d20] [c0000000008caa60] check_and_cede_processor+0x20/0x50 (unreliable)
[ 89.412348] [c000000001393d80] [c0000000008cae30] shared_cede_loop+0x50/0x160
[ 89.414346] [c000000001393db0] [c0000000008c7f64] cpuidle_enter_state+0xc4/0x3d0
[ 89.416427] [c000000001393e10] [c000000000157ccc] call_cpuidle+0x4c/0x80
[ 89.418304] [c000000001393e30] [c000000000158170] do_idle+0x2b0/0x350
[ 89.420133] [c000000001393ea0] [c000000000158418] cpu_startup_entry+0x38/0x40
[ 89.422177] [c000000001393ed0] [c00000000000d924] rest_init+0xf4/0x110
[ 89.424049] [c000000001393f00] [c000000000e14254] start_kernel+0x530/0x54c
[ 89.426003] [c000000001393f90] [c00000000000b27c] start_here_common+0x1c/0x520
[ 89.428121] Instruction dump:
[ 89.429004] 7c421378 7c000026 90010008 60000000 f8810028 7ca42b78 7cc53378 7ce63b78
[ 89.431251] 7d074378 7d284b78 7d495378 44000022 <e9810028> f88c0000 f8ac0008 f8cc0010
[ 89.433525] ---[ end trace 89de1301a015647c ]---
```

**Test Environment:**

```
Libvirt - 3.6.0-3.rel.gitdd9401b.el7.centos.ppc64le
Qemu - 2.10.0-2.rel.gitc334a4e.el7.centos.ppc64le
SLOF - SLOF-20170724-2.rel.gitea31295.el7.centos.noarch
Guest Kernel - 4.13.0-4.rel.git49564cb.el7.centos.ppc64le
Host Kernel - 4.13.0-4.rel.git49564cb.el7.centos.ppc64le
```

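
For triage, the key fields of an oops like the one in this report (trap vector, NIP symbol, LR symbol) can be pulled out mechanically. The following is only a rough sketch in Python: the helper name `summarize_oops` and the small `TRAP_NAMES` table are illustrative assumptions, not part of any kernel or qemu tooling, and the table covers just a few common powerpc interrupt vectors.

```python
import re

# Excerpt of the oops text from this report (whitespace as captured).
OOPS = """\
[ 89.349149] NIP: c0000000000b19dc LR: c0000000000bfdf4 CTR: c00000000fd80000
[ 89.351232] REGS: c00000000fffb970 TRAP: 0400 Not tainted (4.13.0-4.rel.git49564cb.el7.centos.ppc64le)
[ 89.376064] NIP [c0000000000b19dc] plpar_hcall+0x38/0x58
[ 89.377556] LR [c0000000000bfdf4] hvc_get_chars+0x34/0x90
"""

# Partial, illustrative table of powerpc interrupt vectors (assumption:
# only the handful of entries useful for this example are listed).
TRAP_NAMES = {
    0x300: "data storage interrupt (DSI)",
    0x400: "instruction storage interrupt (ISI)",
    0x500: "external interrupt",
    0x700: "program interrupt",
}


def summarize_oops(text):
    """Return the trap vector and the symbolized NIP/LR of an oops (hypothetical helper)."""
    trap = int(re.search(r"TRAP: ([0-9a-f]+)", text).group(1), 16)
    # The symbolized lines use "NIP [addr] symbol", unlike the raw "NIP: addr" line.
    nip = re.search(r"NIP \[([0-9a-f]+)\] (\S+)", text)
    lr = re.search(r"LR \[([0-9a-f]+)\] (\S+)", text)
    return {
        "trap": trap,
        "trap_name": TRAP_NAMES.get(trap, "unknown to this sketch"),
        "nip_symbol": nip.group(2),
        "lr_symbol": lr.group(2),
    }


summary = summarize_oops(OOPS)
for key, value in summary.items():
    print(key, "=", value)
```

Run against the excerpt, this reports trap 0x400 as an instruction storage interrupt, with NIP in `plpar_hcall` and LR in `hvc_get_chars`.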
cdeadmin commented 7 years ago

------- Comment From viparash@in.ibm.com 2017-10-23 04:31:48 EDT-------

The guest is shutting down because the tty driver crashes with an ISI exception. It is crashing in the hvc_get_chars() routine while executing the H_GET_TERM_CHAR hcall.
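
One way to cross-check this reading is to decode the words around the faulting address in the instruction dump. The sketch below is illustrative only: the helper names `classify` and `faulting_context` are hypothetical, and it decodes just the two instruction forms needed here, using the Power ISA encodings for `sc` (primary opcode 17) and `ld` (primary opcode 58). The dump is copied from the oops, where the word in angle brackets is the one at the faulting NIP.

```python
# Instruction dump from the oops; <...> marks the word at the faulting NIP.
DUMP = (
    "7c421378 7c000026 90010008 60000000 f8810028 7ca42b78 7cc53378 7ce63b78 "
    "7d074378 7d284b78 7d495378 44000022 <e9810028> f88c0000 f8ac0008 f8cc0010"
)


def classify(word):
    """Decode a 32-bit instruction word; only `sc` and `ld` are handled (sketch)."""
    opcode = word >> 26
    if opcode == 17 and (word & 0x2):
        lev = (word >> 5) & 0x7F  # LEV field; LEV=1 is the hypervisor-call form
        return "sc %d" % lev
    if opcode == 58 and (word & 0x3) == 0:
        rt = (word >> 21) & 0x1F
        ra = (word >> 16) & 0x1F
        disp = word & 0xFFFC
        if disp & 0x8000:  # sign-extend the 16-bit displacement
            disp -= 0x10000
        return "ld r%d,%d(r%d)" % (rt, disp, ra)
    return "(not decoded by this sketch)"


def faulting_context(dump):
    """Return the decoded word before the faulting one and the faulting word itself."""
    words = dump.split()
    idx = next(i for i, w in enumerate(words) if w.startswith("<"))
    previous = int(words[idx - 1], 16)
    faulting = int(words[idx].strip("<>"), 16)
    return classify(previous), classify(faulting)


before, at_fault = faulting_context(DUMP)
print("before fault:", before)
print("at fault:   ", at_fault)
```

The word immediately before the faulting address decodes as `sc 1`, the hypervisor-call instruction, so the instruction fetch that faulted is the first one after the hcall in `plpar_hcall`. That is consistent with the crash happening around the H_GET_TERM_CHAR call, though it does not by itself say whether the guest kernel or qemu is at fault.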

cdeadmin commented 7 years ago

------- Comment From viparash@in.ibm.com 2017-10-23 05:05:11 EDT------- (In reply to comment #2)

> Guest is shutting down due to tty driver crashing with ISI exception.
> Its crashing in hvc_get_chars() routine while executing H_GET_TERM_CHAR
> hcall.

Hi Leonardo,

The guest is crashing with an ISI exception in the tty driver soon after migration. It also fails to boot afterwards and crashes each time. This seems to be a qemu issue. Can you please have someone look into this?

cdeadmin commented 6 years ago

------- Comment From danielhb@br.ibm.com 2017-12-06 15:10:36 EDT------- I wasn't able to reproduce this bug. The test environment was:

I tried migrating with 2 different guests: Ubuntu 17.04 with kernel 4.10.0-28-ppc64le and Fedora 26 with kernel 4.13.16-202.fc26.ppc64le. With both guests, migration in both directions (ZZ -> Boston and Boston -> ZZ) worked without problems.

I see that the bug was opened against QEMU 2.10.0-2. It is likely that whatever caused the bug to happen is already fixed upstream.

cdeadmin commented 6 years ago

------- Comment From danielhb@br.ibm.com 2018-02-14 11:01:16 EDT------- I've re-read this bug and this looks remarkably similar to a kernel bug that was already sorted out here:

https://bugzilla.linux.ibm.com/show_bug.cgi?id=163870

This is the patch that fixes that bug:

https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?id=62e984ddfd6b056d399e24113f5e6a

It's not upstream yet, but it is already in the 'fixes' branch of the ppc kernel maintainer. As soon as it is upstream, we can get a patched Host OS kernel to verify whether it also fixes the bug we're seeing here.

cdeadmin commented 6 years ago

------- Comment From danielhb@br.ibm.com 2018-07-31 10:47:03 EDT------- Bala, can you re-test this bug with the latest HostOS? It should already be fixed.

cdeadmin commented 6 years ago

------- Comment From seg@us.ibm.com 2018-08-31 13:20:06 EDT------- No response in a long time, so just going to close.