open-power-host-os / linux

Linux kernel source tree

Host crashed while running memhotplug guest_sanity tests with latest devel branch #24

Closed sathnaga closed 7 years ago

sathnaga commented 7 years ago
Mirrored with LTC bug https://bugzilla.linux.ibm.com/show_bug.cgi?id=160986

Host was running guest_sanity tests.

Kernel: 4.14.0-1.rc4.dev.gitb27fc5c.el7.centos.ppc64le

```
lr: d00000000b30e498: kvmppc_book3s_hv_page_fault+0xbb8/0xc40 [kvm_hv]
sp: c0000000ae89f850
msr: 900000010280b033
dar: d00000002b5bb20c
dsisr: 40000000
current = 0xc0000001c4003080
paca = 0xc00000000fd8f400 softe: 0 irq_happened: 0x01
pid = 46914, comm = CPU 3/KVM
Linux version 4.14.0-1.rc4.dev.gitb27fc5c.el7.centos.ppc64le (mockbuild@host-os-jenkins-slave03.aus.stglabs.ibm.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-17) (GCC)) #1 SMP Fri Oct 20 22:55:44 -02 2017
[66906.130198] KVM: CPU 44 seems to be stuck
[66906.130257] KVM: CPU 46 seems to be stuck
enter ? for help
[c0000000aee2b8b0] d00000000b30e498 kvmppc_book3s_hv_page_fault+0xbb8/0xc40 [kvm_hv]
[c0000000aee2b9e0] d00000000b30a078 kvmppc_vcpu_run_hv+0xdf8/0x1300 [kvm_hv]
[c0000000aee2bb30] d00000000b1348c4 kvmppc_vcpu_run+0x34/0x50 [kvm]
[c0000000aee2bb50] d00000000b130d54 kvm_arch_vcpu_ioctl_run+0x114/0x2a0 [kvm]
[c0000000aee2bbd0] d00000000b1239d8 kvm_vcpu_ioctl+0x598/0x7a0 [kvm]
[c0000000aee2bd40] c0000000003832e0 do_vfs_ioctl+0xd0/0x8c0
[c0000000aee2bde0] c000000000383ba4 SyS_ioctl+0xd4/0x130
[c0000000aee2be30] c00000000000b8e0 system_call+0x58/0x6c
--- Exception: c00 (System Call) at 00007fff8d0b674c
SP (7fff597fde60) is in userspace
8:mon>
```
sathnaga commented 7 years ago

jenkins.txt

paulusmack commented 7 years ago

After some digging, it looks like one vcpu task has handled a hypervisor page fault while the resize code is in the middle of making all the HPTEs absent. The technique which the resize code uses to exclude vcpus from running (set hpte_setup_done to 0 and send an IPI to all CPUs) doesn't actually work since another vcpu task could be in the host handling a page fault or a hcall at the time the IPI is sent, in which case that vcpu task will just handle the IPI and continue to re-enter the guest.

I'm currently trying to think of a reasonable way to fix this...
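
For readers following along, the race described above can be illustrated with a toy, single-threaded model. This is only a sketch under stated assumptions, not kernel code: the `Vcpu` class, the `simulate` function, and the `recheck_on_reentry` switch are all hypothetical names invented for the illustration. It also assumes that a vcpu kicked out of the guest by the IPI goes back through an entry path that checks `hpte_setup_done`. The `recheck_on_reentry=True` case only shows what closing the window would look like in this model; it is not a claim about how the actual fix works.

```python
from dataclasses import dataclass


@dataclass
class Vcpu:
    name: str
    in_guest: bool  # False means the vcpu task is currently running host code


def simulate(recheck_on_reentry: bool) -> list:
    """Return the names of vcpus that are in the guest while the resize is in progress."""
    hpte_setup_done = True

    vcpu0 = Vcpu("vcpu0", in_guest=True)
    # vcpu1 is already in the host, handling a page fault or an hcall.
    vcpu1 = Vcpu("vcpu1", in_guest=False)

    # Resize starts: clear the flag and "send an IPI" to every vcpu.
    hpte_setup_done = False
    for vcpu in (vcpu0, vcpu1):
        if vcpu.in_guest:
            vcpu.in_guest = False  # the IPI forces an exit from the guest
        # A vcpu already in the host just handles the IPI and carries on.

    # vcpu1 finishes its fault/hcall handling and re-enters the guest.
    # Without a recheck (hypothetical switch) it never looks at the flag.
    if not recheck_on_reentry or hpte_setup_done:
        vcpu1.in_guest = True

    # vcpu0 goes back through the entry path, which (by assumption) checks the flag.
    if hpte_setup_done:
        vcpu0.in_guest = True

    return [v.name for v in (vcpu0, vcpu1) if v.in_guest]


if __name__ == "__main__":
    print("no recheck:  ", simulate(recheck_on_reentry=False))
    print("with recheck:", simulate(recheck_on_reentry=True))
```

In this model, the run without the recheck leaves `vcpu1` in the guest while the HPTEs are being made absent, which matches the window described in the comment above.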

cdeadmin commented 7 years ago

------- Comment From bssrikanth@in.ibm.com 2017-11-07 01:47:28 EDT------- Paul Mackerras seems to have a patch that will hopefully fix this issue; I saw his comments on the host-os Slack channel.

sathnaga commented 7 years ago

The issue is fixed with the latest devel branch update, 4.14.0-2.rc8.dev.gitcc4bf22.el7.centos.ppc64le. I will wait for the release branch update carrying this fix before closing the issue.

cdeadmin commented 7 years ago

------- Comment From viparash@in.ibm.com 2017-11-07 04:37:28 EDT------- Bug 160904 has been marked as a duplicate of this bug.

cdeadmin commented 7 years ago

Fixed with latest devel branch update, 4.14.0-2.rc8.dev.gitcc4bf22.el7.centos.ppc64le.

cdeadmin commented 7 years ago

------- Comment From lagarcia@br.ibm.com 2017-11-10 20:55:34 EDT------- The sprint 3 hostos-release branch is closed for new commits. Targeting this one to sprint 4.

Paul,

Could you please cherry-pick this patch into hostos-release as soon as the sprint 4 hostos-release branch opens?

sathnaga commented 7 years ago

Verified in latest hostos release branch 4.14.0-1.rel.git68b4afb.el7.centos.ppc64le

(7/9) guest_sanity.hotplug.memory.qemu.qcow2.virtio_scsi.smp2.virtio_net.HostOS.ppc64le.powerkvm-libvirt.libvirt_mem.positive_test.hot_plug: PASS (88.70 s)

Regards, -Satheesh.