open-power-host-os / linux

Power9: Host crash during SMT change with guest emulator thread pinned "Oops: Kernel access of bad area, sig: 11 [#1]" #17

Closed: sathnaga closed this issue 6 years ago

sathnaga commented 7 years ago

Host Kernel: 4.13.0-4.rel.git49564cb.el7.centos.ppc64le

Steps to reproduce:

  1. Boot a guest (vm1).
  2. Pin the emulator thread to the last host CPU: virsh emulatorpin vm1 79 --live --config
  3. Change host SMT from 4 to 2: ppc64_cpu --smt=2 ====> the host crashes and becomes unresponsive
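
A likely reason the pinned CPU matters: when SMT drops from 4 to 2, the upper two hardware threads of each core go offline, which is consistent with the "no longer affine" messages in the log in this report (CPU2, CPU7, CPU10, CPU15, ...). The sketch below checks whether a given logical CPU survives an SMT change. It assumes threads are numbered contiguously per core (4 threads per core here, so CPU 79 would be thread 3 of core 19); the function name cpu_stays_online is hypothetical and not part of any tool.

```shell
#!/bin/sh
# Hypothetical helper: report whether a logical CPU stays online after an
# SMT change, ASSUMING threads are numbered contiguously within each core.
# Usage: cpu_stays_online CPU THREADS_PER_CORE NEW_SMT
cpu_stays_online() {
    cpu=$1
    threads_per_core=$2
    new_smt=$3
    # 0-based index of this hardware thread within its core.
    thread=$((cpu % threads_per_core))
    if [ "$thread" -lt "$new_smt" ]; then
        echo "online"
    else
        echo "offline"
    fi
}

cpu_stays_online 79 4 2   # the CPU the emulator thread is pinned to
cpu_stays_online 76 4 2   # first thread of the same core
```

Under these assumptions the first call prints "offline" and the second prints "online". In other words, the only CPU the emulator thread is allowed to run on disappears in step 3, so its tasks have to be moved during CPU hotplug handling, which appears to be the cpuset_hotplug_workfn path that shows up in the call trace.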

Part of the guest XML:

<domain type='kvm'>
  <name>vm1</name>
  <uuid>8914b703-4133-4564-bb39-108159f0f2b8</uuid>
  <memory unit='KiB'>4194304</memory>
  <currentMemory unit='KiB'>4194304</currentMemory>
  <vcpu placement='static'>4</vcpu>
  <cputune>
    <emulatorpin cpuset='79'/>
  </cputune>
  <os>
    <type arch='ppc64le' machine='pseries-2.10'>hvm</type>
    <boot dev='hd'/>
  </os>
  <cpu>
    <topology sockets='1' cores='4' threads='1'/>
  </cpu>

The host hangs and becomes unresponsive; it needs an external reboot to recover.

[175192.775110] IRQ 33: no longer affine to CPU2
[175193.513117] IRQ 51: no longer affine to CPU7
[175193.918060] IRQ 36: no longer affine to CPU10
[175194.898718] IRQ 32: no longer affine to CPU15
[175195.497593] IRQ 24: no longer affine to CPU23
[175195.847274] IRQ 59: no longer affine to CPU27
[175196.156829] IRQ 39: no longer affine to CPU31
[175196.514113] IRQ 38: no longer affine to CPU35
[175196.845370] IRQ 52: no longer affine to CPU38
[175197.016417] IRQ 50: no longer affine to CPU39
[175197.935579] irq_migrate_all_off_this_cpu: 1 callbacks suppressed
[175197.935582] IRQ 69: no longer affine to CPU51
[175198.195199] IRQ 56: no longer affine to CPU55
[175198.345390] IRQ 57: no longer affine to CPU62
[175198.506220] IRQ 28: no longer affine to CPU63
[175199.224386] IRQ 66: no longer affine to CPU71
[175199.554113] IRQ 35: no longer affine to CPU75
[175199.694068] IRQ 37: no longer affine to CPU78
[175199.852866] Unable to handle kernel paging request for data at address 0x000008c8
[175199.852938] Faulting instruction address: 0xc0000000001d0184
[175199.852953] Oops: Kernel access of bad area, sig: 11 [#1]
[175199.853004] SMP NR_CPUS=1024 
[175199.853005] NUMA 
[175199.853045] PowerNV
[175199.853098] Modules linked in: target_core_pscsi target_core_file target_core_iblock iscsi_target_mod target_core_mod iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache binfmt_misc vhost_net vhost tap xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables ses enclosure scsi_transport_sas ipmi_powernv ipmi_devintf ipmi_msghandler powernv_op_panel opal_prd nfsd auth_rpcgss oid_registry nfs_acl
[175199.853785]  lockd grace kvm_hv sunrpc kvm tg3 ptp pps_core
[175199.853856] CPU: 79 PID: 64710 Comm: kworker/79:2 Not tainted 4.13.0-4.rel.git49564cb.el7.centos.ppc64le #1
[175199.853961] Workqueue: events cpuset_hotplug_workfn
[175199.854014] task: c0000003a2a22600 task.stack: c0000003a2ac8000
[175199.854077] NIP: c0000000001d0184 LR: c0000000001d0170 CTR: c0000000001d0130
[175199.854153] REGS: c0000003a2acb710 TRAP: 0300   Not tainted  (4.13.0-4.rel.git49564cb.el7.centos.ppc64le)
[175199.854241] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>
[175199.854249]   CR: 448e2022  XER: 20040000
[175199.854349] CFAR: c0000000001c3db0 DAR: 00000000000008c8 DSISR: 40000000 SOFTE: 1 
[175199.854349] GPR00: c0000000001d0170 c0000003a2acb990 c000000001397a00 0000000000000000 
[175199.854349] GPR04: c0000003a2acb9b0 0000000000000000 c0000003a2acbab0 c000000245975678 
[175199.854349] GPR08: c000000245975678 c0000003a2acb948 c0000000015a7a00 0000000000000000 
[175199.854349] GPR12: c0000000001d0130 c00000000fdb1600 c000000000124348 c000000036de4e80 
[175199.854349] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000001 
[175199.854349] GPR20: c000000005be6940 c000000005be6960 0000000000000000 0000000000000000 
[175199.854349] GPR24: c000000001334de0 c0000000015a09e0 c000000001264488 c0000003a2acbab0 
[175199.854349] GPR28: c0000003aae63c00 c0000003a2acbaa0 c0000003a2acba10 0000000000000000 
[175199.855075] NIP [c0000000001d0184] cpuset_can_attach+0x54/0x1a0
[175199.855191] LR [c0000000001d0170] cpuset_can_attach+0x40/0x1a0
[175199.855304] Call Trace:
[175199.855355] [c0000003a2acb990] [c0000000001d0170] cpuset_can_attach+0x40/0x1a0 (unreliable)
[175199.855519] [c0000003a2acb9f0] [c0000000001c4dd4] cgroup_migrate_execute+0xc4/0x4c0
[175199.855657] [c0000003a2acba60] [c0000000001cc3d4] cgroup_transfer_tasks+0x1e4/0x380
[175199.855796] [c0000003a2acbb90] [c0000000001d2810] cpuset_hotplug_workfn+0x6e0/0x900
[175199.855934] [c0000003a2acbc90] [c00000000011bc00] process_one_work+0x1a0/0x490
[175199.856072] [c0000003a2acbd30] [c00000000011bf88] worker_thread+0x98/0x520
[175199.856188] [c0000003a2acbdc0] [c0000000001244a8] kthread+0x168/0x1b0
[175199.856304] [c0000003a2acbe30] [c00000000000bc60] ret_from_kernel_thread+0x5c/0x7c
[175199.856441] Instruction dump:
[175199.856513] fbc1fff0 fbe1fff8 f8010010 f821ffa1 38810020 7c7d1b78 4bff3c6d 60000000 
[175199.856655] 3f42ffed 3d420021 eb610020 3b5aca88 <e92308c8> 7f43d378 e9290000 f92a90c8 
[175199.856800] ---[ end trace 5aa84a7cf504a434 ]---
[175199.868456] 
[175201.708433] process 150492 (vhost-150463) no longer affine to cpu79

Mirrored with LTC bug #159341

cdeadmin commented 7 years ago

------- Comment From bssrikanth@in.ibm.com 2017-09-27 05:08:00 EDT------- A similar issue was also seen during Pegas 1.0 testing; see bug 159286.

cdeadmin commented 7 years ago

------- Comment From jamesspo@us.ibm.com 2017-10-20 14:10:40 EDT------- Moving to Sprint 2, but let's mention it in the announcement details.

cdeadmin commented 6 years ago

------- Comment From bssrikanth@in.ibm.com 2017-11-21 05:14:33 EDT------- I have asked Satheesh to test with the latest release branch.

sathnaga commented 6 years ago

Tested on the latest release branch, 4.14.0-1.rel.git68b4afb.el7.centos.ppc64le, and the issue is fixed.

# virsh destroy vm1;virsh start vm1
Domain vm1 destroyed

Domain vm1 started

# virsh emulatorpin vm1
emulator: CPU Affinity
----------------------------------
       *: 63

# virsh vcpupin vm1
VCPU: CPU Affinity
----------------------------------
   0: 0-63
   1: 0-63
   2: 0-63
   3: 0-63

# ppc64_cpu --smt
SMT=4
# ppc64_cpu --smt=2

# ppc64_cpu --smt
SMT=2

cdeadmin commented 6 years ago

------- Comment From satheera@in.ibm.com 2017-11-28 05:25:35 EDT-------

Regards, -Satheesh.