Open pridhiviraj opened 6 years ago
Looks like the TB resync has failed and hence the hmi recovery. TOD error will generate HMI on all the cores. Can you reproduce this issue with a TB error that generates HMI on one single core and then loop over to see after what count TB fails to re-sync ?
@maheshsal Yes, after 20k TB error's on a single core it is hitting the same issue.
0000080000000000
18517
0000080000000000
18518
0000080000000000
18519
0000080000000000
18520
[console-pexpect]#
[console-pexpect]#[ 1.374114245,3] CHIPTOD: TB "Not Set" TOD in error state
[18014398511.108890053,4] HMI: Failed to get TB in running state! CPU=41, TFMR=2a12800980a44000
[18014398511.108890053,4] HMI: Failed to get TB in running state! CPU=43, TFMR=2a12800980a44000
[18014398511.108890053,4] HMI: Failed to get TB in running state! CPU=40, TFMR=2a12800980a44000
[18014398511.108890053,4] HMI: Failed to get TB in running state! CPU=42, TFMR=2a12800980a44000
[66901.841558] opal: Hardware platform error: Unrecoverable HMI exception
[66903.108438] WARNING: CPU: 42 PID: 826 at /build/linux-GKZ1fU/linux-4.15.0/kernel/sched/core.c:1189 set_task_cpu+0x240/0x250
[66903.108541] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter devlink ip6_tables iptable_filter joydev input_leds mac_hid ofpart idt_89hpesx cmdlinepart powernv_flash mtd ipmi_powernv ipmi_devintf vmx_crypto at24 uio_pdrv_genirq uio ipmi_msghandler ibmpowernv opal_prd kvm_hv kvm sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ses enclosure ast i2c_algo_bit ttm hid_generic
[66903.109196] drm_kms_helper mpt3sas syscopyarea sysfillrect sysimgblt fb_sys_fops nvme usbhid crct10dif_vpmsum raid_class hid crc32c_vpmsum drm nvme_core i40e aacraid scsi_transport_sas
[66903.109352] CPU: 42 PID: 826 Comm: kworker/42:1 Not tainted 4.15.0-18-generic #19-Ubuntu
[66903.109424] Workqueue: events hmi_event_handler
[66903.109471] NIP: c00000000014d6e0 LR: c00000000014e30c CTR: c00000000015a240
[66903.109538] REGS: c0000000fe66af50 TRAP: 0700 Not tainted (4.15.0-18-generic)
[66903.109603] MSR: 9000000002823033 <SF,HV,VEC,VSX,FP,ME,IR,DR,RI,LE> CR: 28008424 XER: 00000000
[66903.109707] CFAR: c00000000014d54c SOFTE: 0
[66903.109707] GPR00: c00000000014e30c c0000000fe66b1d0 c0000000016eae00 c00020395bcac900
[66903.109707] GPR04: 0000000000000040 0000000000000040 0000000000000000 0000000000000000
[66903.109707] GPR08: c000000001721ee0 0000000000000000 0000000000000008 00000000000748d0
[66903.109707] GPR12: 0000000028008224 c00000000fa9ce00 c00000000013c788 c000003fe0b0ca80
[66903.109707] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000003fee840000
[66903.109707] GPR20: c000003fefa035c0 0000000000000000 0000000000000000 c0000000fe66b350
[66903.109707] GPR24: c00020395bcacd28 c00000000171dd78 c0000000011d8580 0000000000000000
[66903.109707] GPR28: 0000000000000004 0000000000000040 0000000000000040 c00020395bcac900
[66903.110286] NIP [c00000000014d6e0] set_task_cpu+0x240/0x250
[66903.110332] LR [c00000000014e30c] try_to_wake_up+0x1bc/0x660
[66903.110387] Call Trace:
[66903.110415] [c0000000fe66b1d0] [c0000000011d8580] runqueues+0x0/0xc00 (unreliable)
[66903.110484] [c0000000fe66b210] [c00000000014e30c] try_to_wake_up+0x1bc/0x660
[66903.110554] [c0000000fe66b290] [c0000000001725d8] autoremove_wake_function+0x28/0x70
[66903.110622] [c0000000fe66b2c0] [c000000000171b60] __wake_up_common+0xd0/0x200
[66903.110691] [c0000000fe66b330] [c000000000171d4c] __wake_up_common_lock+0xbc/0x110
[66903.110760] [c0000000fe66b3c0] [c00000000018ea40] wake_up_klogd_work_func+0x60/0xc0
[66903.110830] [c0000000fe66b3f0] [c000000000295d10] irq_work_run_list+0xb0/0x100
[66903.110901] [c0000000fe66b430] [c0000000001b7220] update_process_times+0x60/0x90
[66903.110970] [c0000000fe66b460] [c0000000001cef54] tick_sched_handle.isra.5+0x34/0xd0
[66903.111037] [c0000000fe66b490] [c0000000001cf050] tick_sched_timer+0x60/0xe0
[66903.111120] [c0000000fe66b4d0] [c0000000001b7db4] __hrtimer_run_queues+0x144/0x370
[66903.111188] [c0000000fe66b550] [c0000000001b8d0c] hrtimer_interrupt+0xfc/0x350
[66903.111258] [c0000000fe66b620] [c0000000000248f0] __timer_interrupt+0x90/0x260
[66903.111327] [c0000000fe66b670] [c000000000024d08] timer_interrupt+0x98/0xe0
[66903.111387] [c0000000fe66b6a0] [c000000000009014] decrementer_common+0x114/0x120
[66903.111460] --- interrupt: 901 at replay_interrupt_return+0x0/0x4
[66903.111460] LR = arch_local_irq_restore+0x74/0x90
[66903.111551] [c0000000fe66b990] [c0000000fe66b9d0] 0xc0000000fe66b9d0 (unreliable)
[66903.111622] [c0000000fe66b9b0] [c000000000cff180] _raw_spin_unlock_irqrestore+0x40/0xa0
[66903.111691] [c0000000fe66b9d0] [c00000000057d5fc] pstore_dump+0x31c/0x3e0
[66903.118016] [c0000000fe66bb10] [c00000000018ec64] kmsg_dump+0x134/0x1a0
[66903.124952] [c0000000fe66bb70] [c0000000000a2f14] pnv_platform_error_reboot+0x94/0x110
[66903.131898] [c0000000fe66bbe0] [c0000000000a842c] hmi_event_handler+0x1bc/0x1c0
[66903.140209] [c0000000fe66bc90] [c000000000133958] process_one_work+0x298/0x5a0
[66903.147152] [c0000000fe66bd20] [c000000000133cf8] worker_thread+0x98/0x630
[66903.154074] [c0000000fe66bdc0] [c00000000013c928] kthread+0x1a8/0x1b0
[66903.159636] [c0000000fe66be30] [c00000000000b528] ret_from_kernel_thread+0x5c/0xb4
[66903.167942] Instruction dump:
[66903.170729] 7faa3670 7d4a0194 57a706be 7d4a07b4 794a1f24 7d28502a 7d293c36 71290001
[66903.177660] 4082fe80 60000000 60000000 60420000 <0fe00000> 4bfffe6c 60000000 60420000
[66903.185965] ---[ end trace 739c2a2d69ccdc2b ]---
[67070.201983351,0] OPAL: Reboot requested due to Platform error.
[67070.205232792,3] OPAL: Reboot requested due to Platform error.[67070.208087613,5] Software initiated checkstop disabled.
[6[67070.210438942,5] OPAL: Reboot request...
6903.561762] opal: Reboot type 1 not supported for Unrecoverable HMI exception
On a second re-create also at the same count system got hung this time.
[ 90.410845] Severe Hypervisor Maintenance interrupt [Recovered]
[ 90.416395] ErroLOCK ERROR: Unlocked non-owned lock @0x30303698 (state: 0x0000004000000001)
[18014398766.211691469,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
[18014398766.211691469,7] HMI: [Loc: UOPWR.1302LDA-Node0-Proc0]: P:0 C:16 T:3: TFMR(2a12000980a44000) Timer Facility Error
[ 256.481751932,3] OPAL exiting with locks held, token=145 retval=0
[ 256.481756229,3] core/lock.c:216
[ 256.481758491,3] core/lock.c:216
[18014398766.211691469,0] Aborting!
r detail: TiCPU 0040 Backtrace:
S: 0000000031e07910 R: 000000003001a4f8 E ._abort+0x4c
S: 0000000031e07990 R: 0000000030017e88 E .lock_error+0x64
S: 0000000031e07a10 R: 0000000030017988 E .unlock+0x60
S: 0000000031e07a80 R: 0000000030017b40 E .lock_caller+0xf4
S: 0000000031e07b30 R: 0000000030039b60 E .__uart_do_poll+0x40
S: 0000000031e07c20 R: 000000003001ba70 E .opal_run_pollers+0x168
S: 0000000031e07ca0 R: 000000003001bafc E .opal_poll_events+0x74
S: 0000000031e07d20 R: 00000000300051e4 E opal_entry+0x134
--- OPAL call token: 0xa caller R1: 0xc000003fe6a77a30 ---
mer facility experienced an error
[ 90.422007] HMER: 0840000000000000
[ 90.426216] TFMR: 2a12000980a44000
[ 90.428871] Severe Hypervisor Maintenance interrupt [Recovered]
[ 90.435800] Error detail: Timer facility experienced an error
[ 90.441469] HMER: 0840000000000000
[ 90.444120] TFMR: 2a12000980a54000
[ 90.448281] Severe Hypervisor Maintenance interrupt [Recovered]
[ 90.453833] Error detail: Timer facility experienced an error
[ 90.459507] HMER: 0840000000000000
[ 90.463537] TFMR: 2a12000980a54000
[ 90.466319] Severe Hypervisor Maintenance interrupt [Recovered]
[ 90.473242] Error detail: Timer facility experienced an error
[ 90.478941] HMER: 0840000000000000
[ 90.481570] TFMR: 2a12000980a44000
[ 90.485722] Severe Hypervisor Maintenance interrupt [Recovered]
[ 90.491279] Error detail: Timer facility experienced an error
[ 90.496955] HMER: 0840000000000000
[ 9[ 256.476915661,0] Assert fail: core/mem_region.c:444:lock_held_by_me(®ion->free_list_lock)
0.500979] TFMR: 2a12000980a44000
[ 90.503752] Severe Hypervisor Maintenance interrupt [Recovered]
[ 90.509306] Error detail: Timer facility experienced an error
[ 90.516383] HMER: 0840000000000000
[ 90.519016] TFMR: 2a12000980a44000
[ 90.523163] Severe Hypervisor Maintenance interrupt [Recovered]
[ 90.528719] Error detail: Timer facility experienced an error
[ 90.534420] HMER: 0840000000000000
[ 90.538418] TFMR: 2a12000980a54000
[ 90.541197] Severe Hypervisor Maintenance interrupt [Recovered]
[ 90.546747] Error detail: Timer facility experienced an error
[ 90.553675] HMER: 0840000000000000
[ 90.556448] TFMR: 2a12000980a44000
[ 90.560606] Severe Hypervisor Maintenance interrupt [Recovered]
[ 90.566279] Error detail: Timer facility experienced an error
[ 90.571708] HMER: 0840000000000000
[ 90.574484] TFMR: 2a12000980a54000
[ 90.578635] Severe Hypervisor Maintenance interrupt [Recovered]
[ 90.584245] Error detail: Timer facility experienced an error
[ 90.589738] HMER: 0840000000000000
[ 90.593888] TFMR: 2a12000980a54000
[ 90.596664] Severe Hypervisor Maintenance interrupt [Recovered]
[ 90.603734] Error detail: Timer facility experienced an error
[ 90.609148] HMER: 0840000000000000
[ 90.611916] TFMR: 2a12000980a44000
[ 90.616103] Severe Hypervisor Maintenance interrupt [Recovered]
[ 90.621632] Error detail: Timer facility experienced an error
[ 90.627178] HMER: 0840000000000000
[ 90.631331] TFMR: 2a12000980a44000
[ 90.634145] Severe Hypervisor Maintenance interrupt [Recovered]
[ 90.641035] Error detail: Timer facility experienced an error
[ 90.646593] HMER: 0840000000000000
[ 90.649362] TFMR: 2a12000980a54000
[ 90.653689] Severe Hypervisor Maintenance interrupt [Recovered]
[ 90.659068] Error detail: Timer facility experienced an error
[ 90.664621] HMER: 0840000000000000
[ 90.668769] TFMR: 2a12000980a44000
[ 90.671584] Severe Hypervisor Maintenance interrupt [Recovered]
[ 90.677099] Error detail: Timer facility experienced an error
[ 90.684024] HMER: 0840000000000000
[ 90.686806] TFMR: 2a12000980a44000
[ 90.691016] Severe Hypervisor Maintenance interrupt [Recovered]
[ 90.696514] Error detail: Timer facility experienced an error
[ 90.702060] HMER: 0840000000000000
[ 90.706207] TFMR: 2a12000980a54000
[ 90.708994] Severe Hypervisor Maintenance interrupt [Recovered]
[ 90.714674] Error detail: Timer facility experienced an error
[ 90.720088] HMER: 0840000000000000
[ 90.724239] TFMR: 2a12000980a44000
[ 90.727020] Severe Hypervisor Maintenance interrupt [Recovered]
[ 90.733942] Error detail: Timer facility experienced an error
[ 90.739503] HMER: 0840000000000000
[ 90.742275] TFMR: 2a12000980a44000
[ 90.746464] Severe Hypervisor Maintenance interrupt [Recovered]
[ 90.751982] Error detail: Timer facility experienced an error
[ 90.757531] HMER: 0840000000000000
[ 90.761680] TFMR: 2a12000980a44000
[ 90.764552] Severe Hypervisor Maintenance interrupt [Recovered]
[ 90.771385] [ 256.476915661,3] HMI: Rendez-vous stage 1 timeout, CPU 0x43 waiting for thread 0 (sptr=00001ccc)
[18014398766.211691469,3] HMI: Rendez-vous stage 1 timeout, CPU 0x43 waiting for thread 1 (sptr=00001ccc)
[ 256.476915661,3] HMI: Rendez-vous stage 1 timeout, CPU 0x43 waiting for thread 2 (sptr=00001ccc)
[18014398514.386672589,3] HMI: Rendez-vous stage 1 timeout, CPU 0x43 waiting for thread 0 (sptr=00002ccc)
[18014398514.386672589,3] HMI: Rendez-vous stage 1 timeout, CPU 0x43 waiting for thread 1 (sptr=00002ccc)
[ 5.139896781,3] HMI: Rendez-vous stage 1 timeout, CPU 0x43 waiting for thread 2 (sptr=00002ccc)
[ 5.139896781,3] HMI: Rendez-vous stage 1 timeout, CPU 0x43 waiting for thread 0 (sptr=00003ccc)
[ 5.139896781,3] HMI: Rendez-vous stage 1 timeout, CPU 0x43 waiting for thread 1 (sptr=00003ccc)
[18014398514.386672589,3] HMI: Rendez-vous stage 1 timeout, CPU 0x43 waiting for thread 2 (sptr=00003ccc)
[ 5.139896781,4] HMI: Failed to get TB in running state! CPU=43, TFMR=2a12000980a44000
count:
0000080000000000
18282
0000080000000000
18283
0000080000000000
Paidipeddi Pridhiviraj notifications@github.com writes:
@maheshsal Yes, after 20k TB error's it is hitting the same issue on single core itself.
20,000 TB errors? Wow, we survive a lot of them!
I'd argue that this is something to look at after GA, if you've got that many TB errors, you possible have larger problems :)
-- Stewart Smith OPAL Architect, IBM.
By running TOD error recovery stress test, hitting a platform error followed by a system reboot.
OPAL level: