open-power / skiboot

OPAL boot and runtime firmware for POWER
Apache License 2.0
99 stars 134 forks source link

[ 2687.794321] opal: Hardware platform error: Unrecoverable HMI exception #174

Open pridhiviraj opened 6 years ago

pridhiviraj commented 6 years ago

By running TOD error recovery stress test, hitting a platform error followed by a system reboot.

[ 2853.104254431,7] HMI: [Loc: UOPWR.1302LDA-Node0-Proc1]: P:8 C:12 T:1: TFMR(2a12000980a84000) Timer Facility Error
[ 2853.104254431,7] HMI: [Loc: UOPWR.1302LDA-Node0-Proc1]: P:8 C:12 T:2: TFMR(2a12000980a84000) Timer Facility Error
[ 2853.104254431,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
[ 2853.104254431,7] HMI: [Loc: UOPWR.1302LDA-Node0-Proc1]: P:8 C:22 T:1: TFMR(2a12000980a94000) Timer Facility Error
[ 2853.104254431,7] HMI: [Loc: UOPWR.1302LDA-Node0-Proc1]: P:8 C:7 T:0: TFMR(2a12000980a85000) Timer Facility Error
[ 2853.104254431,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
[ 2853.104254431,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
[ 2853.104254431,7] HMI: [Loc: UOPWR.1302LDA-Node0-Proc1]: P:8 C:15 T:1: TFMR(2a12000980a94000) Timer Facility Error
[ 2853.104254431,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
[ 2853.104254431,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
[ 2853.104254431,7] HMI: [Loc: UOPWR.1302LDA-Node0-Proc1]: P:8 C:23 T:3: TFMR(2a12000980a84000) Timer Facility Error
[ 2853.104254431,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
[ 2853.104254431,7] HMI: [Loc: UOPWR.1302LDA-Node0-Proc1]: P:8 C:15 T:2: TFMR(2a12000980a94000) Timer Facility Error
[ 2853.104254431,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
[ 2853.104254431,7] HMI: [Loc: UOPWR.1302LDA-Node0-Proc1]: P:8 C:23 T:2: TFMR(2a12000980a84000) Timer Facility Error
[ 2853.104254431,7] HMI: [Loc: UOPWR.1302LDA-Node0-Proc1]: P:8 C:23 T:1: TFMR(2a12000980a85000) Timer Facility Error
[ 2853.104254431,7] HMI: [Loc: UOPWR.1302LDA-Node0-Proc1]: P:8 C:23 T:0: TFMR(2a12000980a84000) Timer Facility Error
[ 2853.104254431,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
[ 2853.104254431,7] HMI: [Loc: UOPWR.1302LDA-Node0-Proc1]: P:8 C:20 T:0: TFMR(2a12000980a85000) Timer Facility Error
[    1.039373791,3] CHIPTOD: TB "Not Set" TOD in error state
[ 2853.104254431,7] HMI: [Loc: UOPWR.1302LDA-Node0-Proc1]: P:8 C:13 T:2: TFMR(2a12000980a84000) Timer Facility Error
[ 2853.104254431,7] HMI: [Loc: UOPWR.1302LDA-Node0-Proc1]: P:8 C:20 T:3: TFMR(2a12000980a85000) Timer Facility Error
[ 2853.104254431,7] HMI: [Loc: UOPWR.1302LDA-Node0-Proc1]: P:8 C:20 T:1: TFMR(2a12000980a84000) Timer Facility Error
[ 2853.104254431,7] HMI: [Loc: UOPWR.1302LDA-Node0-Proc1]: P:8 C:20 T:2: TF2000980a84000) Timer Facility Error
[ 2853.104254431,7] HMI: [Loc: UOPWR.1302LDA-Node0-Proc1]: P:8 C:13 T:1: TFMR(2a12000980a84000) Timer Facility Error
[ 2853.104254431,7] HMI: [Loc: UOPWR.1302LDA-Node0-Proc1]: P:8 C:14 T:2: TFMR(2a12000980a84000) Timer Facility Error
[ 2853.104254431,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
[    1.039373791,4] HMI: Failed to get TB in running state! CPU=85f, TFMR=2a12800980a04000
[ 2853.104254431,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
[ 2853.104254431,7] HMI: [Loc: UOPWR.1302LDA-Node0-Proc1]: P:8 C:14 T:3: TFMR(2a12000980a84000) Timer Facility Error
[ 2853.104254431,7] HMI: [Loc: UOPWR.1302LDA-Node0-Proc1]: P:8 C:14 T:0: TFMR(2a12000980a84000) Timer Facility Error
[ 2853.104254431,7] HMI: [Loc: UOPWR.1302LDA-Node0-Proc1]: P:8 C:14 T:1: TFMR(2a12000980a84000) Timer Facility Error
[    1.039373791,4] HMI: Failed to get TB in running state! CPU=85d, TFMR=2a12800980a04000
[    1.039373791,4] HMI: Failed to get TB in running state! CPU=85c, TFMR=2a12800980a04000
[    1.039373791,4] HMI: Failed to get TB in running state! CPU=85e, TFMR=2a12800980a04000
[ 2853.104254431,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
[ 2853.104254431,7] HMI: [Loc: UOPWR.1302LDA-Node0-Proc1]: P:8 C:6 T:1: TFMR(2a12000980a84000) Timer Facility Error
[ 2853.104254431,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
[ 2853.104254431,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
[ 2853.104254431,7] HMI: [Loc: UOPWR.1302LDA-Node0-Proc1]: P:8 C:6 T:0: TFMR(2a12000980a84000) Timer Facility Error
[ 2853.104254431,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
[ 2853.104254431,7] HMI: [Loc: UOPWR.1302LDA-Node0-Proc1]: P:8 C:6 T:2: TFMR(2a12000980a84000) Timer Facility Error
[ 2853.104254431,7] HMI: [Loc: UOPWR.1302LDA-Node0-Proc1]: P:8 C:7 T:2: TFMR(2a12000980a84000) Timer Facility Error
[ 2853.104254431,7] HMI: [Loc: UOPWR.1302LDA-Node0-Proc1]: P:8 C:7 T:1: TFMR(2a12000980a84000) Timer Facility Error
[ 2853.104254431,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
[ 2853.104254431,7] HMI: [Loc: UOPWR.1302LDA-Node0-Proc1]: P:8 C:7 T:3: TFMR(2a12000980a84000) Timer Facility Error
[ 2687.794321] opal: Hardware platform error: Unrecoverable HMI exception
[ 2688.909240] WARNING: CPU: 118 PID: 904 at /build/linux-GKZ1fU/linux-4.15.0/kernel/sched/core.c:1189 set_task_cpu+0x240/0x250
[ 2688.909365] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables devlink iptable_filter input_leds joydev idt_89hpesx mac_hid ofpart cmdlinepart powernv_flash at24 ipmi_powernv ipmi_devintf uio_pdrv_genirq ipmi_msghandler mtd uio opal_prd vmx_crypto ibmpowernv kvm_hv kvm sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ses enclosure ast i2c_algo_bit hid_generic
[ 2688.910069]  ttm mpt3sas drm_kms_helper syscopyarea sysfillrect sysimgblt usbhid fb_sys_fops nvme hid crct10dif_vpmsum raid_class crc32c_vpmsum drm nvme_core i40e aacraid scsi_transport_sas
[ 2688.910249] CPU: 118 PID: 904 Comm: kworker/118:1 Not tainted 4.15.0-18-generic #19-Ubuntu
[ 2688.910327] Workqueue: events hmi_event_handler
[ 2688.910377] NIP:  c00000000014d6e0 LR: c00000000014e30c CTR: c00000000015a240
[ 2688.910450] REGS: c000203974b26f50 TRAP: 0700   Not tainted  (4.15.0-18-generic)
[ 2688.910536] MSR:  900000000282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 28008444  XER: 00000000
[ 2688.910634] CFAR: c00000000014d54c SOFTE: 0 
[ 2688.910634] GPR00: c00000000014e30c c000203974b271d0 c0000000016eae00 c000003e982d5900 
[ 2688.910634] GPR04: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
[ 2688.910634] GPR08: c000000001721ee0 0000000000000000 0000000000000000 0000000000005d04 
[ 2688.910634] GPR12: 0000000028008244 c00000000fad1200 c00000000013c788 c0002039759510c0 
[ 2688.910634] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000203993860000 
[ 2688.910634] GPR20: c000203994a235c0 0000000000000000 0000000000000000 c000203974b27350 
[ 2688.910634] GPR24: c000003e982d5d28 c00000000171dd78 c0000000011d8580 0000000000000000 
[ 2688.910634] GPR28: 0000000000000004 0000000000000000 0000000000000000 c000003e982d5900 
[ 2688.911257] NIP [c00000000014d6e0] set_task_cpu+0x240/0x250
[ 2688.911307] LR [c00000000014e30c] try_to_wake_up+0x1bc/0x660
[ 2688.911367] Call Trace:
[ 2688.911397] [c000203974b271d0] [c0000000011d8580] runqueues+0x0/0xc00 (unreliable)
[ 2688.911472] [c000203974b27210] [c00000000014e30c] try_to_wake_up+0x1bc/0x660
[ 2688.911548] [c000203974b27290] [c0000000001725d8] autoremove_wake_function+0x28/0x70
[ 2688.911622] [c000203974b272c0] [c000000000171b60] __wake_up_common+0xd0/0x200
[ 2688.911697] [c000203974b27330] [c000000000171d4c] __wake_up_common_lock+0xbc/0x110
[ 2688.911771] [c000203974b273c0] [c00000000018ea40] wake_up_klogd_work_func+0x60/0xc0
[ 2688.911846] [c000203974b273f0] [c000000000295d10] irq_work_run_list+0xb0/0x100
[ 2688.911937] [c000203974b27430] [c0000000001b7220] update_process_times+0x60/0x90
[ 2688.912011] [c000203974b27460] [c0000000001cef54] tick_sched_handle.isra.5+0x34/0xd0
[ 2688.912084] [c000203974b27490] [c0000000001cf050] tick_sched_timer+0x60/0xe0
[ 2688.912158] [c000203974b274d0] [c0000000001b7db4] __hrtimer_run_queues+0x144/0x370
[ 2688.912232] [c000203974b27550] [c0000000001b8d0c] hrtimer_interrupt+0xfc/0x350
[ 2688.912307] [c000203974b27620] [c0000000000248f0] __timer_interrupt+0x90/0x260
[ 2688.912382] [c000203974b27670] [c000000000024d08] timer_interrupt+0x98/0xe0
[ 2688.912446] [c000203974b276a0] [c000000000009014] decrementer_common+0x114/0x120
[ 2688.912522] --- interrupt: 901 at replay_interrupt_return+0x0/0x4
[ 2688.912522]     LR = arch_local_irq_restore+0x74/0x90
[ 2688.912619] [c000203974b27990] [c000203974b279d0] 0xc000203974b279d0 (unreliable)
[ 2688.912695] [c000203974b279b0] [c000000000cff180] _raw_spin_unlock_irqrestore+0x40/0xa0
[ 2688.912770] [c000203974b279d0] [c00000000057d5fc] pstore_dump+0x31c/0x3e0
[ 2688.918818] [c000203974b27b10] [c00000000018ec64] kmsg_dump+0x134/0x1a0
[ 2688.925755] [c000203974b27b70] [c0000000000a2f14] pnv_platform_error_reboot+0x94/0x110
[ 2688.934081] [c000203974b27be0] [c0000000000a842c] hmi_event_handler+0x1bc/0x1c0
[ 2688.941023] [c000203974b27c90] [c000000000133958] process_one_work+0x298/0x5a0
[ 2688.947953] [c000203974b27d20] [c000000000133cf8] worker_thread+0x98/0x630
[ 2688.954876] [c000203974b27dc0] [c00000000013c928] kthread+0x1a8/0x1b0
[ 2688.961814] [c000203974b27e30] [c00000000000b528] ret_from_kernel_thread+0x5c/0xb4
[ 2688.968751] Instruction dump:
[ 2688.971525] 7faa3670 7d4a0194 57a706be 7d4a07b4 794a1f24 7d28502a 7d293c36 71290001 
[ 2688.979834] 4082fe80 60000000 60000000 60420000 <0fe00000> 4bfffe6c 60000000 60420000 
[ 2688.988160] ---[ end trace c3fbb130bac0f441 ]---
[ 2855.374688141,0] OPAL: Reboot requested due to Platform error.
[ 2855.377232501,3] OPAL: Reboot requested due to Platform error.[ 2855.380086941,5] Software initiated checkstop disabled.
[ 2689.3640[ 2855.383112058,5] OPAL: Reboot request...
36] opal: Reboot type 1 not supported for Unrecoverable HMI exception

--== Welcome to Hostboot  ==--

  2.77291|secure|SecureROM valid - enabling functionality
  5.37968|Ignoring boot flags, incorrect version 0x0
  5.38625|Booting from SBE side 0 on master proc=00050000
  5.58026|ISTEP  6. 5 - host_init_fsi
  5.68847|ISTEP  6. 6 - host_set_ipl_parms
  5.94385|ISTEP  6. 7 - host_discover_targets
  7.74862|HWAS|PRESENT> DIMM[03]=F0F0000000000000
  7.74863|HWAS|PRESENT> Proc[05]=8800000000000000
  7.74864|HWAS|PRESENT> Core[07]=03FFCFCFCF0F0000
  7.94847|ISTEP  6. 8 - host_update_master_tpm
 15.48415|SECURE|Security Access Bit> 0xC000000000000000

OPAL level:

[   70.104446560,5] OPAL skiboot-v5.11-70-g5307c0ec7899-pc34e21f starting...
maheshsal commented 6 years ago

Looks like the TB resync has failed and hence the hmi recovery. TOD error will generate HMI on all the cores. Can you reproduce this issue with a TB error that generates HMI on one single core and then loop over to see after what count TB fails to re-sync ?

pridhiviraj commented 6 years ago

@maheshsal Yes, after 20k TB error's on a single core it is hitting the same issue.

0000080000000000
18517
0000080000000000
18518
0000080000000000
18519
0000080000000000
18520
[console-pexpect]#
[console-pexpect]#[    1.374114245,3] CHIPTOD: TB "Not Set" TOD in error state
[18014398511.108890053,4] HMI: Failed to get TB in running state! CPU=41, TFMR=2a12800980a44000
[18014398511.108890053,4] HMI: Failed to get TB in running state! CPU=43, TFMR=2a12800980a44000
[18014398511.108890053,4] HMI: Failed to get TB in running state! CPU=40, TFMR=2a12800980a44000
[18014398511.108890053,4] HMI: Failed to get TB in running state! CPU=42, TFMR=2a12800980a44000
[66901.841558] opal: Hardware platform error: Unrecoverable HMI exception
[66903.108438] WARNING: CPU: 42 PID: 826 at /build/linux-GKZ1fU/linux-4.15.0/kernel/sched/core.c:1189 set_task_cpu+0x240/0x250
[66903.108541] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter devlink ip6_tables iptable_filter joydev input_leds mac_hid ofpart idt_89hpesx cmdlinepart powernv_flash mtd ipmi_powernv ipmi_devintf vmx_crypto at24 uio_pdrv_genirq uio ipmi_msghandler ibmpowernv opal_prd kvm_hv kvm sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ses enclosure ast i2c_algo_bit ttm hid_generic
[66903.109196]  drm_kms_helper mpt3sas syscopyarea sysfillrect sysimgblt fb_sys_fops nvme usbhid crct10dif_vpmsum raid_class hid crc32c_vpmsum drm nvme_core i40e aacraid scsi_transport_sas
[66903.109352] CPU: 42 PID: 826 Comm: kworker/42:1 Not tainted 4.15.0-18-generic #19-Ubuntu
[66903.109424] Workqueue: events hmi_event_handler
[66903.109471] NIP:  c00000000014d6e0 LR: c00000000014e30c CTR: c00000000015a240
[66903.109538] REGS: c0000000fe66af50 TRAP: 0700   Not tainted  (4.15.0-18-generic)
[66903.109603] MSR:  9000000002823033 <SF,HV,VEC,VSX,FP,ME,IR,DR,RI,LE>  CR: 28008424  XER: 00000000
[66903.109707] CFAR: c00000000014d54c SOFTE: 0 
[66903.109707] GPR00: c00000000014e30c c0000000fe66b1d0 c0000000016eae00 c00020395bcac900 
[66903.109707] GPR04: 0000000000000040 0000000000000040 0000000000000000 0000000000000000 
[66903.109707] GPR08: c000000001721ee0 0000000000000000 0000000000000008 00000000000748d0 
[66903.109707] GPR12: 0000000028008224 c00000000fa9ce00 c00000000013c788 c000003fe0b0ca80 
[66903.109707] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000003fee840000 
[66903.109707] GPR20: c000003fefa035c0 0000000000000000 0000000000000000 c0000000fe66b350 
[66903.109707] GPR24: c00020395bcacd28 c00000000171dd78 c0000000011d8580 0000000000000000 
[66903.109707] GPR28: 0000000000000004 0000000000000040 0000000000000040 c00020395bcac900 
[66903.110286] NIP [c00000000014d6e0] set_task_cpu+0x240/0x250
[66903.110332] LR [c00000000014e30c] try_to_wake_up+0x1bc/0x660
[66903.110387] Call Trace:
[66903.110415] [c0000000fe66b1d0] [c0000000011d8580] runqueues+0x0/0xc00 (unreliable)
[66903.110484] [c0000000fe66b210] [c00000000014e30c] try_to_wake_up+0x1bc/0x660
[66903.110554] [c0000000fe66b290] [c0000000001725d8] autoremove_wake_function+0x28/0x70
[66903.110622] [c0000000fe66b2c0] [c000000000171b60] __wake_up_common+0xd0/0x200
[66903.110691] [c0000000fe66b330] [c000000000171d4c] __wake_up_common_lock+0xbc/0x110
[66903.110760] [c0000000fe66b3c0] [c00000000018ea40] wake_up_klogd_work_func+0x60/0xc0
[66903.110830] [c0000000fe66b3f0] [c000000000295d10] irq_work_run_list+0xb0/0x100
[66903.110901] [c0000000fe66b430] [c0000000001b7220] update_process_times+0x60/0x90
[66903.110970] [c0000000fe66b460] [c0000000001cef54] tick_sched_handle.isra.5+0x34/0xd0
[66903.111037] [c0000000fe66b490] [c0000000001cf050] tick_sched_timer+0x60/0xe0
[66903.111120] [c0000000fe66b4d0] [c0000000001b7db4] __hrtimer_run_queues+0x144/0x370
[66903.111188] [c0000000fe66b550] [c0000000001b8d0c] hrtimer_interrupt+0xfc/0x350
[66903.111258] [c0000000fe66b620] [c0000000000248f0] __timer_interrupt+0x90/0x260
[66903.111327] [c0000000fe66b670] [c000000000024d08] timer_interrupt+0x98/0xe0
[66903.111387] [c0000000fe66b6a0] [c000000000009014] decrementer_common+0x114/0x120
[66903.111460] --- interrupt: 901 at replay_interrupt_return+0x0/0x4
[66903.111460]     LR = arch_local_irq_restore+0x74/0x90
[66903.111551] [c0000000fe66b990] [c0000000fe66b9d0] 0xc0000000fe66b9d0 (unreliable)
[66903.111622] [c0000000fe66b9b0] [c000000000cff180] _raw_spin_unlock_irqrestore+0x40/0xa0
[66903.111691] [c0000000fe66b9d0] [c00000000057d5fc] pstore_dump+0x31c/0x3e0
[66903.118016] [c0000000fe66bb10] [c00000000018ec64] kmsg_dump+0x134/0x1a0
[66903.124952] [c0000000fe66bb70] [c0000000000a2f14] pnv_platform_error_reboot+0x94/0x110
[66903.131898] [c0000000fe66bbe0] [c0000000000a842c] hmi_event_handler+0x1bc/0x1c0
[66903.140209] [c0000000fe66bc90] [c000000000133958] process_one_work+0x298/0x5a0
[66903.147152] [c0000000fe66bd20] [c000000000133cf8] worker_thread+0x98/0x630
[66903.154074] [c0000000fe66bdc0] [c00000000013c928] kthread+0x1a8/0x1b0
[66903.159636] [c0000000fe66be30] [c00000000000b528] ret_from_kernel_thread+0x5c/0xb4
[66903.167942] Instruction dump:
[66903.170729] 7faa3670 7d4a0194 57a706be 7d4a07b4 794a1f24 7d28502a 7d293c36 71290001 
[66903.177660] 4082fe80 60000000 60000000 60420000 <0fe00000> 4bfffe6c 60000000 60420000 
[66903.185965] ---[ end trace 739c2a2d69ccdc2b ]---
[67070.201983351,0] OPAL: Reboot requested due to Platform error.
[67070.205232792,3] OPAL: Reboot requested due to Platform error.[67070.208087613,5] Software initiated checkstop disabled.
[6[67070.210438942,5] OPAL: Reboot request...
6903.561762] opal: Reboot type 1 not supported for Unrecoverable HMI exception
pridhiviraj commented 6 years ago

On a second re-create also at the same count system got hung this time.

[   90.410845] Severe Hypervisor Maintenance interrupt [Recovered]
[   90.416395]  ErroLOCK ERROR: Unlocked non-owned lock @0x30303698 (state: 0x0000004000000001)
[18014398766.211691469,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
[18014398766.211691469,7] HMI: [Loc: UOPWR.1302LDA-Node0-Proc0]: P:0 C:16 T:3: TFMR(2a12000980a44000) Timer Facility Error
[  256.481751932,3] OPAL exiting with locks held, token=145 retval=0
[  256.481756229,3]   core/lock.c:216
[  256.481758491,3]   core/lock.c:216
[18014398766.211691469,0] Aborting!
r detail: TiCPU 0040 Backtrace:
 S: 0000000031e07910 R: 000000003001a4f8 E ._abort+0x4c
 S: 0000000031e07990 R: 0000000030017e88 E .lock_error+0x64
 S: 0000000031e07a10 R: 0000000030017988 E .unlock+0x60
 S: 0000000031e07a80 R: 0000000030017b40 E .lock_caller+0xf4
 S: 0000000031e07b30 R: 0000000030039b60 E .__uart_do_poll+0x40
 S: 0000000031e07c20 R: 000000003001ba70 E .opal_run_pollers+0x168
 S: 0000000031e07ca0 R: 000000003001bafc E .opal_poll_events+0x74
 S: 0000000031e07d20 R: 00000000300051e4 E opal_entry+0x134
 --- OPAL call token: 0xa caller R1: 0xc000003fe6a77a30 ---
mer facility experienced an error
[   90.422007]  HMER: 0840000000000000
[   90.426216]  TFMR: 2a12000980a44000
[   90.428871] Severe Hypervisor Maintenance interrupt [Recovered]
[   90.435800]  Error detail: Timer facility experienced an error
[   90.441469]  HMER: 0840000000000000
[   90.444120]  TFMR: 2a12000980a54000
[   90.448281] Severe Hypervisor Maintenance interrupt [Recovered]
[   90.453833]  Error detail: Timer facility experienced an error
[   90.459507]  HMER: 0840000000000000
[   90.463537]  TFMR: 2a12000980a54000
[   90.466319] Severe Hypervisor Maintenance interrupt [Recovered]
[   90.473242]  Error detail: Timer facility experienced an error
[   90.478941]  HMER: 0840000000000000
[   90.481570]  TFMR: 2a12000980a44000
[   90.485722] Severe Hypervisor Maintenance interrupt [Recovered]
[   90.491279]  Error detail: Timer facility experienced an error
[   90.496955]  HMER: 0840000000000000
[   9[  256.476915661,0] Assert fail: core/mem_region.c:444:lock_held_by_me(&region->free_list_lock)
0.500979]   TFMR: 2a12000980a44000
[   90.503752] Severe Hypervisor Maintenance interrupt [Recovered]
[   90.509306]  Error detail: Timer facility experienced an error
[   90.516383]  HMER: 0840000000000000
[   90.519016]  TFMR: 2a12000980a44000
[   90.523163] Severe Hypervisor Maintenance interrupt [Recovered]
[   90.528719]  Error detail: Timer facility experienced an error
[   90.534420]  HMER: 0840000000000000
[   90.538418]  TFMR: 2a12000980a54000
[   90.541197] Severe Hypervisor Maintenance interrupt [Recovered]
[   90.546747]  Error detail: Timer facility experienced an error
[   90.553675]  HMER: 0840000000000000
[   90.556448]  TFMR: 2a12000980a44000
[   90.560606] Severe Hypervisor Maintenance interrupt [Recovered]
[   90.566279]  Error detail: Timer facility experienced an error
[   90.571708]  HMER: 0840000000000000
[   90.574484]  TFMR: 2a12000980a54000
[   90.578635] Severe Hypervisor Maintenance interrupt [Recovered]
[   90.584245]  Error detail: Timer facility experienced an error
[   90.589738]  HMER: 0840000000000000
[   90.593888]  TFMR: 2a12000980a54000
[   90.596664] Severe Hypervisor Maintenance interrupt [Recovered]
[   90.603734]  Error detail: Timer facility experienced an error
[   90.609148]  HMER: 0840000000000000
[   90.611916]  TFMR: 2a12000980a44000
[   90.616103] Severe Hypervisor Maintenance interrupt [Recovered]
[   90.621632]  Error detail: Timer facility experienced an error
[   90.627178]  HMER: 0840000000000000
[   90.631331]  TFMR: 2a12000980a44000
[   90.634145] Severe Hypervisor Maintenance interrupt [Recovered]
[   90.641035]  Error detail: Timer facility experienced an error
[   90.646593]  HMER: 0840000000000000
[   90.649362]  TFMR: 2a12000980a54000
[   90.653689] Severe Hypervisor Maintenance interrupt [Recovered]
[   90.659068]  Error detail: Timer facility experienced an error
[   90.664621]  HMER: 0840000000000000
[   90.668769]  TFMR: 2a12000980a44000
[   90.671584] Severe Hypervisor Maintenance interrupt [Recovered]
[   90.677099]  Error detail: Timer facility experienced an error
[   90.684024]  HMER: 0840000000000000
[   90.686806]  TFMR: 2a12000980a44000
[   90.691016] Severe Hypervisor Maintenance interrupt [Recovered]
[   90.696514]  Error detail: Timer facility experienced an error
[   90.702060]  HMER: 0840000000000000
[   90.706207]  TFMR: 2a12000980a54000
[   90.708994] Severe Hypervisor Maintenance interrupt [Recovered]
[   90.714674]  Error detail: Timer facility experienced an error
[   90.720088]  HMER: 0840000000000000
[   90.724239]  TFMR: 2a12000980a44000
[   90.727020] Severe Hypervisor Maintenance interrupt [Recovered]
[   90.733942]  Error detail: Timer facility experienced an error
[   90.739503]  HMER: 0840000000000000
[   90.742275]  TFMR: 2a12000980a44000
[   90.746464] Severe Hypervisor Maintenance interrupt [Recovered]
[   90.751982]  Error detail: Timer facility experienced an error
[   90.757531]  HMER: 0840000000000000
[   90.761680]  TFMR: 2a12000980a44000
[   90.764552] Severe Hypervisor Maintenance interrupt [Recovered]
[   90.771385]  [  256.476915661,3] HMI: Rendez-vous stage 1 timeout, CPU 0x43 waiting for thread 0 (sptr=00001ccc)
[18014398766.211691469,3] HMI: Rendez-vous stage 1 timeout, CPU 0x43 waiting for thread 1 (sptr=00001ccc)
[  256.476915661,3] HMI: Rendez-vous stage 1 timeout, CPU 0x43 waiting for thread 2 (sptr=00001ccc)
[18014398514.386672589,3] HMI: Rendez-vous stage 1 timeout, CPU 0x43 waiting for thread 0 (sptr=00002ccc)
[18014398514.386672589,3] HMI: Rendez-vous stage 1 timeout, CPU 0x43 waiting for thread 1 (sptr=00002ccc)
[    5.139896781,3] HMI: Rendez-vous stage 1 timeout, CPU 0x43 waiting for thread 2 (sptr=00002ccc)
[    5.139896781,3] HMI: Rendez-vous stage 1 timeout, CPU 0x43 waiting for thread 0 (sptr=00003ccc)
[    5.139896781,3] HMI: Rendez-vous stage 1 timeout, CPU 0x43 waiting for thread 1 (sptr=00003ccc)
[18014398514.386672589,3] HMI: Rendez-vous stage 1 timeout, CPU 0x43 waiting for thread 2 (sptr=00003ccc)
[    5.139896781,4] HMI: Failed to get TB in running state! CPU=43, TFMR=2a12000980a44000

count:

0000080000000000
18282
0000080000000000
18283
0000080000000000
ghost commented 6 years ago

Paidipeddi Pridhiviraj notifications@github.com writes:

@maheshsal Yes, after 20k TB error's it is hitting the same issue on single core itself.

20,000 TB errors? Wow, we survive a lot of them!

I'd argue that this is something to look at after GA, if you've got that many TB errors, you possible have larger problems :)

-- Stewart Smith OPAL Architect, IBM.