Create Cntl+C/Abort Testcase for SNAP

fhaverkamp commented 6 years ago

Let us create a script starting a snap_memcopy in parallel multiple times and doing an abort in the middle of that operation. The assumption is that we can use that to test the stability of the hardware/software implementation.

Goal is that the card is fine after running that test case, no reset required, no system reset needed.

If we can survive that test case we can move that to the Nimbix scenario and ensure that that is running stable too.

fhaverkamp commented 6 years ago

I added a branch to perform some experiments: https://github.com/open-power/snap/tree/SIM_Sudden_DEATH_Testcase

acastellane commented 6 years ago

Sounds great. Will you add a "ctl c" catch routine to the snap_mem_copy application under test ? to make it work on line ? thanks.

fhaverkamp commented 6 years ago

haver@tul3eth3:~/snap$ for ((i=0;i<100;i++)); do echo "ITERATION $i ........" ; ./actions/hdl_example/tests/10140000_kill_test.sh -k2 -p64 -c1 -r ; done ITERATION 0 ........

Parallel TEST for card 0 starting 64 processes

Resetting card 0 (takes a while) .................... OK Check if card reappeared ... OK [main] Enter [snap_version] Enter SNAP on N250S Card, NVME enabled, 4096 MB SRAM available. SNAP FPGA Release: v1.2.2 Distance: 1 GIT: 0x2844be6a SNAP FPGA Build (Y/M/D): 2017/12/06 Time (H:M): 09:12 SNAP FPGA CIR Master: 1 My ID: 0 SNAP FPGA Up Time: 8 sec [snap_version] Exit [snap_m_init] Enter [unlock_action] Enter [unlock_action] Exit found Action: 0x10140000 [unlock_action] Enter [unlock_action] Exit found Action: 0x10140000 0 Max AT: 1 Found AT: 0x10140000 --> Assign Short AT: 0 0 0x10140000 0x00000000 IBM SNAP framework example

[snap_m_init] Exit rc: 0 [main] Exit rc: 0 (1) Check if card is replying to a single memcopy request (1) ... Thu Dec 7 11:10:39 CET 2017 (2) Perform massive stress and killing applications ... Starting snap_example in the background ... ok Waiting 2 seconds ... ... ok Sending SIGKILL to all ... Running: 24589 24590 24591 24592 24593 24594 24595 24596 24597 24598 24599 24600 24601 24602 24603 24604 24605 24606 24607 24608 24609 24610 24611 24612 24613 24614 24615 24616 24617 24618 24619 24620 24621 24622 24623 24624 24625 24626 24627 24628 24629 24630 24631 24632 24633 24634 24635 24636 24637 24638 24639 24640 24641 24642 24643 24644 24645 24646 24647 24648 24649 24650 24651 24652 Expected: 24589 24590 24591 24592 24593 24594 24595 24596 24597 24598 24599 24600 24601 24602 24603 24604 24605 24606 24607 24608 24609 24610 24611 24612 24613 24614 24615 24616 24617 24618 24619 24620 24621 24622 24623 24624 24625 24626 24627 24628 24629 24630 24631 24632 24633 24634 24635 24636 24637 24638 24639 24640 24641 24642 24643 24644 24645 24646 24647 24648 24649 24650 24651 24652 Still running: ok (3) Check logfiles for string "err" ... (4) Check if card is still replying to a single memcopy request ... Error: Can not attach Action: 10140000 Try to run snap_main tool err: Single snap_example took to long, please review results! ITERATION 1 ........

Parallel TEST for card 0 starting 64 processes

Resetting card 0 (takes a while) .................... OK Check if card reappeared ... OK [main] Enter [snap_version] Enter SNAP on N250S Card, NVME enabled, 4096 MB SRAM available. SNAP FPGA Release: v1.2.2 Distance: 1 GIT: 0x2844be6a SNAP FPGA Build (Y/M/D): 2017/12/06 Time (H:M): 09:12 SNAP FPGA CIR Master: 1 My ID: 0 SNAP FPGA Up Time: 8 sec [snap_version] Exit [snap_m_init] Enter [unlock_action] Enter [unlock_action] Exit found Action: 0x10140000 [unlock_action] Enter [unlock_action] Exit found Action: 0x10140000 0 Max AT: 1 Found AT: 0x10140000 --> Assign Short AT: 0 0 0x10140000 0x00000000 IBM SNAP framework example

[snap_m_init] Exit rc: 0 [main] Exit rc: 0 (1) Check if card is replying to a single memcopy request (1) ... Thu Dec 7 11:11:20 CET 2017 (2) Perform massive stress and killing applications ... Starting snap_example in the background ... ok Waiting 2 seconds ... ... ok Sending SIGKILL to all ... Running: 24766 24767 24768 24769 24770 24771 24772 24773 24774 24775 24776 24777 24778 24779 24780 24781 24782 24783 24784 24785 24786 24787 24788 24789 24790 24791 24792 24793 24794 24795 24796 24797 24798 24799 24800 24801 24802 24803 24804 24805 24806 24807 24808 24809 24810 24811 24812 24813 24814 24815 24816 24817 24818 24819 24820 24821 24822 24823 24824 24825 24826 24827 24828 24829 Expected: 24766 24767 24768 24769 24770 24771 24772 24773 24774 24775 24776 24777 24778 24779 24780 24781 24782 24783 24784 24785 24786 24787 24788 24789 24790 24791 24792 24793 24794 24795 24796 24797 24798 24799 24800 24801 24802 24803 24804 24805 24806 24807 24808 24809 24810 24811 24812 24813 24814 24815 24816 24817 24818 24819 24820 24821 24822 24823 24824 24825 24826 24827 24828 24829 Still running: ok (3) Check logfiles for string "err" ... (4) Check if card is still replying to a single memcopy request ... Error: Can not attach Action: 10140000 Try to run snap_main tool err: Single snap_example took to long, please review results! ITERATION 2 ........

Parallel TEST for card 0 starting 64 processes

Resetting card 0 (takes a while) .................... OK Check if card reappeared ... OK [main] Enter [snap_version] Enter SNAP on N250S Card, NVME enabled, 4096 MB SRAM available. SNAP FPGA Release: v1.2.2 Distance: 1 GIT: 0x2844be6a SNAP FPGA Build (Y/M/D): 2017/12/06 Time (H:M): 09:12 SNAP FPGA CIR Master: 1 My ID: 0 SNAP FPGA Up Time: 8 sec [snap_version] Exit [snap_m_init] Enter [unlock_action] Enter [unlock_action] Exit found Action: 0x10140000 [unlock_action] Enter [unlock_action] Exit found Action: 0x10140000 0 Max AT: 1 Found AT: 0x10140000 --> Assign Short AT: 0 0 0x10140000 0x00000000 IBM SNAP framework example

[snap_m_init] Exit rc: 0 [main] Exit rc: 0 (1) Check if card is replying to a single memcopy request (1) ... Thu Dec 7 11:12:02 CET 2017 (2) Perform massive stress and killing applications ... Starting snap_example in the background ... ok Waiting 2 seconds ... ... ok Sending SIGKILL to all ... Running: 24888 24889 24890 24891 24892 24893 24894 24895 24896 24897 24898 24899 24900 24901 24902 24903 24904 24905 24906 24907 24908 24909 24910 24911 24912 24913 24914 24915 24916 24917 24918 24919 24920 24921 24922 24923 24924 24925 24926 24927 24928 24929 24930 24931 24932 24933 24934 24935 24936 24937 24938 24939 24940 24941 24942 24943 24944 24945 24946 24947 24948 24949 24950 24951 Expected: 24888 24889 24890 24891 24892 24893 24894 24895 24896 24897 24898 24899 24900 24901 24902 24903 24904 24905 24906 24907 24908 24909 24910 24911 24912 24913 24914 24915 24916 24917 24918 24919 24920 24921 24922 24923 24924 24925 24926 24927 24928 24929 24930 24931 24932 24933 24934 24935 24936 24937 24938 24939 24940 24941 24942 24943 24944 24945 24946 24947 24948 24949 24950 24951 Still running: ok (3) Check logfiles for string "err" ... (4) Check if card is still replying to a single memcopy request ... Error: Can not attach Action: 10140000 Try to run snap_main tool err: Single snap_example took to long, please review results! ITERATION 3 ........

Parallel TEST for card 0 starting 64 processes

Resetting card 0 (takes a while) .................... OK Check if card reappeared ... OK [main] Enter [snap_version] Enter SNAP on N250S Card, NVME enabled, 4096 MB SRAM available. SNAP FPGA Release: v1.2.2 Distance: 1 GIT: 0x2844be6a SNAP FPGA Build (Y/M/D): 2017/12/06 Time (H:M): 09:12 SNAP FPGA CIR Master: 1 My ID: 0 SNAP FPGA Up Time: 8 sec [snap_version] Exit [snap_m_init] Enter [unlock_action] Enter [unlock_action] Exit found Action: 0x10140000 [unlock_action] Enter [unlock_action] Exit found Action: 0x10140000 0 Max AT: 1 Found AT: 0x10140000 --> Assign Short AT: 0 0 0x10140000 0x00000000 IBM SNAP framework example

[snap_m_init] Exit rc: 0 [main] Exit rc: 0 (1) Check if card is replying to a single memcopy request (1) ... Thu Dec 7 11:12:44 CET 2017

After 3 iterations the system faded away without leaving much of a trace. Please have a look if someone likes to repro this e.g. in the cloud. Command line for repro, see above.

fhaverkamp commented 6 years ago

haver@tul3eth3:~$ uptime
 11:28:18 up 11 min,  1 user,  load average: 0.00, 0.01, 0.00
haver@tul3eth3:~$

Magic recovery without FSP involvement happened. My assumption that an FSP reset would be required was wrong in this case. Let us try to repro more.

fhaverkamp commented 6 years ago

During test run we see following messages on the systems console:

tul3eth3 login: [ 2069.783285] PCI: Memory resource 0 not set for host bridge /pciex@3fffe40400000/pci@0/device@0 (domain 7)
[ 2092.394986] cxl afu0.0: PSL_SERR_An: 0x0800000000002801
[ 2092.395049] cxl afu0.0: PSL_FIR_SLICE_An: 0x0000000000000000
[ 2092.395125] cxl afu0.0: CXL_PSL_ErrStat_An: 0x0000000000000000
[ 2092.395189] cxl afu0.0: CXL_PSL_AFU_DEBUG_An: 0x0000000000000000
[ 2099.770628] Fatal Hypervisor Maintenance interrupt [Recovered]
[ 2099.770723]  Error detail: Malfunction Alert
[ 2099.770740]  HMER: 8040000000000000
[ 2099.770801]  Unknown Core check stop.
[ 2099.770853] Fatal Hypervisor Maintenance interrupt [Recovered]
[ 2099.770930]  Error detail: Malfunction Alert
[ 2099.770994]  HMER: 8040000000000000
[ 2099.771040]  Unknown Core check stop.
[ 2099.771101] Fatal Hypervisor Maintenance interrupt [Recovered]
[ 2099.771177]  Error detail: Malfunction Alert
[ 2099.771232]  HMER: 8040000000000000
[ 2099.771270]  Unknown Core check stop.
[ 2099.771314] Fatal Hypervisor Maintenance interrupt [Recovered]
[ 2099.771362] EEH: Fenced PHB#2 detected, location: N/A
[ 2099.771427]  Error detail: Malfunction Alert
[ 2099.771478]  HMER: 8040000000000000
[ 2099.771516]  Unknown Core check stop.
[ 2111.494913] PCI: Memory resource 0 not set for host bridge /pciex@3fffe40400000/pci@0/device@0 (domain 7)
[ 2134.112579] cxl afu0.0: PSL_SERR_An: 0x0800000000002801
[ 2134.112658] cxl afu0.0: PSL_FIR_SLICE_An: 0x0000000000000000
[ 2134.112723] cxl afu0.0: CXL_PSL_ErrStat_An: 0x0000000000000000
[ 2134.112788] cxl afu0.0: CXL_PSL_AFU_DEBUG_An: 0x0000000000000000
[ 2141.482695] Fatal Hypervisor Maintenance interrupt [Recovered]
[ 2141.482818]  Error detail: Malfunction Alert
[ 2141.482893]  HMER: 8040000000000000
[ 2141.482953]  Unknown Core check stop.
[ 2141.483033] Fatal Hypervisor Maintenance interrupt [Recovered]
[ 2141.483125]  Error detail: Malfunction Alert
[ 2141.483200]  HMER: 8040000000000000
[ 2141.483247]  Unknown Core check stop.
[ 2141.483286] Fatal Hypervisor Maintenance interrupt [Recovered]
[ 2141.483361]  Error detail: Malfunction Alert
[ 2141.483420]  HMER: 8040000000000000
[ 2141.483432]  Unknown Core check stop.
[ 2141.483483] Fatal Hypervisor Maintenance interrupt [Recovered]
[ 2141.483552]  Error detail: Malfunction Alert
[ 2141.483608]  HMER: 8040000000000000
[ 2141.483649]  Unknown Core check stop.
[ 2141.483889] EEH: Fenced PHB#2 detected, location: N/A
...

fhaverkamp commented 6 years ago

screen shot 2017-12-07 at 15 27 58

Here a case where after a while running my test case the system seemed to have rebooted.

fhaverkamp commented 6 years ago

And here the potential "Übeltäter":

[ 2612.583139] EEH: Notify device driver to resume
[ 2635.198410] CXL SLICE ERROR interrupt 456
[ 2635.198428] ------------[ cut here ]------------
[ 2635.198430] WARNING: at /build/linux-1xsttn/linux-4.4.0/drivers/misc/cxl/native.c:866
[ 2635.198432] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc kvm_hv kvm_pr kvm ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables vmx_crypto ibmpowernv ipmi_powernv powernv_rng ipmi_msghandler uio_pdrv_genirq leds_powernv uio autofs4 ses enclosure scsi_transport_sas cxl ipr
[ 2635.198460] CPU: 120 PID: 0 Comm: swapper/120 Tainted: G        W       4.4.0-98-generic #121-Ubuntu
[ 2635.198463] task: c000003ff233ddf0 ti: c000003fffbc0000 task.ti: c000003ff23fc000
[ 2635.198465] NIP: d00000001c324ea4 LR: d00000001c324ea0 CTR: c0000000005a15a0
[ 2635.198467] REGS: c000003fffbc3af0 TRAP: 0700   Tainted: G        W        (4.4.0-98-generic)
[ 2635.198469] MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 28004484  XER: 20000000
[ 2635.198476] CFAR: c000000000b18950 SOFTE: 0 
               GPR00: d00000001c324ea0 c000003fffbc3d70 d00000001c3507a8 000000000000001d 
               GPR04: c000000ffa609c50 c000000ffa61b4e0 000000000000002d 0000000000000b83 
               GPR08: 0000000000000007 0000000000000000 0000000ff9640000 0000000000000006 
               GPR12: 0000000000004400 c00000000fb87400 c00000000163ef18 c000000000fdd698 
               GPR16: 0000000000000001 c000003ff23fc080 c000000000fc9c28 0000000000000008 
               GPR20: c000003ff23fc000 c000000001588778 0000000000000001 0000000000000002 
               GPR24: 00000000000001c8 c000001ff116d800 0000000000000000 0000000000000000 
               GPR28: 0000000000000000 00000000000001c8 00000000000001c8 c000001fed2a1000 
[ 2635.198506] NIP [d00000001c324ea4] native_slice_irq_err+0x324/0x330 [cxl]
[ 2635.198513] LR [d00000001c324ea0] native_slice_irq_err+0x320/0x330 [cxl]
[ 2635.198515] Call Trace:
[ 2635.198522] [c000003fffbc3d70] [d00000001c324ea0] native_slice_irq_err+0x320/0x330 [cxl] (unreliable)
[ 2635.198526] [c000003fffbc3e10] [c000000000131880] handle_irq_event_percpu+0xa0/0x320
[ 2635.198529] [c000003fffbc3ed0] [c000000000131b68] handle_irq_event+0x68/0xc0
[ 2635.198533] [c000003fffbc3f00] [c000000000136e8c] handle_fasteoi_irq+0xec/0x2b0
[ 2635.198536] [c000003fffbc3f30] [c000000000130a64] generic_handle_irq+0x54/0x80
[ 2635.198539] [c000003fffbc3f60] [c000000000011260] __do_irq+0x80/0x1d0
[ 2635.198542] [c000003fffbc3f90] [c000000000024860] call_do_irq+0x14/0x24
[ 2635.198545] [c000003ff23ffa70] [c000000000011448] do_IRQ+0x98/0x140
[ 2635.198548] [c000003ff23ffac0] [c000000000002594] hardware_interrupt_common+0x114/0x180
[ 2635.198552] --- interrupt: 501 at arch_local_irq_restore+0x5c/0x90
                   LR = arch_local_irq_restore+0x40/0x90
[ 2635.198555] [c000003ff23ffdb0] [c000000ffa61d6a0] 0xc000000ffa61d6a0 (unreliable)
[ 2635.198558] [c000003ff23ffdd0] [c0000000009349c8] cpuidle_enter_state+0x1a8/0x410
[ 2635.198561] [c000003ff23ffe30] [c00000000011ac78] call_cpuidle+0x78/0xd0
[ 2635.198565] [c000003ff23ffe70] [c00000000011b04c] cpu_startup_entry+0x37c/0x490
[ 2635.198568] [c000003ff23fff30] [c0000000000456ec] start_secondary+0x33c/0x360
[ 2635.198571] [c000003ff23fff90] [c000000000008b6c] start_secondary_prolog+0x10/0x14
[ 2635.198573] Instruction dump:
[ 2635.198575] 8129fff8 2f890001 419efe6c 4bfffea4 60000000 60000000 60420000 3ce20000 
[ 2635.198580] 7fc4f378 e8678298 4801980d e8410018 <0fe00000> 4bfffd20 60420000 3c4c0003 
[ 2635.198586] ---[ end trace 6d002e155a649714 ]---
[ 2635.198593] cxl afu0.0: PSL_SERR_An: 0x0800000000002801
[ 2635.198657] cxl afu0.0: PSL_FIR_SLICE_An: 0x0000000000000000
[ 2635.198694] cxl afu0.0: CXL_PSL_ErrStat_An: 0x0000000000000000
[ 2635.198706] cxl afu0.0: CXL_PSL_AFU_DEBUG_An: 0x0000000000000000
[ 2640.076804] cxl-pci 0002:01:00.0: CXL reset
[ 2642.571567] Fatal Hypervisor Maintenance interrupt [Recovered]
[ 2642.571728]  Error detail: Malfunction Alert
[ 2642.571806]  HMER: 8040000000000000
[ 2642.571862]  Unknown Core check stop.
[ 2642.571884] EEH: Fenced PHB#2 detected, location: N/A
[ 2642.571982] Fatal Hypervisor Maintenance interrupt [Recovered]
[ 2642.572058]  Error detail: Malfunction Alert
[ 2642.572119]  HMER: 8040000000000000
[ 2642.572131]  Unknown Core check stop.
[ 2642.572177] Harmless Hypervisor Maintenance interrupt [Recovered]
[ 2642.572179]  Error detail: Unknown

fhaverkamp commented 6 years ago

haver@tul3eth3:~$ uname -r 4.4.0-98-generic

haver@tul3eth3:~$ lsb_release -a No LSB modules are available. Distributor ID: Ubuntu Description: Ubuntu 16.04.3 LTS Release: 16.04 Codename: xenial

I hope we have a recent kernel on that box. It should be Ubuntu.

fhaverkamp commented 6 years ago

[ 2593.485180] ------------[ cut here ]------------
[ 2593.485182] WARNING: at /build/linux-1xsttn/linux-4.4.0/drivers/misc/cxl/native.c:866
[ 2593.485184] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc kvm_hv kvm_pr kvm ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables vmx_crypto ibmpowernv ipmi_powernv powernv_rng ipmi_msghandler uio_pdrv_genirq leds_powernv uio autofs4 ses enclosure scsi_transport_sas cxl ipr
[ 2593.485212] CPU: 120 PID: 0 Comm: swapper/120 Tainted: G        W       4.4.0-98-generic #121-Ubuntu
[ 2593.485215] task: c000003ff233ddf0 ti: c000003fffbc0000 task.ti: c000003ff23fc000
[ 2593.485217] NIP: d00000001c324ea4 LR: d00000001c324ea0 CTR: c0000000005a15a0
[ 2593.485219] REGS: c000003fffbc3af0 TRAP: 0700   Tainted: G        W        (4.4.0-98-generic)
[ 2593.485221] MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 28004484  XER: 20000000
[ 2593.485228] CFAR: c000000000b18950 SOFTE: 0 
               GPR00: d00000001c324ea0 c000003fffbc3d70 d00000001c3507a8 000000000000001d 
               GPR04: c000000ffa609c50 c000000ffa61b4e0 000000000000002d 0000000000000b10 
               GPR08: 0000000000000007 0000000000000000 0000000ff9640000 0000000000000006 
               GPR12: 0000000000004400 c00000000fb87400 c00000000163ef18 c000000000fdd698 
               GPR16: 0000000000000001 c000003ff23fc080 c000000000fc9c28 0000000000000008 
               GPR20: c000003ff23fc000 c000000001588778 0000000000000001 0000000000000002 
               GPR24: 00000000000001c8 c000001ff116d800 0000000000000000 0000000000000000 
               GPR28: 0000000000000000 00000000000001c8 00000000000001c8 c0000000fc060800 
[ 2593.485258] NIP [d00000001c324ea4] native_slice_irq_err+0x324/0x330 [cxl]
[ 2593.485266] LR [d00000001c324ea0] native_slice_irq_err+0x320/0x330 [cxl]
[ 2593.485267] Call Trace:
[ 2593.485274] [c000003fffbc3d70] [d00000001c324ea0] native_slice_irq_err+0x320/0x330 [cxl] (unreliable)
[ 2593.485278] [c000003fffbc3e10] [c000000000131880] handle_irq_event_percpu+0xa0/0x320
[ 2593.485281] [c000003fffbc3ed0] [c000000000131b68] handle_irq_event+0x68/0xc0
[ 2593.485285] [c000003fffbc3f00] [c000000000136e8c] handle_fasteoi_irq+0xec/0x2b0
[ 2593.485288] [c000003fffbc3f30] [c000000000130a64] generic_handle_irq+0x54/0x80
[ 2593.485291] [c000003fffbc3f60] [c000000000011260] __do_irq+0x80/0x1d0
[ 2593.485294] [c000003fffbc3f90] [c000000000024860] call_do_irq+0x14/0x24
[ 2593.485297] [c000003ff23ffa70] [c000000000011448] do_IRQ+0x98/0x140
[ 2593.485300] [c000003ff23ffac0] [c000000000002594] hardware_interrupt_common+0x114/0x180
[ 2593.485304] --- interrupt: 501 at arch_local_irq_restore+0x5c/0x90
                   LR = arch_local_irq_restore+0x40/0x90
[ 2593.485307] [c000003ff23ffdb0] [c000000ffa61d6a0] 0xc000000ffa61d6a0 (unreliable)
[ 2593.485311] [c000003ff23ffdd0] [c0000000009349c8] cpuidle_enter_state+0x1a8/0x410
[ 2593.485314] [c000003ff23ffe30] [c00000000011ac78] call_cpuidle+0x78/0xd0
[ 2593.485317] [c000003ff23ffe70] [c00000000011b04c] cpu_startup_entry+0x37c/0x490
[ 2593.485321] [c000003ff23fff30] [c0000000000456ec] start_secondary+0x33c/0x360
[ 2593.485324] [c000003ff23fff90] [c000000000008b6c] start_secondary_prolog+0x10/0x14
[ 2593.485325] Instruction dump:
[ 2593.485327] 8129fff8 2f890001 419efe6c 4bfffea4 60000000 60000000 60420000 3ce20000 
[ 2593.485333] 7fc4f378 e8678298 4801980d e8410018 <0fe00000> 4bfffd20 60420000 3c4c0003 
[ 2593.485339] ---[ end trace 6d002e155a649713 ]---
[ 2593.485346] cxl afu0.0: PSL_SERR_An: 0x0800000000002801
[ 2593.485408] cxl afu0.0: PSL_FIR_SLICE_An: 0x0000000000000000
[ 2593.485444] cxl afu0.0: CXL_PSL_ErrStat_An: 0x0000000000000000
[ 2593.485456] cxl afu0.0: CXL_PSL_AFU_DEBUG_An: 0x0000000000000000
[ 2598.357820] cxl-pci 0002:01:00.0: CXL reset
[ 2600.859128] Fatal Hypervisor Maintenance interrupt [Recovered]

There is more of that and finally the system reboots.

fbarrat commented 6 years ago

From above, it seems the first sign of things going wrong is an EEH (timestamp 2612). We only have the last trace of it, i.e. the driver is resuming operation. I don't know the cause of the EEH. It could be a previous card reset. I can't tell.

Then the cxl driver receives a slice error interrupt:

[ 2612.583139] EEH: Notify device driver to resume [ 2635.198410] CXL SLICE ERROR interrupt 456 [ 2635.198428] ------------[ cut here ]------------ [ 2635.198430] WARNING: at /build/linux-1xsttn/linux-4.4.0/drivers/misc/cxl/native.c:866 [ 2635.198432] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc kvm_hv kvm_pr kvm ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables vmx_crypto ibmpowernv ipmi_powernv powernv_rng ipmi_msghandler uio_pdrv_genirq leds_powernv uio autofs4 ses enclosure scsi_transport_sas cxl ipr [ 2635.198460] CPU: 120 PID: 0 Comm: swapper/120 Tainted: G W 4.4.0-98-generic #121-Ubuntu [ 2635.198463] task: c000003ff233ddf0 ti: c000003fffbc0000 task.ti: c000003ff23fc000 [ 2635.198465] NIP: d00000001c324ea4 LR: d00000001c324ea0 CTR: c0000000005a15a0 [ 2635.198467] REGS: c000003fffbc3af0 TRAP: 0700 Tainted: G W (4.4.0-98-generic) [ 2635.198469] MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 28004484 XER: 20000000 [ 2635.198476] CFAR: c000000000b18950 SOFTE: 0 GPR00: d00000001c324ea0 c000003fffbc3d70 d00000001c3507a8 000000000000001d GPR04: c000000ffa609c50 c000000ffa61b4e0 000000000000002d 0000000000000b83 GPR08: 0000000000000007 0000000000000000 0000000ff9640000 0000000000000006 GPR12: 0000000000004400 c00000000fb87400 c00000000163ef18 c000000000fdd698 GPR16: 0000000000000001 c000003ff23fc080 c000000000fc9c28 0000000000000008 GPR20: c000003ff23fc000 c000000001588778 0000000000000001 0000000000000002 GPR24: 00000000000001c8 c000001ff116d800 0000000000000000 0000000000000000 GPR28: 0000000000000000 00000000000001c8 00000000000001c8 c000001fed2a1000 [ 2635.198506] NIP [d00000001c324ea4] native_slice_irq_err+0x324/0x330 [cxl] [ 2635.198513] LR [d00000001c324ea0] native_slice_irq_err+0x320/0x330 [cxl] [ 2635.198515] Call Trace: [ 2635.198522] [c000003fffbc3d70] [d00000001c324ea0] native_slice_irq_err+0x320/0x330 [cxl] (

[ ... cutting the middle of the stack back trace... ]

[ 2635.198586] ---[ end trace 6d002e155a649714 ]--- [ 2635.198593] cxl afu0.0: PSL_SERR_An: 0x0800000000002801 [ 2635.198657] cxl afu0.0: PSL_FIR_SLICE_An: 0x0000000000000000 [ 2635.198694] cxl afu0.0: CXL_PSL_ErrStat_An: 0x0000000000000000

A slice error interrupt is sent by the PSL when it detects a problem with the AFU. Typically a misbehaving AFU. The details are to be found in the PSL_SERR_An register. A later version of the driver prints a message decoding it, but not on that version of ubuntu. Here it is: PSL_SERR_An = 0x0800000000002801 => bit 4 is set. From the PSL doc, the cause is:

"Bad Context Handle. In AFU_Directed Mode only, the accelerator issued a request using a bad context handle. (The retrieved SWSTATE is not valid , PSL_SPAP[V]=0). The request received a context response so this notification is redundant and should be masked. In other modes this bit is reserved."

We are in AFU directed mode, so the comment applies. You would need to consult with a PSL expert here, but it seems that the AFU is trying to use a dead context.

Not sure if it is the cause, but since you mentioned the hw may not handle the LLCMD handshake correctly, be aware that as soon as the driver sees an ack from the AFU about the termination, the context is dead/removed from the host. It's really important to be sure that the AFU will not try to use the context after the termination LLCMD.

After the slice error, it seems the driver/adapter was reset. It's not clear from the log what happens after that, but after 6 resets within an hour, the EEH will consider the card bad and remove it from the system (that behavior can be tuned if needed, but you must have heard it before). The node rebooting is not expected, of course. That needs to be investigated.

To summarize, there seems to be 2 sources of problem here:

the AFU is causing a slice error
the node is rebooting after a certain number of card resets (to be verified?). The first thing to do here is enable xmon to catch if it is a linux crash.

fhaverkamp commented 6 years ago

[ 3597.925087] Unknown Core check stop. [ 3597.925151] Fatal Hypervisor Maintenance interrupt [Recovered] [ 3597.925230] Error detail: Malfunction Alert [ 3597.925296] HMER: 8040000000000000 [ 3597.925345] Unknown Core check stop. [ 3609.540442] PCI: Memory resource 0 not set for host bridge /pciex@3fffe40400000/pci@0/device@0 (domain 7) [ 0.000000] opal: OPAL V3 detected ! [ 0.000000] Using PowerNV machine description [ 0.000000] Page sizes from device-tree: [ 0.000000] base_shift=12: shift=12, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=0 [ 0.000000] base_shift=12: shift=16, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=7 [ 0.000000] base_shift=12: shift=24, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=56 [ 0.000000] base_shift=16: shift=16, sllp=0x0110, avpnm=0x00000000, tlbiel=1, penc=1 [ 0.000000] base_shift=16: shift=24, sllp=0x0110, avpnm=0x00000000, tlbiel=1, penc=8 [ 0.000000] base_shift=24: shift=24, sllp=0x0100, avpnm=0x00000001, tlbiel=0, penc=0 [ 0.000000] base_shift=34: shift=34, sllp=0x0120, avpnm=0x000007ff, tlbiel=0, penc=3 [ 0.000000] Using 1TB segments [ 0.000000] Found initrd at 0xc000000028000000:0xc00000002865f1d0

I think this is once it starts rebooting. It takes 15min to repro the problem. @fbarrat I think the access of the AFU to a dead context is basically expected. The AFU is doing DMA operations on behalf of the dying process and since the handshake is not yet implemented, it seems logical to me that we see that. I assume that it is blocked and that normal operation could continue later on. The sudden reboots is what worries me a bit.

The handshake procedure is going to be fixed once our hardware folks have figured out what they need to change to implement it.

fhaverkamp commented 6 years ago

@fbarrat The reset of the card is part of the test-procedure to recover it to normal operation. The EEH retry I am setting to a really high number. See also the test case.

See here too: https://github.com/open-power/snap/blob/SIM_Sudden_DEATH_Testcase/actions/hdl_example/tests/10140000_kill_test.sh

fbarrat commented 6 years ago

the "Unknown Core check stop" is meaningless in itself. It's just the HMI handler not finding the cause of the HMI. I seem to remember we used to see that message during a capi card reset (it has been fixed on recent kernels), because an HMI is raised in that situation.

We still see a few messages after, so something else happened just after. Either a kernel crash, or a real system checkstop, i.e. abrupt death

fhaverkamp commented 6 years ago

I just merged the test-case into master: https://github.com/open-power/snap/pull/538

Please use that instead of the branch, if you like to repro what I did.

fhaverkamp commented 6 years ago

@acastellane joined group of persons who can reboot a System p with help of my test-script ;-).

fhaverkamp commented 6 years ago

Frederick requested to run the experiment with xmon enabled, such that he can figure out if the reboot is triggered by a kernel problem or by a machine check. Let's see...

fhaverkamp commented 6 years ago

Today I tried Svens latest bitstream and did some manual testing. Here is what I saw so far:

haver@tul3eth3:~/snap$ ./actions/hdl_example/sw/snap_example -C0 -a2 -S100
haver@tul3eth3:~/snap$ ./actions/hdl_example/sw/snap_example -C0 -a2 -S10000
haver@tul3eth3:~/snap$ ./actions/hdl_example/sw/snap_example -C0 -a2 -S100000
haver@tul3eth3:~/snap$ ./actions/hdl_example/sw/snap_example -C0 -a2 -S1000000
FAILED: posix_memalign(): Cannot allocate memory
haver@tul3eth3:~/snap$ ./actions/hdl_example/sw/snap_example -C0 -a2 -S10000
haver@tul3eth3:~/snap$ ./actions/hdl_example/sw/snap_example -C0 -a2 -S10000 -N10
haver@tul3eth3:~/snap$ ./actions/hdl_example/sw/snap_example -C0 -a2 -S10000 -N1000
^C

haver@tul3eth3:~/snap$ 
haver@tul3eth3:~/snap$ 
haver@tul3eth3:~/snap$ 
haver@tul3eth3:~/snap$ ./actions/hdl_example/sw/snap_example -C0 -a2 -S10000 -N1
ERROR: snap_card_alloc_dev(/dev/cxl/afu0.0s)
haver@tul3eth3:~/snap$ ./software/tools/snap_maint -C0 -vv
[main] Enter
[main] Exit rc: 19
haver@tul3eth3:~/snap$ ll /dev/cxl/afu0.0
afu0.0   afu0.0m  afu0.0s  
haver@tul3eth3:~/snap$ ll /dev/cxl/afu0.0*
lrwxrwxrwx 1 root root      7 Jan 16 10:44 /dev/cxl/afu0.0 -> afu0.0s
crw-rw---- 1 root cxl  245, 2 Jan 16 10:44 /dev/cxl/afu0.0m
crw-rw-rw- 1 root cxl  245, 3 Jan 16 10:44 /dev/cxl/afu0.0s
haver@tul3eth3:~/snap$ ./software/tools/snap_maint -C0 -vv
[main] Enter
[main] Exit rc: 19
haver@tul3eth3:~/snap$ SNAP_TRACE=0xf ./software/tools/snap_maint -C0 -vv
[main] Enter
D hw_snap_card_alloc_dev Enter /dev/cxl/afu0.0m
D   hw_snap_card_alloc_dev: errinfo_size: 4096 VendorID: ffff DeviceID: ffff
D hw_snap_card_alloc_dev Exit Err
[main] Exit rc: 19
haver@tul3eth3:~/snap$

I increased the size gradually to get a long enough runtime to press Cntl+C manually. After aborting the data transfer that way the card turned to be unusable and not even snap_maint was capable to recover it. Just echo 1 > reset might help here.

haver@tul3eth3:~/snap$ ./actions/hdl_example/sw/snap_example -C0 -a2 -S10000 -N1
haver@tul3eth3:~/snap$ ./actions/hdl_example/sw/snap_example -C0 -a2 -S10000 -N100
^C
haver@tul3eth3:~/snap$ ./actions/hdl_example/sw/snap_example -C0 -a2 -S10000 -N1
ERROR: snap_card_alloc_dev(/dev/cxl/afu0.0s)
haver@tul3eth3:~/snap$ SNAP_TRACE=0xf ./software/tools/snap_maint -C0 -vv
[main] Enter
D hw_snap_card_alloc_dev Enter /dev/cxl/afu0.0m
D   hw_snap_card_alloc_dev: errinfo_size: 4096 VendorID: ffff DeviceID: ffff
.......D hw_snap_card_alloc_dev Exit Err
[main] Exit rc: 19
haver@tul3eth3:~/snap$

Some remarks about the occurrence likelihood: I saw this sometimes happen immediately, and it took 1-2 sec after the program abort, to really end the application. In those cases, the card turned out unusable and needed the full card reset sequence via sysfs interface. In some other cases, I aborted the program obviously in a not so critical situation, and it was still usable. I needed roughly about 5 times to make it die for sure. So do not be disappointed if the reproduction is not immediately successful, but try several times and you will see the problematic situation.

fhaverkamp commented 6 years ago

Recovery procedure

Unexpected program abort, which might happen if Cntl+C or any other unexpected program conditions occur, e.g. not handling a signal, programming errors, leading to sudden program abort or alike can cause the CAPI card not being accessible anymore. See tests above.

To recover the card, please do the following as super-user:

root@tul3eth3:# echo 100000 > /sys/kernel/debug/powerpc/eeh_max_freezes
root@tul3eth3:# echo 1 > /sys/class/cxl/card0/reset

Now wait a little until the card has performed the reset cycle and use snap_maint to get the card operational again. Repeat the experiment.

There might even be unexpected reboots of the host-system. If you observe that, please ensure that the latest firmware version is used.

If the suggested reset step above does not recover the card, please power off the system, and turn it on again to ensure full card/system recovery.

fhaverkamp commented 6 years ago

Results of todays testing of RC_llcmd branch containing Svens and Eberhards latest changes.

haver@tul3eth3:~/snap$ ./actions/hdl_example/tests/10140000_kill_test.sh -C0 -i1 -p32 -k3
********************************************************************
Parallel TEST for card 0 starting 32 processes
********************************************************************

[main] Enter
[snap_version] Enter
SNAP on N250S Card, NVME disabled, 4096 MB SRAM available.
SNAP FPGA Release: v1.2.2 Distance: 80 GIT: 0xc7cae948
SNAP FPGA Build (Y/M/D): 2018/01/21 Time (H:M): 19:16
SNAP FPGA CIR Master: 1 My ID: 0
SNAP FPGA Up Time: 52828 sec
[snap_version] Exit
[snap_m_init] Enter
SNAP FPGA Exploration already done (MSAT: 1 MAID: 1)

   Short |  Action Type |   Level   |
   ------+--------------+-----------+-----------
     0     0x10140000     0x00000000  IBM SNAP framework example

[snap_m_init] Exit rc: 0
[main] Exit rc: 0
(1) Check if card is replying to a single memcopy request (1) ... Mon Jan 22 11:08:50 CET 2018
(2) Perform massive stress and killing applications ...
Starting snap_example in the background ... 
ok
Waiting 3 seconds ...
.... ok
Sending SIGKILL to all ... 
Running:   17487 17488 17489 17490 17491 17492 17493 17494 17495 17496 17497 17498 17499 17500 17501 17502 17503 17504 17505 17506 17507 17508 17509 17510 17511 17512 17513 17514 17515 17516 17517 17518
Expected:  17487 17488 17489 17490 17491 17492 17493 17494 17495 17496 17497 17498 17499 17500 17501 17502 17503 17504 17505 17506 17507 17508 17509 17510 17511 17512 17513 17514 17515 17516 17517 17518
Still running: 
ok
(3) Check logfiles for string "err" ...
(4) Check if card is still replying to a single memcopy request ...
Error: Can not attach Action: 10140000
       Try to run snap_main tool
err: Single snap_example took to long, please review results!
haver@tul3eth3:~/snap$

First I see is that reconnection after aborting is still not possible.

haver@tul3eth3:~/snap$ SNAP_TRACE=0xff ./actions/hdl_example/sw/snap_example -C0 -S1000 -N1 -I -vv
Open Card: 0
D hw_snap_card_alloc_dev Enter /dev/cxl/afu0.0s
D   hw_snap_card_alloc_dev: errinfo_size: 4096 VendorID: 1014 DeviceID: 632
D hw_snap_card_alloc_dev Exit 0x1002e620420 OK Context: 0 Master: 0
D   hw_card_ioctl CARD_TYPE: 1
D   hw_card_ioctl Get MEM: 4096 MB
SNAP on N250S Card, 4096 MB of Card Ram avilable.
R   hw_snap_mmio_read64(0x1002e620420, a0, 0) 0
Start of Action: 1 Card Handle: 0x1002e620420 Context: 0
D hw_attach_action Enter Action: 0x10140000 Old Action: ffffffff Flags: 0x10001 Base: 0 timeout: 5 sec Seq: 0
R   hw_snap_mmio_read64(0x1002e620420, 18, 100) 0
R   hw_snap_mmio_read64(0x1002e620420, 100, 10140000) 0
R   hw_snap_mmio_write64(0x1002e620420, 1000, f000000000000005)
R   hw_snap_mmio_write64(0x1002e620420, 1010, f000000000000001)
D   hw_wait_irq: Enter fd: 3 Flags: 0x10001 Expect irq: 2 Timeout: 5 sec
D     Timeout......
D   hw_wait_irq: Exit fd: 3 rc: 16
D hw_attach_action Exit rc: 16 Action: (nil) Base: 0x0
Error: Can not attach Action: 10140000
       Try to run snap_main tool
Free Card Handle: 0x1002e620420
End of Test rc: 1
haver@tul3eth3:~/snap$

So we get a timeout when trying to attach the action. Last time we tried it was not able to open the card at all. So that is a change/improvement over the last experiment. Repeating was possible after sysfs reset of the card. Still same effect, we could open the card, but reattaching the action is not working yet.

Here the 2nd try, looks the same like the 1st one:

haver@tul3eth3:~/snap$ SNAP_TRACE=0xff ./actions/hdl_example/sw/snap_example -C0 -S1000 -N1 -I -vv
Open Card: 0
D hw_snap_card_alloc_dev Enter /dev/cxl/afu0.0s
D   hw_snap_card_alloc_dev: errinfo_size: 4096 VendorID: 1014 DeviceID: 632
D hw_snap_card_alloc_dev Exit 0x10034760420 OK Context: 0 Master: 0
D   hw_card_ioctl CARD_TYPE: 1
D   hw_card_ioctl Get MEM: 4096 MB
SNAP on N250S Card, 4096 MB of Card Ram avilable.
R   hw_snap_mmio_read64(0x10034760420, a0, 0) 0
Start of Action: 1 Card Handle: 0x10034760420 Context: 0
D hw_attach_action Enter Action: 0x10140000 Old Action: ffffffff Flags: 0x10001 Base: 0 timeout: 5 sec Seq: 0
R   hw_snap_mmio_read64(0x10034760420, 18, 100) 0
R   hw_snap_mmio_read64(0x10034760420, 100, 10140000) 0
R   hw_snap_mmio_write64(0x10034760420, 1000, f000000000000005)
R   hw_snap_mmio_write64(0x10034760420, 1010, f000000000000001)
D   hw_wait_irq: Enter fd: 3 Flags: 0x10001 Expect irq: 2 Timeout: 5 sec
D     Timeout......
D   hw_wait_irq: Exit fd: 3 rc: 16
D hw_attach_action Exit rc: 16 Action: (nil) Base: 0x0
Error: Can not attach Action: 10140000
       Try to run snap_main tool
Free Card Handle: 0x10034760420
End of Test rc: 1
haver@tul3eth3:~/snap$

fhaverkamp commented 6 years ago

dmsg looks like this, to whom it may concern. I do not see any spectacular hint what is wrong:

[Fri Jan 19 17:09:56 2018] ------------[ cut here ]------------
[Fri Jan 19 17:09:56 2018] WARNING: at /build/linux-yAniq4/linux-4.4.0/drivers/misc/cxl/native.c:866
[Fri Jan 19 17:09:56 2018] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc kvm_hv kvm_pr kvm ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables powernv_rng ipmi_powernv ipmi_msghandler leds_powernv uio_pdrv_genirq ibmpowernv uio vmx_crypto autofs4 ses enclosure scsi_transport_sas cxl ipr
[Fri Jan 19 17:09:56 2018] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G        W       4.4.0-105-generic #128-Ubuntu
[Fri Jan 19 17:09:56 2018] task: c000000ff087b3d0 ti: c000001ffff88000 task.ti: c000000ff08f4000
[Fri Jan 19 17:09:56 2018] NIP: d00000000eb24ea4 LR: d00000000eb24ea0 CTR: c0000000005a0b40
[Fri Jan 19 17:09:56 2018] REGS: c000001ffff8baf0 TRAP: 0700   Tainted: G        W        (4.4.0-105-generic)
[Fri Jan 19 17:09:56 2018] MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 28002444  XER: 20000000
[Fri Jan 19 17:09:56 2018] CFAR: c000000000002160 SOFTE: 0 
                           GPR00: d00000000eb24ea0 c000001ffff8bd70 d00000000eb50868 000000000000001d 
                           GPR04: c000000ff8849c50 c000000ff885b4e0 000000000000002d 00000000000004c3 
                           GPR08: 0000000000000007 0000000000000000 0000000ff7880000 0000000000000006 
                           GPR12: 0000000000002200 c00000000fb40980 c00000000163ef18 c000000000fdd698 
                           GPR16: 0000000000000001 c000000ff08f4080 c000000000fc9c28 0000000000000008 
                           GPR20: c000000ff08f4000 c0000000015894f8 0000000000000001 0000000000000002 
                           GPR24: 00000000000001c8 c000001e47b22600 0000000000000000 0000000000000000 
                           GPR28: 0000000000000000 00000000000001c8 00000000000001c8 c000001e47d27800 
[Fri Jan 19 17:09:56 2018] NIP [d00000000eb24ea4] native_slice_irq_err+0x324/0x330 [cxl]
[Fri Jan 19 17:09:56 2018] LR [d00000000eb24ea0] native_slice_irq_err+0x320/0x330 [cxl]
[Fri Jan 19 17:09:56 2018] Call Trace:
[Fri Jan 19 17:09:56 2018] [c000001ffff8bd70] [d00000000eb24ea0] native_slice_irq_err+0x320/0x330 [cxl] (unreliable)
[Fri Jan 19 17:09:56 2018] [c000001ffff8be10] [c000000000131940] handle_irq_event_percpu+0xa0/0x320
[Fri Jan 19 17:09:56 2018] [c000001ffff8bed0] [c000000000131c28] handle_irq_event+0x68/0xc0
[Fri Jan 19 17:09:56 2018] [c000001ffff8bf00] [c000000000136f4c] handle_fasteoi_irq+0xec/0x2b0
[Fri Jan 19 17:09:56 2018] [c000001ffff8bf30] [c000000000130b24] generic_handle_irq+0x54/0x80
[Fri Jan 19 17:09:56 2018] [c000001ffff8bf60] [c000000000011260] __do_irq+0x80/0x1d0
[Fri Jan 19 17:09:56 2018] [c000001ffff8bf90] [c000000000024860] call_do_irq+0x14/0x24
[Fri Jan 19 17:09:56 2018] [c000000ff08f7a50] [c000000000011448] do_IRQ+0x98/0x140
[Fri Jan 19 17:09:56 2018] [c000000ff08f7aa0] [c000000000002594] hardware_interrupt_common+0x114/0x180
[Fri Jan 19 17:09:56 2018] --- interrupt: 501 at snooze_loop+0x160/0x1b0
                               LR = snooze_loop+0x180/0x1b0
[Fri Jan 19 17:09:56 2018] [c000000ff08f7d90] [c000000ff08f7dd0] 0xc000000ff08f7dd0 (unreliable)
[Fri Jan 19 17:09:56 2018] [c000000ff08f7dd0] [c000000000934180] cpuidle_enter_state+0x160/0x410
[Fri Jan 19 17:09:56 2018] [c000000ff08f7e30] [c00000000011adf8] call_cpuidle+0x78/0xd0
[Fri Jan 19 17:09:56 2018] [c000000ff08f7e70] [c00000000011b1cc] cpu_startup_entry+0x37c/0x490
[Fri Jan 19 17:09:56 2018] [c000000ff08f7f30] [c0000000000456ec] start_secondary+0x33c/0x360
[Fri Jan 19 17:09:56 2018] [c000000ff08f7f90] [c000000000008b6c] start_secondary_prolog+0x10/0x14
[Fri Jan 19 17:09:56 2018] Instruction dump:
[Fri Jan 19 17:09:56 2018] 8129fff8 2f890001 419efe6c 4bfffea4 60000000 60000000 60420000 3ce20000 
[Fri Jan 19 17:09:56 2018] 7fc4f378 e8678298 4801980d e8410018 <0fe00000> 4bfffd20 60420000 3c4c0003 
[Fri Jan 19 17:09:56 2018] ---[ end trace 394d3de5c41c64fa ]---
[Fri Jan 19 17:09:56 2018] cxl afu0.0: PSL_SERR_An: 0x0800000000002801
[Fri Jan 19 17:09:56 2018] cxl afu0.0: PSL_FIR_SLICE_An: 0x0000000000000000
[Fri Jan 19 17:09:56 2018] cxl afu0.0: CXL_PSL_ErrStat_An: 0x0000000000000000
[Fri Jan 19 17:09:56 2018] cxl afu0.0: CXL_PSL_AFU_DEBUG_An: 0x0000000000000000
[Sun Jan 21 20:26:39 2018] cxl-pci 0002:01:00.0: CXL reset
....
[Mon Jan 22 11:23:31 2018] EEH: Fenced PHB#2 detected, location: N/A
[Mon Jan 22 11:23:31 2018] Fatal Hypervisor Maintenance interrupt [Recovered]
[Mon Jan 22 11:23:31 2018]  Error detail: Malfunction Alert
[Mon Jan 22 11:23:31 2018]  HMER: 8040000000000000
[Mon Jan 22 11:23:31 2018]  Unknown Core check stop.
[Mon Jan 22 11:23:31 2018] Harmless Hypervisor Maintenance interrupt [Recovered]
[Mon Jan 22 11:23:31 2018]  Error detail: Unknown
[Mon Jan 22 11:23:31 2018]  HMER: 8040000000000000
[Mon Jan 22 11:23:31 2018] Fatal Hypervisor Maintenance interrupt [Recovered]
[Mon Jan 22 11:23:31 2018]  Error detail: Malfunction Alert
[Mon Jan 22 11:23:31 2018]  HMER: 8040000000000000
[Mon Jan 22 11:23:31 2018]  Unknown Core check stop.
[Mon Jan 22 11:23:31 2018] Harmless Hypervisor Maintenance interrupt [Recovered]
[Mon Jan 22 11:23:31 2018]  Error detail: Unknown
[Mon Jan 22 11:23:31 2018]  HMER: 8040000000000000
[Mon Jan 22 11:23:31 2018] EEH: This PCI device has failed 3 times in the last hour
[Mon Jan 22 11:23:31 2018] EEH: Notify device drivers to shutdown
[Mon Jan 22 11:23:31 2018] cxl-pci 0002:01:00.0: reflashing, so opting out of EEH!
[Mon Jan 22 11:23:31 2018] EEH: Collect temporary log
[Mon Jan 22 11:23:31 2018] PHB3 PHB#2 Diag-data (Version: 1)
[Mon Jan 22 11:23:31 2018] brdgCtl:     0000ffff
[Mon Jan 22 11:23:31 2018] UtlSts:      00200000 00000000 00000000
[Mon Jan 22 11:23:31 2018] RootSts:     ffffffff ffffffff ffffffff ffffffff 0000ffff
[Mon Jan 22 11:23:31 2018] RootErrSts:  ffffffff ffffffff ffffffff
[Mon Jan 22 11:23:31 2018] RootErrLog:  ffffffff ffffffff ffffffff ffffffff
[Mon Jan 22 11:23:31 2018] RootErrLog1: ffffffff 0000000000000000 0000000000000000
[Mon Jan 22 11:23:31 2018] nFir:        0000001000000000 0030006e00000000 0000001000000000
[Mon Jan 22 11:23:31 2018] PhbSts:      0000001c00000000 0000001c00000000
[Mon Jan 22 11:23:31 2018] OutErr:      0200000000000000 0200000000000000 1210056000020000 0000a00000000000
[Mon Jan 22 11:23:31 2018] PE[  0] A/B: 8000000000000000 8000000000000000
[Mon Jan 22 11:23:31 2018] EEH: Reset with hotplug activity
[Mon Jan 22 11:23:31 2018] pci_bus 0004:00: busn_res: [bus 00] is released
[Mon Jan 22 11:23:31 2018] cxl afu0.0: Deactivating AFU directed mode
[Mon Jan 22 11:23:31 2018] cxl afu0.0: PSL Purge called with link down, ignoring
[Mon Jan 22 11:23:31 2018] pci_bus 0002:01: busn_res: [bus 01] is released
[Mon Jan 22 11:23:38 2018] EEH: Sleep 5s ahead of complete hotplug
[Mon Jan 22 11:23:43 2018] pci 0002:00:00.0: [1014:03dc] type 01 class 0x060400
[Mon Jan 22 11:23:43 2018] pci 0002:00:00.0: PME# supported from D0 D3hot D3cold
[Mon Jan 22 11:23:43 2018] pci 0002:01:00.0: [1014:0477] type 00 class 0x1200ff
[Mon Jan 22 11:23:43 2018] pci 0002:01:00.0: reg 0x10: [mem 0x240000000000-0x24000fffffff 64bit pref]
[Mon Jan 22 11:23:43 2018] pci 0002:01:00.0: reg 0x18: [mem 0x240010000000-0x24001001ffff 64bit pref]
[Mon Jan 22 11:23:43 2018] pci 0002:01:00.0: reg 0x20: [mem 0x00000000-0x3fffffffff 64bit pref]
[Mon Jan 22 11:23:43 2018] pci 0002:00:00.0: PCI bridge to [bus 01]
[Mon Jan 22 11:23:43 2018] pci 0002:00:00.0:   bridge window [mem 0x240000000000-0x2400ffffffff 64bit pref]
[Mon Jan 22 11:23:43 2018] pci 0002:01:00.0: disabling BAR 4: [mem size 0x4000000000 64bit pref] (bad alignment 0x4000000000)
[Mon Jan 22 11:23:43 2018] pci 0002:00:00.0: BAR 15: assigned [mem 0x240000000000-0x2400ffffffff 64bit pref]
[Mon Jan 22 11:23:43 2018] pci 0002:01:00.0: BAR 0: assigned [mem 0x240000000000-0x24000fffffff 64bit pref]
[Mon Jan 22 11:23:43 2018] pci 0002:01:00.0: BAR 2: assigned [mem 0x240010000000-0x24001001ffff 64bit pref]
[Mon Jan 22 11:23:43 2018] pci 0002:00:00.0: PCI bridge to [bus 01]
[Mon Jan 22 11:23:43 2018] pci 0002:00:00.0:   bridge window [mem 0x240000000000-0x2400ffffffff 64bit pref]
[Mon Jan 22 11:23:43 2018] cxl-pci 0002:01:00.0: enabling device (0140 -> 0142)
[Mon Jan 22 11:23:43 2018] cxl-pci 0002:01:00.0: switch card to CXL
[Mon Jan 22 11:23:43 2018] pci 0002:01     : [PE# 000] Switching PHB to CXL
[Mon Jan 22 11:23:43 2018] pci 0002:01     : [PE# 000] Switching PHB to CXL
[244385.496029] PCI: Memory resource 0 not set for host bridge /pciex@3fffe40400000/pci@0/device@0 (domain 4)
[Mon Jan 22 11:23:43 2018] PCI: I/O resource not set for host bridge /pciex@3fffe40400000/pci@0/device@0 (domain 4)
[Mon Jan 22 11:23:43 2018] PCI: Memory resource 0 not set for host bridge /pciex@3fffe40400000/pci@0/device@0 (domain 4)
[Mon Jan 22 11:23:43 2018] cxl-pci 0002:01:00.0: PCI host bridge to bus 0004:00
[Mon Jan 22 11:23:43 2018] pci_bus 0004:00: root bus resource [bus 00]
[Mon Jan 22 11:23:43 2018] pci_bus 0004:00: busn_res: [bus 00] end is updated to ff
[Mon Jan 22 11:23:43 2018] pci 0004:00:00.0: [1014:0632] type 00 class 0x120000
[Mon Jan 22 11:23:43 2018] pci_bus 0004:00: busn_res: [bus 00-ff] end is updated to 00
[Mon Jan 22 11:23:43 2018] cxl afu0.0: Activating AFU directed mode
[Mon Jan 22 11:23:43 2018] EEH: Notify device driver to resume

jsvogt commented 6 years ago

Testcase is done, just new test results. Let's close this. If other new bugs show up during testing, please open a new issue.

fhaverkamp commented 6 years ago

Should be fine to close this.

open-power / snap

Create Cntl+C/Abort Testcase for SNAP #536

Recovery procedure