PCI passthrough: Frozen PE / EEH recovery happens in the host if driver is loaded after the guest is shutdown and device is reattached to the host

mfoliveira commented 7 years ago

Scenario: PCI passthrough of the SAS3008-based PCIe adapter in the 8001-22C system.

# lspci -nnv -s 1:3:0.0 | head -n2
0001:03:00.0 Serial Attached SCSI controller [0107]: LSI Logic / Symbios Logic SAS3008 PCI-Express Fusion-MPT SAS-3 [1000:0097] (rev 02)
Subsystem: Super Micro Computer Inc Device [15d9:0808]

Steps to reproduce:

1) Host: detach the adapter (virsh nodedev-detach pci_0001_03_00_0)
2) Host: start a guest with PCI passthrough (virsh start --console <guest>)
3) Guest: load the driver (initializes the adapter, scans for disks, etc) (modprobe mpt3sas)
4) Guest: shutdown (poweroff)
5) Host: reattach the adapter (virsh nodedev-reattach pci_0001_03_00_0)
6) Host: load the driver (starts to init the adapter and hits Frozen PE/EEH recovery) (modprobe mpt3sas)
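
The reproduction steps can also be written down as a dry-run plan, which is handy when repeating the test several times. This is only a minimal sketch: the helper name `repro_plan` and the guest name `guest1` are hypothetical, the commands are the ones quoted in the steps, and the guest-side steps are listed rather than executed because they run inside the VM.

```python
from typing import List, Tuple

Step = Tuple[str, List[str]]  # (where the command runs, argv)


def repro_plan(nodedev: str, guest: str) -> List[Step]:
    """Return the reproduction steps, in order, as (where, argv) pairs.

    Nothing is executed here; the caller decides how to run each step.
    """
    return [
        ("host", ["virsh", "nodedev-detach", nodedev]),
        ("host", ["virsh", "start", "--console", guest]),
        ("guest", ["modprobe", "mpt3sas"]),
        ("guest", ["poweroff"]),
        ("host", ["virsh", "nodedev-reattach", nodedev]),
        # The Frozen PE / EEH recovery is reported during this last step.
        ("host", ["modprobe", "mpt3sas"]),
    ]


if __name__ == "__main__":
    # "guest1" is a placeholder guest name, not taken from the report.
    plan = repro_plan("pci_0001_03_00_0", "guest1")
    for number, (where, argv) in enumerate(plan, start=1):
        print(f"{number}) {where}: {' '.join(argv)}")
```

Keeping the plan as data makes it easy to log each step next to the matching dmesg output on the host.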

During driver initialization the following Frozen PE / EEH recovery is consistently observed. There is an Oops in the driver code afterward, but that's another problem which I'll be looking at.

Decoding the PEST bits shows that this is a DMA write with an invalid page access. The suspicion is that operations or configuration from the guest are still pending, and since the PE was not reset in a way that actually clears them in this adapter, the problem is hit.

In that scenario, this problem is expected to be resolved by the patch series that was applied downstream on PowerKVM [1] and is now being worked on as a VFIO-based approach by @aik.

[1] https://lists.ozlabs.org/pipermail/linuxppc-dev/2015-February/124867.html

[  759.825059] mpt3sas 0001:03:00.0: enabling device (0400 -> 0402)
[  759.825165] mpt3sas 0001:03:00.0: Using 64-bit DMA iommu bypass
[  759.825223] mpt3sas_cm0: 64 BIT PCI BUS DMA ADDRESSING SUPPORTED, total mem (535679552 kB)
[  759.882919] mpt3sas_cm0: MSI-X vectors supported: 96, no of cores: 16, max_msix_vectors: -1
[  759.883772] mpt3sas0-msix0: PCI-MSI-X enabled: IRQ 706
[  759.883819] mpt3sas0-msix1: PCI-MSI-X enabled: IRQ 707
[  759.883863] mpt3sas0-msix2: PCI-MSI-X enabled: IRQ 708
[  759.883906] mpt3sas0-msix3: PCI-MSI-X enabled: IRQ 709
[  759.883949] mpt3sas0-msix4: PCI-MSI-X enabled: IRQ 710
[  759.883993] mpt3sas0-msix5: PCI-MSI-X enabled: IRQ 711
[  759.884035] mpt3sas0-msix6: PCI-MSI-X enabled: IRQ 712
[  759.884080] mpt3sas0-msix7: PCI-MSI-X enabled: IRQ 713
[  759.884123] mpt3sas0-msix8: PCI-MSI-X enabled: IRQ 714
[  759.884166] mpt3sas0-msix9: PCI-MSI-X enabled: IRQ 715
[  759.884210] mpt3sas0-msix10: PCI-MSI-X enabled: IRQ 716
[  759.884297] mpt3sas0-msix11: PCI-MSI-X enabled: IRQ 717
[  759.884339] mpt3sas0-msix12: PCI-MSI-X enabled: IRQ 718
[  759.884382] mpt3sas0-msix13: PCI-MSI-X enabled: IRQ 719
[  759.884427] mpt3sas0-msix14: PCI-MSI-X enabled: IRQ 720
[  759.884471] mpt3sas0-msix15: PCI-MSI-X enabled: IRQ 721
[  759.884516] mpt3sas_cm0: iomem(0x00003fe080140000), mapped(0xd0000800810a0000), size(65536)
[  759.884582] mpt3sas_cm0: ioport(0x0000000000000000), size(0)
[  759.975501] mpt3sas_cm0: Allocated physical memory: size(8887 kB)
[  759.975563] mpt3sas_cm0: Current Controller Queue Depth(2936),Max Controller Queue Depth(3072)
[  759.975636] mpt3sas_cm0: Scatter Gather Elements per IO(128)
[  760.021015] EEH: Frozen PE#fd on PHB#1 detected
[  760.021106] EEH: PE location: PLX Slot1, PHB location: N/A
[  760.021873] EEH: This PCI device has failed 1 times in the last hour
[  760.021927] EEH: Notify device drivers to shutdown
[  760.021970] mpt3sas_cm0: PCI error: detected callback, state(2)!!
[  760.022317] EEH: Collect temporary log
[  760.022378] EEH: of node=0001:03:00.0
[  760.022414] EEH: PCI device/vendor: 00971000
[  760.022461] EEH: PCI cmd/status register: 00180142
[  760.022503] EEH: PCI-E capabilities and status follow:
[  760.022558] EEH: PCI-E 00: 0002a810 10008025 0000281e 00415083 
[  760.022620] EEH: PCI-E 10: 10830000 00000000 00000000 00000000 
[  760.022675] EEH: PCI-E 20: 00000000 
[  760.022706] EEH: PCI-E AER capability register set follows:
[  760.022758] EEH: PCI-E AER 00: 1e020001 00000000 00000000 00462031 
[  760.022821] EEH: PCI-E AER 10: 00000000 00002000 000001e0 00000000 
[  760.022881] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000 
[  760.022935] EEH: PCI-E AER 30: 00000000 00000000 
[  760.022979] PHB3 PHB#1 Diag-data (Version: 1)
[  760.023022] brdgCtl:     00000002
[  760.023059] RootSts:     0000000f 00400000 b0830008 00100147 00002000
[  760.023112] PhbSts:      0000001c00000000 0000001c00000000
[  760.023156] Lem:         0000000004000000 42498e367f502eae 0000000000000000
[  760.023210] InAErr:      0000000000004000 0000000000004000 00000000612400fd 04000000000000fd
[  760.023284] PE[253] A/B: 8000302500000000 8000000061240000
[  760.023325] EEH: Reset without hotplug activity
[  762.174778] EEH: Notify device drivers the completion of reset
[  762.174860] mpt3sas_cm0: PCI error: slot reset callback!!
[  762.174985] mpt3sas 0001:03:00.0: Using 64-bit DMA iommu bypass
[  762.175044] mpt3sas_cm0: 64 BIT PCI BUS DMA ADDRESSING SUPPORTED, total mem (535679552 kB)
[  762.232259] mpt3sas_cm0: MSI-X vectors supported: 96, no of cores: 16, max_msix_vectors: -1
[  762.233046] mpt3sas0-msix0: PCI-MSI-X enabled: IRQ 706
[  762.233091] mpt3sas0-msix1: PCI-MSI-X enabled: IRQ 707
[  762.233135] mpt3sas0-msix2: PCI-MSI-X enabled: IRQ 708
[  762.233179] mpt3sas0-msix3: PCI-MSI-X enabled: IRQ 709
[  762.233223] mpt3sas0-msix4: PCI-MSI-X enabled: IRQ 710
[  762.233266] mpt3sas0-msix5: PCI-MSI-X enabled: IRQ 711
[  762.233309] mpt3sas0-msix6: PCI-MSI-X enabled: IRQ 712
[  762.233352] mpt3sas0-msix7: PCI-MSI-X enabled: IRQ 713
[  762.233395] mpt3sas0-msix8: PCI-MSI-X enabled: IRQ 714
[  762.233439] mpt3sas0-msix9: PCI-MSI-X enabled: IRQ 715
[  762.233482] mpt3sas0-msix10: PCI-MSI-X enabled: IRQ 716
[  762.233525] mpt3sas0-msix11: PCI-MSI-X enabled: IRQ 717
[  762.233569] mpt3sas0-msix12: PCI-MSI-X enabled: IRQ 718
[  762.233612] mpt3sas0-msix13: PCI-MSI-X enabled: IRQ 719
[  762.233656] mpt3sas0-msix14: PCI-MSI-X enabled: IRQ 720
[  762.233699] mpt3sas0-msix15: PCI-MSI-X enabled: IRQ 721
[  762.233743] mpt3sas_cm0: iomem(0x00003fe080140000), mapped(0xd0000800813b0000), size(65536)
[  762.233806] mpt3sas_cm0: ioport(0x0000000000000000), size(0)
[  762.234135] mpt3sas_cm0: _base_event_notification: timeout
[  762.234182] mf:
[  762.234204] 07000000 00000000 00000000 00000000 00000000 0f2f7fff ffffff7c ffffffff
[  762.234339] ffffffff 00000000 00000000
[  762.236160] Unable to handle kernel paging request for data at address 0xd0000800813b0030
[  762.236230] Faulting instruction address: 0xd000000031fb072c
[  762.236286] Oops: Kernel access of bad area, sig: 11 [#1]
[  762.236329] SMP NR_CPUS=1024 [  762.236351] NUMA 
[  762.236374] PowerNV
[  762.236399] Modules linked in: mpt3sas raid_class scsi_transport_sas vhost_net vhost macvtap macvlan ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_mangle ip6table_security ip6table_raw iptable_nat iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_defrag_ipv6 nf_nat nf_conntrack libcrc32c at24 nvmem_core ofpart ipmi_powernv powernv_flash ipmi_msghandler opal_prd mtd i2c_opal kvm_hv nfsd kvm_pr auth_rpcgss oid_registry nfs_acl lockd kvm grace sunrpc joydev ast i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops i40e ttm ixgbe mdio ptp drm pps_core i2c_core [last unloaded: raid_class]
[  762.237373] CPU: 8 PID: 779 Comm: eehd Tainted: G        W       4.9.0-4.el7.centos.ppc64le #1
[  762.237448] task: c000003fcf301500 task.stack: c000003fcf384000
[  762.237501] NIP: d000000031fb072c LR: d000000031fb070c CTR: c000000000115490
[  762.237564] REGS: c000003fcf3874b0 TRAP: 0300   Tainted: G        W        (4.9.0-4.el7.centos.ppc64le)
[  762.237638] MSR: 900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>[  762.237811]   CR: 24002084  XER: 20000000
[  762.237844] CFAR: c000000000a276a8 DAR: d0000800813b0030 DSISR: 40000000 SOFTE: 1 
GPR00: d000000031fb070c c000003fcf387730 d000000031fef390 d0000800813b0030 
GPR04: c000003fcf301500 0000000003fde404 00000060e3c47241 0000000000000000 
GPR08: c000003fed20ed00 d0000800813b0000 0000000000000000 00000000ffffffff 
GPR12: 0000000000002200 c00000000fdc4800 c0000000000fbd18 c000007949100040 
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
GPR24: c000003fcf387920 0000000000000003 0000000000000005 0000000040000000 
GPR28: 0000000000001388 00000000c0000000 c000001f04e84810 0000000000000001 
NIP [d000000031fb072c] _base_wait_for_doorbell_ack+0x8c/0x1f0 [mpt3sas]
[  762.238822] LR [d000000031fb070c] _base_wait_for_doorbell_ack+0x6c/0x1f0 [mpt3sas]
[  762.238886] Call Trace:
[  762.238914] [c000003fcf387730] [d000000031fb070c] _base_wait_for_doorbell_ack+0x6c/0x1f0 [mpt3sas] (unreliable)
[  762.239015] [c000003fcf3877c0] [d000000031fb1c6c] _base_handshake_req_reply_wait+0x15c/0x7e0 [mpt3sas]
[  762.243871] [c000003fcf387880] [d000000031fb689c] _base_get_ioc_facts+0x10c/0x460 [mpt3sas]
[  762.250568] mpt3sas_cm0: failure at drivers/scsi/mpt3sas/mpt3sas_scsih.c:8830/_scsih_probe()!
[  762.260515] [c000003fcf387950] [d000000031fb96d8] mpt3sas_base_hard_reset_handler+0x2c8/0x600 [mpt3sas]
[  762.270219] [c000003fcf387a30] [d000000031fbeba4] scsih_pci_slot_reset+0xa4/0x100 [mpt3sas]
[  762.278537] [c000003fcf387ab0] [c000000000042d48] eeh_report_reset+0x128/0x170
[  762.285474] [c000003fcf387b00] [c000000000041128] eeh_pe_dev_traverse+0x98/0x170
[  762.292412] [c000003fcf387b90] [c00000000004347c] eeh_handle_normal_event+0x3ec/0x510
[  762.300722] [c000003fcf387c30] [c000000000043858] eeh_handle_event+0x178/0x360
[  762.307665] [c000003fcf387ce0] [c000000000043bf8] eeh_event_handler+0x1b8/0x1c0
[  762.314598] [c000003fcf387d80] [c0000000000fbe20] kthread+0x110/0x130
[  762.321520] [c000003fcf387e30] [c00000000000c360] ret_from_kernel_thread+0x5c/0x7c
[  762.328465] Instruction dump:
[  762.332603] 40820074 386003e8 388005dc 48028219 e8410018 393f0001 7f9c4840 793f0020 
[  762.339539] 41de010c e93e00a8 38690030 7c0004ac <81290030> 0c090000 4c00012c 2f89ffff 
[  762.430835] ---[ end trace ee34b74dd6657653 ]---
[  762.430881]

$ ./pest 8000302500000000 8000000061240000
Transaction type: DMA Write
TCE Page Fault
TCE Access Fault
LEM Bit Number 37
Requestor 0:0.0
MSI Data 0x0000
Fault Address = 0x0000000061240000
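
The decoded output can be cross-checked against the kernel log with a few lines of Python. This is a minimal sketch, not the `pest` tool itself: it only parses the `PE[253] A/B` and `Frozen PE#fd` lines quoted in the log and confirms they name the same PE. The helper names are hypothetical, and the fault-address step is an assumption (clearing only the top bit of the B word) that happens to reproduce the address printed for this one sample; it is not taken from the PHB3 documentation.

```python
import re

DIAG_LINE = "PE[253] A/B: 8000302500000000 8000000061240000"
FROZEN_LINE = "EEH: Frozen PE#fd on PHB#1 detected"


def parse_pe_diag(line: str):
    """Extract (pe_number, pest_a, pest_b) from a 'PE[n] A/B: ...' line."""
    m = re.search(r"PE\[(\d+)\] A/B: ([0-9a-fA-F]{16}) ([0-9a-fA-F]{16})", line)
    if m is None:
        raise ValueError("not a PE A/B diag line")
    return int(m.group(1)), int(m.group(2), 16), int(m.group(3), 16)


def parse_frozen_pe(line: str):
    """Extract (pe_number, phb_number) from a 'Frozen PE#x on PHB#y' line.

    Both numbers are read as hexadecimal, matching 'fd' in the log.
    """
    m = re.search(r"Frozen PE#([0-9a-fA-F]+) on PHB#([0-9a-fA-F]+)", line)
    if m is None:
        raise ValueError("not a Frozen PE line")
    return int(m.group(1), 16), int(m.group(2), 16)


def fault_address_guess(pest_b: int) -> int:
    """ASSUMPTION: treat only bit 63 of the B word as a flag and drop it."""
    return pest_b & ((1 << 63) - 1)


if __name__ == "__main__":
    pe, pest_a, pest_b = parse_pe_diag(DIAG_LINE)
    frozen_pe, phb = parse_frozen_pe(FROZEN_LINE)
    print(f"diag PE={pe}, frozen PE={frozen_pe}, PHB={phb}")
    print(f"fault address (guess) = {fault_address_guess(pest_b):#018x}")
```

For a full decode of the transaction type and fault bits, the `pest` tool shown above remains the reference.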

rmatinata-ibm commented 7 years ago

@aik Can you please take a look at this as soon as possible? Thank you! @paulusmack @laggarcia @bjking1 @sgarfinkle FYI

aik commented 7 years ago

On my local setup it is enough to boot the host without the mpt3sas driver and then simply do "modprobe mpt3sas" - there is an EEH exactly as reported here. Will continue on Monday...

mfoliveira commented 7 years ago

Interesting.

Btw, if that Oops after the EEH annoys you, I already have a patch for it, to be submitted; just let me know and I'll send it to you.

aik commented 7 years ago

Yes please send the patch. Thanks.

ps. I just cannot make mpt3sas load on the upstream kernel at all. hm.

mfoliveira commented 7 years ago

@aik just sent via e-mail.

aik commented 7 years ago

Thanks. Did not help though, it still crashes, slightly different, the net result is the same - mpt3sas does not bind to the device.

update. Turns out multilevel TCE tables for 32bit DMA do not work properly. Hm. Disabled them and can proceed.

How much RAM does the guest in the test get?

aik commented 7 years ago

In the meantime, could you try patching QEMU like this?

diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c
index 9d090270f6..80a5d0e3dd 100644
--- a/hw/vfio/spapr.c
+++ b/hw/vfio/spapr.c
@@ -166,7 +166,7 @@ int vfio_spapr_create_window(VFIOContainer *container,
     entries = create.window_size >> create.page_shift;
     pages = MAX((entries * sizeof(uint64_t)) / getpagesize(), 1);
     pages = MAX(pow2ceil(pages) - 1, 1); /* Round up */
-    create.levels = ctz64(pages) / 6 + 1;
+    create.levels = 1;//ctz64(pages) / 6 + 1;

     ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
     if (ret) {
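
To see what the original `create.levels` line computes, the arithmetic in the hunk can be ported to Python. This is a sketch under stated assumptions: `pow2ceil` and `ctz64` are re-implemented here to mirror the QEMU helpers of the same name, and the window size (32 GiB), the IOMMU page shift (16) and the host page size (64 KiB) in the example are illustrative values, not taken from the report.

```python
def pow2ceil(n: int) -> int:
    """Smallest power of two >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()


def ctz64(n: int) -> int:
    """Number of trailing zero bits of a nonzero value."""
    return (n & -n).bit_length() - 1


def tce_levels(window_size: int, page_shift: int, host_page_size: int) -> int:
    """Port of the levels computation shown in the hunk above."""
    entries = window_size >> page_shift
    # 8 bytes per TCE entry, i.e. sizeof(uint64_t) in the C code.
    pages = max((entries * 8) // host_page_size, 1)
    pages = max(pow2ceil(pages) - 1, 1)  # "Round up" line, as transcribed
    return ctz64(pages) // 6 + 1


if __name__ == "__main__":
    # Assumed example: 32 GiB window, 64 KiB IOMMU pages, 64 KiB host pages.
    print(tce_levels(32 << 30, 16, 65536))
```

As transcribed, `pow2ceil(pages) - 1` is either 1 or one less than a larger power of two, so it is always odd and has no trailing zero bits; this port therefore returns 1 for any input.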

mfoliveira commented 7 years ago

@aik

> Thanks. Did not help though, it still crashes, slightly different, the net result is the same - mpt3sas does not bind to the device.

Surprised; this patch fixed this problem for us in several tests. Can you please send me the Oops log? (at least stack trace + NIP/LR)

> update. Turns out multilevel TCE tables for 32bit DMA do not work properly. Hm. Disabled them and can proceed.

Cool. But wasn't this adapter/driver doing 64-bit DMA? And, please, where can I find out more about multilevel TCE tables? (source files are OK; if there are docs, those are more than welcome).

> How much RAM does the guest in the test get?

32 GiB.

> In the meantime, could you try patching QEMU like this?

Yes, I'll set up a local box to try it (the original one is in a client's network and not accessible). It might take a while but I'll do it.

Thanks!

aik commented 7 years ago

> Surprised; this patch fixed this problem for us in several tests. Can you please send me the Oops log? (at least stack trace + NIP/LR)

Does not make much sense. In fact, everything trying to use 4-level TCE tables fails with EEH, while 3 levels are OK. It has not been noticed so far because 32-bit windows use only 1 level by default and most devices are 64-bit only anyway. Only my test branch exposed the problem, which seems to be unrelated to what this bug is about.

> Cool. But wasn't this adapter/driver doing 64-bit DMA?

It is using 32-bit for the coherent mask and 64-bit for noncoherent: different DMA pages for different purposes.

> And, please, where can I find out more about multilevel TCE tables? (source files are OK; if there are docs, those are more than welcome).

The IODA spec describes it in "Multi-level table TCE Fetching".

> > In the meantime, could you try patching QEMU like this?
>
> Yes, I'll set up a local box to try it (the original one is in a client's network and not accessible). It might take a while but I'll do it.

Never mind, QEMU picks levels=1 for 32GB anyway so it won't make a difference.

For now, please try this particular patch on the host kernel: https://github.com/aik/linux/commit/cbd0c452d6a0221211ef4c87cb03cc6f01db1ae4

mfoliveira commented 7 years ago

Hi @aik

> > Surprised; this patch fixed this problem for us in several tests. Can you please send me the Oops log? (at least stack trace + NIP/LR)
>
> Does not make much sense. In fact, everything trying to use 4-level TCE tables fails with EEH [snip]

Okay, but I didn't say the patch fixes the Frozen PHB problem, only the Oops in the mpt3sas driver's slot-reset hook during the respective EEH recovery. :-) If you still hit that Oops during EEH recovery there, I'd be interested in the stack trace/NIP/LR in order to improve the patch to catch more cases, please.

> > Cool. But wasn't this adapter/driver doing 64-bit DMA?
>
> It is using 32-bit for the coherent mask and 64-bit for noncoherent: different DMA pages for different purposes.

Ah.

> > And, please, where can I find out more about multilevel TCE tables? (source files are OK; if there are docs, those are more than welcome).
>
> The IODA spec describes it in "Multi-level table TCE Fetching".

Cool, thanks!

> > In the meantime, could you try patching QEMU like this? [snip]
>
> Never mind, QEMU picks levels=1 for 32GB anyway so it won't make a difference.

Ack.

> For now, please try this particular patch on the host kernel: aik@cbd0c45

Sure; posting results soon.

Thank you.

aik commented 7 years ago

Any luck?

mfoliveira commented 7 years ago

> Any luck?

Sorry, should have posted news earlier.

While checking this patch I noticed there's something 'different' (a problem) happening in the guest, so I've been trying to confirm whether it's due to this patch, a regression between the 4.9 and the 4.10 kernel, a misbuilt qemu, or something else.

I can tell you that I no longer see this original problem in the host (very good news, thank you very much for the patch!), but I guess we cannot confirm it's all good until that other problem is understood.

I should return to this task today.

Thanks!

mfoliveira commented 7 years ago

Er, couldn't get to it today, sorry. Planning for tomorrow / Tuesday.

mfoliveira commented 7 years ago

So, it seems there's a regression w/ the 4.10 kernel in HostOS (without this patch applied) which produces adapter firmware faults in the PCI passthrough mode. This problem didn't happen w/ the 4.9 kernel.

I'll rebuild the 4.9 kernel w/ your patch, in order to validate it properly. Then track down this regression.

Sorry for the delay with this one.

mfoliveira commented 7 years ago

@aik

Your patch resolved the problem. Tested 3 times, no errors (the error occurred every single time without this patch).

Kernel package version used for comparison:

# uname -r
4.10.0-5.gitb0bad18.el7.centos.ppc64le

The regression I mentioned is present in the original/unpatched kernel, and is likely a VFIO thing.

Thank you.

open-power-host-os / linux

PCI passthrough: Frozen PE / EEH recovery happens in the host if driver is loaded after the guest is shutdown and device is reattached to the host #11