mfoliveira opened 7 years ago
@aik Can you please take a look at this as soon as possible? Thank you! @paulusmack @laggarcia @bjking1 @sgarfinkle FYI
On my local setup it is enough to boot the host without the mpt3sas driver and then simply run "modprobe mpt3sas" - there is an EEH exactly as reported here. Will continue on Monday...
Interesting.
Btw, if that Oops after the EEH annoys you, I already have a patch for it, to be submitted; just let me know and I'll send it to you.
Yes, please send the patch. Thanks.
P.S. I just cannot make mpt3sas load on the upstream kernel at all. Hm.
@aik just sent via e-mail.
Thanks. It did not help though; it still crashes, slightly differently, but the net result is the same - mpt3sas does not bind to the device.
Update: turns out multilevel TCE tables for 32-bit DMA do not work properly. Hm. Disabled them and can proceed.
How much RAM does the guest in the test get?
In the meantime, could you try patching QEMU like this?
```diff
diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c
index 9d090270f6..80a5d0e3dd 100644
--- a/hw/vfio/spapr.c
+++ b/hw/vfio/spapr.c
@@ -166,7 +166,7 @@ int vfio_spapr_create_window(VFIOContainer *container,
     entries = create.window_size >> create.page_shift;
     pages = MAX((entries * sizeof(uint64_t)) / getpagesize(), 1);
     pages = MAX(pow2ceil(pages) - 1, 1); /* Round up */
-    create.levels = ctz64(pages) / 6 + 1;
+    create.levels = 1;//ctz64(pages) / 6 + 1;
 
     ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
     if (ret) {
```
@aik
> Thanks. It did not help though; it still crashes, slightly differently, but the net result is the same - mpt3sas does not bind to the device.
Surprised; this patch fixed this problem for us in several tests. Can you please send me the Oops log? (at least stack trace + NIP/LR)
> Update: turns out multilevel TCE tables for 32-bit DMA do not work properly. Hm. Disabled them and can proceed.
Cool. But wasn't this adapter/driver doing 64-bit DMA? And, please, where can I find out more about multilevel TCE tables? (source files are OK; if there are docs, those are more than welcome).
> How much RAM does the guest in the test get?
32 GiB.
> In the meantime, could you try patching QEMU like this?
Yes, I'll set up a local box to try it (the original one is on a client's network and not accessible). It might take a while but I'll do it.
Thanks!
> Surprised; this patch fixed this problem for us in several tests. Can you please send me the Oops log? (at least stack trace + NIP/LR)
It does not make much sense; in fact, everything trying to use 4-level TCE tables fails with EEH, while 3 levels are OK. It has not been noticed so far because by default 32-bit windows only use 1 level and most devices are 64-bit only anyway; only my test branch exposed the problem, which seems to be unrelated to what this bug is about.
> Cool. But wasn't this adapter/driver doing 64-bit DMA?
It is using a 32-bit mask for coherent DMA and a 64-bit mask for noncoherent DMA - different DMA pages for different purposes.
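In Linux driver terms, that split looks roughly like the sketch below - purely illustrative (the `example_*` name is made up, and this is not the literal mpt3sas code), using the standard streaming vs. coherent mask APIs:

```c
/*
 * Illustrative sketch of the split DMA masks described above; not the
 * literal mpt3sas code.
 */
#include <linux/dma-mapping.h>
#include <linux/pci.h>

static int example_set_dma_masks(struct pci_dev *pdev)
{
    int rc;

    /* Streaming (noncoherent) DMA may use the full 64-bit address space. */
    rc = dma_set_mask(&pdev->dev, DMA_BIT_MASK(64));
    if (rc)
        return rc;

    /* Coherent allocations are kept below 4 GiB, i.e. within the
     * default 32-bit TCE window. */
    return dma_set_coherent_mask(&pdev->dev, DMA_BIT_MASK(32));
}
```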
> And, please, where can I find out more about multilevel TCE tables? (source files are OK; if there are docs, those are more than welcome).
The IODA spec describes it in the "Multi-level table TCE Fetching" section.
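For anyone else following: the structure is a small tree of table pages - non-leaf entries point at next-level tables and only the last level holds actual TCEs. A toy userspace sketch of the walk (entry counts and names are invented; real hardware stores physical addresses rather than pointers):

```c
/* Toy model of a multi-level TCE table walk; sizes and names invented. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define LEVEL_SHIFT   2                    /* 4 entries per level (toy size) */
#define LEVEL_ENTRIES (1u << LEVEL_SHIFT)
#define LEVEL_MASK    (LEVEL_ENTRIES - 1)

/* Non-leaf entries hold the address of the next level's table (a plain
 * pointer here; real hardware stores a physical address). The last level
 * holds the TCEs themselves. */
static uint64_t tce_lookup(uint64_t *root, unsigned levels, uint64_t idx)
{
    uint64_t *table = root;

    while (--levels) {
        /* Consume this level's slice of the index, highest bits first. */
        unsigned slot = (idx >> (levels * LEVEL_SHIFT)) & LEVEL_MASK;
        table = (uint64_t *)(uintptr_t)table[slot]; /* descend one level */
    }
    return table[idx & LEVEL_MASK];
}

int main(void)
{
    /* Build a 2-level toy table: root slot 0 points at one leaf table. */
    uint64_t *leaf = calloc(LEVEL_ENTRIES, sizeof(uint64_t));
    uint64_t *root = calloc(LEVEL_ENTRIES, sizeof(uint64_t));

    leaf[3] = 0xdeadbeef0000ULL;            /* pretend TCE for index 3 */
    root[0] = (uint64_t)(uintptr_t)leaf;

    printf("TCE[3] = 0x%llx\n", (unsigned long long)tce_lookup(root, 2, 3));
    free(leaf);
    free(root);
    return 0;
}
```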
>> In the meantime, could you try patching QEMU like this?
>
> Yes, I'll set up a local box to try it (the original one is on a client's network and not accessible). It might take a while but I'll do it.
Never mind, QEMU picks levels=1 for 32GB anyway so it won't make a difference.
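To make that concrete: here is the arithmetic of the formula from the diff above for a 32 GiB window, assuming a 64 KiB host page size (`pow2ceil`/`ctz64` below are stand-ins for QEMU's own helpers):

```c
/* Worked example of the levels formula from the diff above; assumes a
 * 64 KiB host page size. pow2ceil()/ctz64() are stand-ins for QEMU's
 * own helpers. */
#include <stdint.h>
#include <stdio.h>

static uint64_t pow2ceil(uint64_t v)   /* round up to a power of two */
{
    uint64_t p = 1;

    while (p < v)
        p <<= 1;
    return p;
}

static int ctz64(uint64_t v) { return v ? __builtin_ctzll(v) : 64; }

int main(void)
{
    uint64_t window_size = 32ULL << 30;   /* 32 GiB DMA window */
    unsigned page_shift = 16;             /* 64 KiB IOMMU pages */
    uint64_t pagesize = 65536;            /* getpagesize() on ppc64le */

    uint64_t entries = window_size >> page_shift;              /* 2^19 */
    uint64_t pages = (entries * sizeof(uint64_t)) / pagesize;  /* 64 */

    if (pages < 1)
        pages = 1;
    pages = pow2ceil(pages) - 1;                               /* 63 */
    if (pages < 1)
        pages = 1;

    int levels = ctz64(pages) / 6 + 1;    /* ctz64(63) = 0 -> levels = 1 */

    printf("entries=%llu pages=%llu levels=%d\n",
           (unsigned long long)entries, (unsigned long long)pages, levels);
    return 0;
}
```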
For now, please try this particular patch on the host kernel: https://github.com/aik/linux/commit/cbd0c452d6a0221211ef4c87cb03cc6f01db1ae4
Hi @aik
>> Surprised; this patch fixed this problem for us in several tests. Can you please send me the Oops log? (at least stack trace + NIP/LR)
>
> It does not make much sense; in fact, everything trying to use 4-level TCE tables fails with EEH [snip]
Okay, but I didn't say the patch fixes the Frozen PHB problem, only the Oops in the mpt3sas driver's slot-reset hook during the respective EEH recovery. :-)
If you still hit that Oops during EEH recovery there, I'd be interested in the stack trace/NIP/LR in order to improve the patch to catch more cases, please.
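For context on where that Oops lives: EEH recovery calls back into the driver through its struct pci_error_handlers. A purely illustrative guard in a slot-reset hook might look like this (the `example_*` names are made up; this is not the actual patch):

```c
/*
 * Purely illustrative: where a slot-reset hook sits in EEH/AER recovery.
 * The NULL guard is a generic example, not the actual mpt3sas fix.
 */
#include <linux/pci.h>

struct example_adapter { int hba_ready; }; /* stand-in for the driver's ioc */

static pci_ers_result_t example_slot_reset(struct pci_dev *pdev)
{
    struct example_adapter *ioc = pci_get_drvdata(pdev);

    if (!ioc) /* recovery can race with probe/teardown */
        return PCI_ERS_RESULT_DISCONNECT;

    /* Reinitialize the adapter after the PE reset here... */
    return PCI_ERS_RESULT_RECOVERED;
}

static const struct pci_error_handlers example_err_handlers = {
    .slot_reset = example_slot_reset,
};
```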
>> Cool. But wasn't this adapter/driver doing 64-bit DMA?
>
> It is using a 32-bit mask for coherent DMA and a 64-bit mask for noncoherent DMA - different DMA pages for different purposes.
Ah.
>> And, please, where can I find out more about multilevel TCE tables? (source files are OK; if there are docs, those are more than welcome).
>
> The IODA spec describes it in the "Multi-level table TCE Fetching" section.
Cool, thanks!
>> In the meantime, could you try patching QEMU like this? [snip]
>
> Never mind, QEMU picks levels=1 for 32GB anyway so it won't make a difference.
Ack.
> For now, please try this particular patch on the host kernel: aik@cbd0c45
Sure; posting results soon.
Thank you.
Any luck?
> Any luck?
Sorry, should have posted news earlier.
While checking this patch I noticed something 'different' (a problem) happening in the guest, so I've been trying to confirm whether it's due to this patch, a regression between the 4.9 and 4.10 kernels, a misbuilt QEMU, or something else.
I can tell you that I no longer see this original problem in the host (very good news, thank you very much for the patch!), but I guess we cannot confirm it's all good until that other problem is understood.
I should return to this task today.
Thanks!
Er, couldn't get to it today, sorry. Planning for tomorrow / Tuesday.
So, it seems there's a regression w/ the 4.10 kernel in HostOS (without this patch applied) which produces adapter firmware faults in PCI passthrough mode. This problem didn't happen w/ the 4.9 kernel.
I'll rebuild the 4.9 kernel w/ your patch, in order to validate it properly. Then track down this regression.
Sorry for the delay with this one.
@aik
Your patch resolved the problem. Tested 3 times, no errors (the error occurred every single time without this patch).
Kernel package version used for comparison:
```
# uname -r
4.10.0-5.gitb0bad18.el7.centos.ppc64le
```
The regression I mentioned is present in the original/unpatched kernel, and is likely a VFIO thing.
Thank you.
Scenario: PCI passthrough of the SAS3008-based PCIe adapter in the 8001-22C system.
Steps to reproduce:

1. Host: detach the adapter (`virsh nodedev-detach pci_0001_03_00_0`)
2. Host: start a guest with PCI passthrough (`virsh start --console <guest>`)
3. Guest: load the driver, which initializes the adapter, scans for disks, etc. (`modprobe mpt3sas`)
4. Guest: shut down (`poweroff`)
5. Host: reattach the adapter (`virsh nodedev-reattach pci_0001_03_00_0`)
6. Host: load the driver, which starts to initialize the adapter and hits the Frozen PE / EEH recovery (`modprobe mpt3sas`)

During driver initialization the following Frozen PE / EEH recovery is consistently observed. There is an Oops in the driver code afterward, but that's another problem which I'll be looking at.
Decoding the PEST bits shows this is a DMA write with an invalid page access. The suspicion is that there are pending operations/configuration left over from the guest, and since the PE was not reset in a way that actually clears these in this adapter, the problem is hit.
In that scenario, this problem is expected to be resolved by the patch series which was applied downstream on PowerKVM [1], and is now being reworked as a VFIO-based approach by @aik.
[1] https://lists.ozlabs.org/pipermail/linuxppc-dev/2015-February/124867.html