opiproject / opi-prov-life

Provisioning, Lifecycle and Platform Management Group
Apache License 2.0

DPU fw upgrade/reboot caused Host crash due to PCIe DPC events #180

Open glimchb opened 1 year ago

glimchb commented 1 year ago

@ballle98 @tedstreete @jainvipin can you please add all the details, thoughts, and debug info that you have on this? We can then start bringing in more people, and we don't want them to have to read the entire Slack history to understand the issue.

tedstreete commented 1 year ago

@glimchb @ballle98 @jainvipin Here's an initial thought on DPU/HOST DPC behavior for Host-reset, DPU-reset and DPU OS install events.

Host OS Reset or Crash

DPU OS Reset or Crash

DPU OS install mode

glimchb commented 1 year ago

thanks @tedstreete this info is useful for https://github.com/opiproject/opi-prov-life/blob/main/BOOTSEQ.md

I was hoping we could use this issue to understand and debug why a FW upgrade/reboot even causes the Host to crash completely.

DPU OS crash or reset. DPU will trigger DPC/CI surprise remove event. Host OS and BMC/BIOS will need to handle these events gracefully so that Host OS remains active for long enough for the DPU OS crash dump to complete.

Why is DPC not working? Do we have kernel dumps to attach here that show what happens when the DPU reboots and causes the Host to crash?
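For anyone collecting the debug info requested here, a minimal sketch of one way to pull the relevant lines out of a kernel log. The function name `filter_pcie_events` is hypothetical, and the keyword list is an assumption: it relies on the kernel tagging these messages with `DPC`, `AER`, or `pciehp`, which may not cover every platform or kernel version.

```shell
#!/bin/sh
# Hypothetical helper: keep only kernel log lines that mention DPC, AER,
# or pciehp (the kernel's PCIe hotplug driver). Reads log text on stdin.
# -i: case-insensitive, -w: whole words only, -E: extended regex.
filter_pcie_events() {
    grep -iwE 'dpc|aer|pciehp'
}
```

Possible usage on the host after reproducing the problem: `dmesg | filter_pcie_events > pcie-events.txt`. If the host crashed and rebooted, `journalctl -k -b -1 | filter_pcie_events` reads the previous boot's kernel log instead, assuming persistent journaling is enabled.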

tedstreete commented 1 year ago

@glimchb The primary issue is that neither of the two host OSes properly manages PCI surprise remove events. The historical expectation is that a failure of a PCIe device will always result in a Host OS crash. The introduction of independently functional devices, like DPUs, breaks that expectation.

OPI will need to determine what behaviors we want the host OS to offer in the event of a DPU crash/reset/graceful-restart and then make the necessary changes to the Linux kernel/PCIe subsystem and the host BIOS/BMC (iDRAC for Dell, iLO for HP, etc.).

seroyer commented 1 year ago

Just as a data point, Fedora, CentOS, and RHEL all enable the DPC support by default.

For example, from a RHEL 8.6 host:

$ grep CONFIG_PCIE_DPC /boot/config-4.18.0-372.26.1.el8_6.x86_64
CONFIG_PCIE_DPC=y

And from the tip of rawhide:

$ grep CONFIG_PCIE_DPC kernel-x86_64-*.config
kernel-x86_64-debug-fedora.config:CONFIG_PCIE_DPC=y
kernel-x86_64-debug-rhel.config:CONFIG_PCIE_DPC=y
kernel-x86_64-fedora.config:CONFIG_PCIE_DPC=y
kernel-x86_64-rhel.config:CONFIG_PCIE_DPC=y

glimchb commented 1 year ago

Linux does have support for DPC, but the functionality is limited.

@tedstreete can you please elaborate?

I know Intel is making a lot of improvements in this area in its next generation... do we have data from AMD as well?

tedstreete commented 1 year ago

@seroyer @glimchb The primary issue is that the default behavior when a surprise removal occurs is to crash the OS. OPI needs to determine what other behaviors we want the kernel to exhibit and then ensure that the kernel/PCIe subsystem/BIOS/BMC offer those options.

Additionally, while it's not mandatory if DPC events are managed gracefully, I'd argue that an ability to disable the Host/DPU PCIe link during DPU OS install/upgrade is a benefit we should explore.
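As a rough illustration of the idea above, the host can already detach a PCI function in an orderly way through the standard Linux sysfs `remove` and `rescan` files. The sketch below is an assumption about how that could be wrapped for a DPU upgrade; the function names and the `SYSFS_PCI` override are hypothetical, and the BDF in the usage note is a placeholder. This unbinds the driver and removes the device from the host's view; it does not disable the physical link, and whether it avoids the crash described in this issue on a given platform is untested.

```shell
#!/bin/sh
# Hypothetical helpers for orderly detach/reattach of a PCI function on the host.
# SYSFS_PCI can be pointed at a scratch directory for dry runs.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci}"

# Ask the kernel to remove one PCI function, given its BDF (domain:bus:dev.fn).
detach_pci_device() {
    bdf="$1"
    dev="$SYSFS_PCI/devices/$bdf"
    if [ ! -e "$dev/remove" ]; then
        echo "no such PCI device: $bdf" >&2
        return 1
    fi
    echo 1 > "$dev/remove"
}

# Ask the kernel to rescan all PCI buses so removed devices are rediscovered.
rescan_pci_bus() {
    echo 1 > "$SYSFS_PCI/rescan"
}
```

Possible usage, as root on the host: `detach_pci_device 0000:3b:00.0` before starting the DPU OS install, then `rescan_pci_bus` once the DPU is back up. A DPU that exposes several functions would need each one detached.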