rhkdump / kdump-utils

Kernel crash dump collection utilities
GNU General Public License v2.0

sysconfig: add pcie_ports compat to KDUMP_COMMANDLINE_APPEND on x86_64 #4

Closed liutgnu closed 2 months ago

liutgnu commented 2 months ago

There have been a number of kdump failures in the 2nd kernel, where usually only one CPU is enabled by "nr_cpus=1" but a large number of devices are present, which can easily exceed the IRQ resources that one CPU can handle. As a result, the 2nd kernel hangs and kdump fails. This issue is often observed on machines with many CPUs and many devices.

On those systems, pcieports consume a considerable proportion of the IRQ resources, and many messages like the following can be seen in the dmesg log:

pcieport 0000:18:01.0: PME: Signaling with IRQ 109
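
A rough way to gauge this on a given machine is to count those lines in the kernel log. The following is only a minimal sketch, assuming the message format shown above; `count_pcieport_irqs` is a hypothetical helper name and is not part of kdump-utils.

```shell
#!/bin/sh
# Count "pcieport ... PME: Signaling with IRQ N" lines read from stdin.
# count_pcieport_irqs is a hypothetical helper, not part of kdump-utils.
count_pcieport_irqs() {
    # The pattern is unanchored, so dmesg timestamp prefixes do not matter.
    grep -c 'pcieport .*PME: Signaling with IRQ [0-9][0-9]*'
}

# Usage (needs permission to read the kernel log):
#   dmesg | count_pcieport_irqs
```

Each counted line corresponds to one IRQ taken by a PCIe port for PME signaling, so a large count on a many-device machine hints at the pressure described here.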

According to the kernel doc [1], applying "pcie_ports=compat" disables native PCIe services (PME, AER, DPC, PCIe hotplug). These services handle power management events, error reporting, downstream port containment, and hotplug, none of which are must-have functions for kdump. In addition, testing has shown no side effects, such as failing to write the vmcore to sdx, nvme, etc.

This patch disables native PCIe services for the 2nd kernel, to save the scarce IRQ resources and increase the kdump success rate.
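
For illustration, the change amounts to something like the following in the x86_64 sysconfig file. This is a sketch and not the actual diff: only `nr_cpus=1` and `pcie_ports=compat` are taken from this description, the real default carries more options, and `append_once` is a hypothetical helper that simply avoids adding the option twice.

```shell
#!/bin/sh
# Sketch only: the real KDUMP_COMMANDLINE_APPEND default carries more
# options; nr_cpus=1 stands in for them here.
KDUMP_COMMANDLINE_APPEND="nr_cpus=1"

# append_once VALUE OPTION (hypothetical helper):
# print VALUE with OPTION appended, unless OPTION is already present.
append_once() {
    case " $1 " in
        *" $2 "*) printf '%s\n' "$1" ;;
        *) printf '%s\n' "$1 $2" ;;
    esac
}

KDUMP_COMMANDLINE_APPEND=$(append_once "$KDUMP_COMMANDLINE_APPEND" "pcie_ports=compat")
echo "$KDUMP_COMMANDLINE_APPEND"
```

Keeping the option in KDUMP_COMMANDLINE_APPEND means it only affects the 2nd kernel's command line, so the 1st kernel keeps its native PCIe services.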

prarit commented 2 months ago

This makes sense to me. The only concern anyone should have is that a PCIE error could have been responsible for taking down the kernel in the first place, and booting into the second kernel could then also have a fatal problem. I'm not sure we can ever fix that type of cascade of panics :) so it makes sense to disable these features.

liutgnu commented 2 months ago

Hi @prarit ,

Thanks for reviewing the patch and your comments!

> This makes sense to me. The only concern anyone should have is that a PCIE error could have been responsible for taking down the kernel in the first place, and booting into the second kernel could then also have a fatal problem. I'm not sure we can ever fix that type of cascade of panics :) so it makes sense to disable these features.

If that case (a PCIE error taking down the 1st kernel) does happen, I guess that even with all native pcieport services (PME, AER, DPC, PCIe hotplug) enabled, the 2nd kernel will notice the PCIE error and crash as well, right? If that is true, then in my experience I haven't noticed such a PCIE crash in the 2nd kernel. Any ideas? @daveyoung @baoquan-he

Thanks, Tao Liu

daveyoung commented 2 months ago

> Hi @prarit ,
>
> Thanks for reviewing the patch and your comments!
>
> > This makes sense to me. The only concern anyone should have is that a PCIE error could have been responsible for taking down the kernel in the first place, and booting into the second kernel could then also have a fatal problem. I'm not sure we can ever fix that type of cascade of panics :) so it makes sense to disable these features.
>
> If that case (a PCIE error taking down the 1st kernel) does happen, I guess that even with all native pcieport services (PME, AER, DPC, PCIe hotplug) enabled, the 2nd kernel will notice the PCIE error and crash as well, right? If that is true, then in my experience I haven't noticed such a PCIE crash in the 2nd kernel. Any ideas? @daveyoung @baoquan-he

I do not remember any such failure in past years. It would be very rare, and as Prarit said, we probably can do nothing about it in the kdump kernel. Since we moved to the GitHub process, I have noticed there are no "Acked-by" or "Reviewed-by" tags any more. Tao, can you refresh the patch, adding Prarit's ack and including Prarit's comments in the patch log?

liutgnu commented 2 months ago

V2 has been posted at https://github.com/rhkdump/kdump-utils/pull/9, so I will close this one.