Closed EngineLessCC closed 4 days ago
@Alan-Jowett , it is unclear if the PR (dynamic update of proc) fixes the issue mentioned here. Keeping the issue untriaged, so that we can get your input in this matter.
The issue is as follows: The API KeQueryMaximumProcessorCountEx returns the maximum number of CPUs that the system will ever have (i.e. via hot-add or other mechanism that brings new CPUs online). The issue is that we can't pin DPCs to CPUs that are reported by KeQueryMaximumProcessorCountEx, but not yet active.
The solution is to track when CPUs become active (i.e. have BPF code running on them), so that the DPC can then be pinned on the CPU the first time it calls epoch enter.
After adding code to track initial activation of CPUs, it was only a small step to support deactivation of idle CPUs (for power efficiency).
@Alan-Jowett
The issue is as follows: The API KeQueryMaximumProcessorCountEx returns the maximum number of CPUs that the system will ever have (i.e. via hot-add or other mechanism that brings new CPUs online). The issue is that we can't pin DPCs to CPUs that are reported by KeQueryMaximumProcessorCountEx, but not yet active.
The solution is to track when CPUs become active (i.e. have BPF code running on them), so that the DPC can then be pinned on the CPU the first time it calls epoch enter.
After adding code to track initial activation of CPUs, it was only a small step to support deactivation of idle CPUs (for power efficiency).
The description of this issue states "KeQueryMaximumProcessorCountEx may return a wrong amount of CPUs on some Systems". I don't follow how your answer is related to that statement.
Perhaps "KeQueryMaximumProcessorCountEx may return a wrong amount of CPUs on some Systems" should be rephrased in the description as "some systems [at some times] have KeQueryMaximumProcessorCountEx > KeQueryActiveProcessorCountEx" which is true and not a bug.
Perhaps in some cases KeQueryMaximumProcessorCountEx > KeQueryActiveProcessorCountEx is caused by a bug, but that is orthogonal to eBPF's handling of the generally legitimate scenario.
IIRC usersim can already emulate this scenario. To test this in kernel mode on arbitrary HW, https://learn.microsoft.com/en-us/windows-hardware/drivers/devtest/pnpcpu, which will cause the MaxCount > ActiveCount, and optionally "hot-add" CPUs at runtime by making them active. I'm not aware of any rollback coverage where a partially hot-added CPU is abandoned.
I have been able to experience this issue on two of my systems:
Baremetal: Single socket AMD Ryzen 9 7900 12 Core ^ I am certain this cpu reports a fixed amount of cores (12C 24T), so I'm not sure why KeQueryMaximumProcessorCountEx gives a faulty result here.
KVM+QEMU: EPYC 7402P 24 Core Host: Virt => 1 Socket, 16 virtual cores
And at the moment, neither of these can load the driver due to this issue
@mtfriesen You are correct. The issue occurs when KeQueryMaximumProcessorCountEx > KeQueryActiveProcessorCountEx, irrespective of the cause (i.e. if the maximum processor count is higher due to a bug in the firmware or if it's higher due to the system actually supporting hot-add).
Note: The fix doesn't rely on KeRegisterProcessorChangeCallback. Instead, processors are only "activated" from eBPF point of view when ebpf_epoch_enter on that CPU. This solves both this issue as well as avoiding queueing spurious DPCs to CPUs that aren't doing eBPF related work.
Discussion in the triage meeting: bug in epoch logic, if max proc count exceeds active proc count. The solution is to support hot add/remove. @Alan-Jowett , please create an issue to add test coverage for this case.
Describe the bug
KeQueryMaximumProcessorCountEx
may return a wrong amount of CPUs on some Systems, which can lead the driver to fail and bail during ebpf_epoch_initiateTraced and found this issue together with @Alan-Jowett
As per their suggestion, migrating to
KeRegisterProcessorChangeCallback
should fix this issue.OS information
Windows Server 2022 (20348.1.amd64fre.fe_release.210507-1500)
Steps taken to reproduce bug
Expected behavior
Actual outcome
Driver load failure with status code 87 (invalid parameter)
Additional details
Recorded trace:
Failed api:
KeGetProcessorNumberFromIndex