Closed by AboorvaDevarajan 3 months ago
Thanks for reporting this @AboorvaDevarajan. I'm actually working on CPU hotplugging support right now, so I should be able to push a fix soon. Also, interesting to see that you're using sched_ext on ppc64le!
@AboorvaDevarajan this should be fixed in #408. I've temporarily dropped the dependency on the Topology crate, because there's still an issue that can happen with CPU hotplugging (but that's a separate one, I'll look at that later). In the meantime scx_bpfland should handle CPU hotplugging properly with #408 applied.
Andrea, thanks for the fix. I tested the patch and it resolves the issue: the bpfland scheduler now runs without crashing.
However, I noticed a kernel panic when CPU hotplug is carried out while the custom scheduler is running, and the system becomes unresponsive. This also occurs with the simple scheduler, so it doesn't seem related to just bpfland. I'll investigate this a bit more and report it upstream (kernel).
@AboorvaDevarajan yeah this looks more like a kernel bug, feel free to open another issue and let me know if you are able to get more details. Thanks!
System Info:
Kernel Version: 6.10.0-rc2+ + struct_ops patches on ppc64le
SCX Version: Latest upstream
Steps to recreate the issue:
Run the bpfland scheduler.
Execute the following command to stress the CPUs:
stress-ng --cpu=100
Offline CPUs sequentially from 2 to 127:
for i in {2..127}; do echo 0 > /sys/devices/system/cpu/cpu$i/online; done
During the process of offlining CPUs, the system successfully unregisters some CPUs without issues. However, it occasionally encounters the following error:
When a CPU is offlined, its associated topology information in sysfs is also unregistered and removed. However, the scheduler still tries to read that CPU's topology file after the CPU has gone offline, which leads to this failure.
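The race described above can be illustrated with a small shell sketch. This is not how scx_bpfland or the Topology crate reads sysfs; the helper name `read_cpu_topology_attr` and its arguments are hypothetical, and the only assumption about the kernel is the standard `cpuN/topology/<attr>` layout under `/sys/devices/system/cpu`. The point is simply that a reader should treat a missing topology file as "CPU is offline" rather than as a fatal error.

```shell
#!/bin/sh
# Hypothetical helper (not part of scx): print one topology attribute of a CPU,
# or return non-zero if the file is gone because the CPU was offlined.
#   $1 = sysfs CPU root, e.g. /sys/devices/system/cpu
#   $2 = CPU number
#   $3 = attribute name, e.g. core_id
read_cpu_topology_attr() {
    path="$1/cpu$2/topology/$3"
    # Read directly instead of testing for existence first: the file can
    # disappear between a check and the read while CPUs are being offlined.
    if value=$(cat "$path" 2>/dev/null); then
        printf '%s\n' "$value"
        return 0
    fi
    return 1
}
```

A caller could then skip offlined CPUs instead of failing, for example: `read_cpu_topology_attr /sys/devices/system/cpu 2 core_id || echo "cpu2 has no topology info, skipping"`.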
Error Output