Closed vax-r closed 3 months ago
The test_sched
test failed , is it failed due to the code I changed ? Somehow I failed to pass test_sched
on local even without the code change.
So sometimes the scheduler can fail to load depending on what version of clang was used to compile the BPF program. Your code of course looks totally fine, so it might just be that for some reason the verifier is failing to correctly prune branches. One thing that sticks out is that we're doing an extra lookup of taskc->cpumask
on lines 755-759 which we should probably remove. Can we add that to this diff and see if it helps? If not I can loop in a BPF person to help us understand why the verifier doesn't like this.
So sometimes the scheduler can fail to load depending on what version of clang was used to compile the BPF program. Your code of course looks totally fine, so it might just be that for some reason the verifier is failing to correctly prune branches. One thing that sticks out is that we're doing an extra lookup of
taskc->cpumask
on lines 755-759 which we should probably remove. Can we add that to this diff and see if it helps? If not I can loop in a BPF person to help us understand why the verifier doesn't like this.
@Byte-Lab - It doesn't worked either after line 755~759 is removed. Probably need a further check , PR #365 also failed to pass the check and that PR contains only fixing typos.
Ok, I was able to reproduce this locally. Will debug and get back to you.
With this PR, your PR should now load fine: https://github.com/sched-ext/scx/pull/366/files.
Once that lands, you can rebase and repush and your change should work great (it worked fine locally).
@Byte-Lab -
Yeah it works fine now, thanks for the help. Mind if I ask how did you identify the bug ?
I saw the error message was something about missing opts.dump()
in the action detail, how did you find out the problem is within dom_xfer_task()
?
Mind if I ask how did you identify the bug ?
Just experience with BPF :-) Sometimes the verifier will get confused and think there's an infinite loop due to not pruning branches correctly. I've seen this issue trigger in some form or another multiple times with dom_xfer_task()
; especially in the ravg code. By making dom_xfer_task()
a global subprog instead of a local subprog, BPF doesn't walk into that subprog from the caller during verification, and instead treats it as totally separate. This cuts down on the number of possible branches / complexity it has to account for significantly, but the downside is that you have to do annoying stuff like pass pids and do lookups for all of the trusted pointers you want to get which you already had in the caller. the BPF folks are looking into why the verifier was getting confused, so we might be able to revert https://github.com/sched-ext/scx/pull/366 at some point.
I saw the error message was something about missing opts.dump()
This is a red herring, not relevant to the failure. we should maybe silence this as it may confuse people unnecessarily :-(
Summary
Modify the execution sequence before lookup operation for
new_domc
. Ifnew_dom_id == NO_DOM_FOUND
, lookup operation fornew_domc
is definitely going to fail so we don't have to wait until we found thatnew_domc
isNULL
, clearing of cpumask and return operation should be done directly in that case.Plus we should avoid using
try_lookup_dom_ctx
outside the context oflookup_dom_ctx
, as it can keep the interface's consistency.