sched-ext / scx

sched_ext schedulers and tools
https://bit.ly/scx_slack
GNU General Public License v2.0
838 stars 78 forks source link

scx_rusty: Refactor lookup operation for new_domc in task_set_domain #362

Closed vax-r closed 3 months ago

vax-r commented 3 months ago

Summary

Modify the execution sequence before lookup operation for new_domc. If new_dom_id == NO_DOM_FOUND, lookup operation for new_domc is definitely going to fail so we don't have to wait until we found that new_domc is NULL, clearing of cpumask and return operation should be done directly in that case.

Plus we should avoid using try_lookup_dom_ctx outside the context of lookup_dom_ctx, as it can keep the interface's consistency.

vax-r commented 3 months ago

The test_sched test failed , is it failed due to the code I changed ? Somehow I failed to pass test_sched on local even without the code change.

Byte-Lab commented 3 months ago

So sometimes the scheduler can fail to load depending on what version of clang was used to compile the BPF program. Your code of course looks totally fine, so it might just be that for some reason the verifier is failing to correctly prune branches. One thing that sticks out is that we're doing an extra lookup of taskc->cpumask on lines 755-759 which we should probably remove. Can we add that to this diff and see if it helps? If not I can loop in a BPF person to help us understand why the verifier doesn't like this.

vax-r commented 3 months ago

So sometimes the scheduler can fail to load depending on what version of clang was used to compile the BPF program. Your code of course looks totally fine, so it might just be that for some reason the verifier is failing to correctly prune branches. One thing that sticks out is that we're doing an extra lookup of taskc->cpumask on lines 755-759 which we should probably remove. Can we add that to this diff and see if it helps? If not I can loop in a BPF person to help us understand why the verifier doesn't like this.

@Byte-Lab - It doesn't worked either after line 755~759 is removed. Probably need a further check , PR #365 also failed to pass the check and that PR contains only fixing typos.

Byte-Lab commented 3 months ago

Ok, I was able to reproduce this locally. Will debug and get back to you.

Byte-Lab commented 3 months ago

With this PR, your PR should now load fine: https://github.com/sched-ext/scx/pull/366/files.

Once that lands, you can rebase and repush and your change should work great (it worked fine locally).

vax-r commented 3 months ago

@Byte-Lab - Yeah it works fine now, thanks for the help. Mind if I ask how did you identify the bug ? I saw the error message was something about missing opts.dump() in the action detail, how did you find out the problem is within dom_xfer_task() ?

Byte-Lab commented 3 months ago

Mind if I ask how did you identify the bug ?

Just experience with BPF :-) Sometimes the verifier will get confused and think there's an infinite loop due to not pruning branches correctly. I've seen this issue trigger in some form or another multiple times with dom_xfer_task(); especially in the ravg code. By making dom_xfer_task() a global subprog instead of a local subprog, BPF doesn't walk into that subprog from the caller during verification, and instead treats it as totally separate. This cuts down on the number of possible branches / complexity it has to account for significantly, but the downside is that you have to do annoying stuff like pass pids and do lookups for all of the trusted pointers you want to get which you already had in the caller. the BPF folks are looking into why the verifier was getting confused, so we might be able to revert https://github.com/sched-ext/scx/pull/366 at some point.

I saw the error message was something about missing opts.dump()

This is a red herring, not relevant to the failure. we should maybe silence this as it may confuse people unnecessarily :-(