scx_layered: Add per-layer sched-delay metric (and maybe others too)

htejun commented 3 weeks ago

As scheduler behavior metrics can vary widely across different layers, system-level metrics aren't that useful in understanding how the scheduler is behaving. Add more per-layer metrics, including a per-layer scheduling delay metric.

minosfuture commented 3 days ago

Hi @htejun, I'm interested in working on this. I'm trying to get familiar with the repo by helping with good "first" issues.

I'm assuming there is no existing API that provides the scheduling delay metric for a task, so it needs to be implemented on a per-scheduler basis? It'll be the accumulated delay between being enqueued and running, if I understand correctly.

Is there an example implementation in other schedulers that I can refer to? If not, are we interested in adding such a metric for other schedulers as well (or will we just rely on the kernel's overall scheduling delay metric)?

htejun commented 3 days ago

The information is useful for all schedulers but overall system metrics are easily observable with e.g. bpftrace or bcc tools. How the metrics should be aggregated would depend on the specific scheduler - e.g. scx_layered needs to collect the metrics per layer. scx_bpfland would probably want to aggregate depending on whether the task is classified interactive or not and so on. One alternative approach could be coming up with a shared way of "tagging" tasks so that a generic BPF tool can aggregate the numbers according to the tags.

We can start with a baked-in implementation in each scheduler. I'd measure the durations whenever the task is runnable but not running - i.e. ops.runnable() to ops.running() transition durations and ops.stopping() to the subsequent ops.running() transitions.
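As a rough illustration of the bookkeeping described above, here is a minimal Python model. It is not BPF code and not how scx_layered implements anything. The names `SchedDelayTracker`, `LayerStats`, `wait_start` and the layer label `"batch"` are hypothetical, and timestamps are passed in by hand instead of being read from a clock. It assumes the stopping callback is told whether the task is still runnable, so that only "runnable but not running" time is counted.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class LayerStats:
    """Accumulated scheduling delay for one layer (hypothetical structure)."""
    total_delay_ns: int = 0
    nr_samples: int = 0

    @property
    def avg_delay_ns(self) -> float:
        return self.total_delay_ns / self.nr_samples if self.nr_samples else 0.0


class SchedDelayTracker:
    """Toy model of per-layer sched-delay accounting driven by ops-style callbacks."""

    def __init__(self):
        self.wait_start = {}                    # pid -> time the task started waiting
        self.layers = defaultdict(LayerStats)   # layer name -> accumulated stats

    def runnable(self, pid, now_ns):
        # Task became runnable (e.g. it woke up): start the wait clock.
        self.wait_start[pid] = now_ns

    def running(self, pid, layer, now_ns):
        # Task got a CPU: close the open wait interval, if there is one,
        # and charge it to the layer the task currently belongs to.
        start = self.wait_start.pop(pid, None)
        if start is not None:
            stats = self.layers[layer]
            stats.total_delay_ns += now_ns - start
            stats.nr_samples += 1

    def stopping(self, pid, still_runnable, now_ns):
        # Task left the CPU. If it is still runnable it is waiting again,
        # so restart the wait clock; if it went to sleep, nothing is pending.
        if still_runnable:
            self.wait_start[pid] = now_ns
        else:
            self.wait_start.pop(pid, None)


tracker = SchedDelayTracker()
tracker.runnable(pid=1, now_ns=0)
tracker.running(pid=1, layer="batch", now_ns=300)            # waited 300 ns
tracker.stopping(pid=1, still_runnable=True, now_ns=1000)
tracker.running(pid=1, layer="batch", now_ns=1500)           # waited 500 ns
tracker.stopping(pid=1, still_runnable=False, now_ns=2000)   # sleeping, not waiting
```

In this model a scheduler with a different grouping (e.g. interactive vs. non-interactive) would only change what is passed as `layer`; the interval tracking itself stays the same.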

sched-ext/scx #637