iovisor / bcc

BCC - Tools for BPF-based Linux IO analysis, networking, monitoring, and more
Apache License 2.0

runqlen.py tool output is weird compared to runqlen_example.txt #1565

Open joelagnel opened 6 years ago

joelagnel commented 6 years ago

Something seems a bit off to me with the runqlen tool. I tested this on a 44-core x86 machine, and the reported runqlen is always 0.

./tools/runqlen.py 1

     runqlen       : count     distribution
        0          : 4745     |****************************************|

     runqlen       : count     distribution
        0          : 4753     |****************************************|
...

Tried it on a 4-core machine as well, with make -j8 running on bcc, and got a similar result. This seems odd to me: higher runqlen values never appear. Any thoughts on why this is so?

yonghong-song commented 6 years ago

Don't know the reason, but you can check the source code to see whether cfs_rq_partial matches your kernel source or not. If it does not, then wrong results will be printed.

joelagnel commented 6 years ago

Thanks for the reply. I'm running kernel 4.14-rc5. Neither cfs_rq nor sched_entity has runnable_weight. However, check_runnable_weight_field() still returns True. Forcing it to return False doesn't change anything either, and I still see a runqlen of 0 as before. It seems the detection mechanism is not working correctly, or it's some other issue.

yonghong-song commented 6 years ago

I tried on 4.14-rc5. I checked that check_runnable_weight_field() does return false so the implementation here is correct.

To make runqlen() useful, you may need to run it on a busy system: (1) runqlen() does sampling, 99 times per second. (2) runqlen() checks the run queue length of the CURRENT task. If the system is not overloaded, the kernel will be able to schedule the "runqlen.py" process on a CPU without competition, and you will see a length of 0.
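
To illustrate the point about sampling, here is a minimal Python sketch of how per-sample queue lengths could turn into the histogram shown at the top of this thread. It is not the tool's code: the function names `queue_lengths` and `render` are hypothetical, the sample data is made up, and the sketch assumes the queue length excludes the task that is currently on the CPU.

```python
from collections import Counter

HEADER = "     runqlen       : count     distribution"

def queue_lengths(nr_running_samples):
    # Assumption: the reported length does not count the task that is
    # currently running, so one runnable task means a queue length of 0.
    return [max(n - 1, 0) for n in nr_running_samples]

def render(lengths, width=40):
    # Build a text histogram in the same general shape as the tool output.
    counts = Counter(lengths)
    if not counts:
        return HEADER
    peak = max(counts.values())
    lines = [HEADER]
    for length in sorted(counts):
        stars = "*" * (counts[length] * width // peak)
        lines.append(f"{length:>9}          : {counts[length]:<8} |{stars:<{width}}|")
    return "\n".join(lines)

# One second of samples at 99 Hz (made-up data).
idle = [1] * 99                 # only the sampled task is runnable
busy = [1] * 33 + [3] * 66      # two extra runnable tasks in most samples

if __name__ == "__main__":
    print(render(queue_lengths(idle)))
    print(render(queue_lengths(busy)))
```

With the idle data every sample falls in the 0 bucket, which matches the all-zero output reported above; only samples taken while other tasks are waiting on the same CPU populate higher buckets.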

joelagnel commented 6 years ago

Thanks for trying it out. I am not fully sure why I was seeing this. For now I am developing on a 4.15 kernel and it's showing the expected results; if I see the problem on our older product kernels, I will report/fix it. Also, the system was overloaded when I ran into the issue: I was running make in bcc with multiple threads on a 4-core system and it showed only 0 as the run queue length. I will close this for now and reopen if needed.

yonghong-song commented 6 years ago

Tried again on 4.14-rc5. The same workload, although light, produced non-zero runqlen on 4.15-rc7, but zero runqlen on 4.14. So it does look suspicious.

joelagnel commented 6 years ago

Can we not use rq->nr_running? It has the following benefits:

  1. It tracks not only the CFS but also the RT run queue length, since it's the sum of all queued tasks
  2. It should be simpler to use (unlike cfs_rq), since in every kernel version I checked nr_running follows right after rq->lock:

    struct rq {
    /* runqueue lock: */
    raw_spinlock_t lock;
    
    /*
     * nr_running and cpu_load should be in the same cacheline because
     * remote CPUs use both these fields when doing load calculation.
     */
    unsigned int nr_running;

yonghong-song commented 6 years ago

Now I know what the issue was with my earlier suspicious 4.14 experiments. It is due to the randomized task_struct structure in 4.14, which is not reflected in the bpf program. The following hack can solve the issue:

diff --git a/tools/runqlen.py b/tools/runqlen.py
index e8430ca..5559297 100755
--- a/tools/runqlen.py
+++ b/tools/runqlen.py
@@ -79,6 +79,8 @@ frequency = 99
 def check_runnable_weight_field():
     # Define the bpf program for checking purpose
     bpf_check_text = """
+#define randomized_struct_fields_start  struct {
+#define randomized_struct_fields_end    };
 #include <linux/sched.h>
 unsigned long dummy(struct sched_entity *entity)
 {
@@ -108,6 +110,8 @@ unsigned long dummy(struct sched_entity *entity)
     dup(old_stderr)
     close(old_stderr)

+    print(success_compile)
+
     # remove the temporary file and return
     unlink(tmp_file.name)
     return success_compile
@@ -116,6 +120,8 @@ unsigned long dummy(struct sched_entity *entity)
 # define BPF program
 bpf_text = """
 #include <uapi/linux/ptrace.h>
+#define randomized_struct_fields_start  struct {
+#define randomized_struct_fields_end    };
 #include <linux/sched.h>

 // Declare enough of cfs_rq to find nr_running, since we can't #import the

Now the result becomes consistent with 4.15.
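
One plausible way such a wrapper can shift field offsets, offered as an assumption rather than something confirmed in this thread: if the kernel build wraps a run of fields in an anonymous `struct { ... }` while the BPF program's compiler sees the same fields unwrapped, alignment padding in front of the inner struct can move every field after it. The Python ctypes sketch below uses made-up field names (`state`, `flag`, `counter`), not the real task_struct layout.

```python
import ctypes

class Flat(ctypes.Structure):
    # Fields placed directly in the outer struct, as if the wrapper
    # macros expanded to nothing.
    _fields_ = [
        ("state", ctypes.c_int),
        ("flag", ctypes.c_char),
        ("counter", ctypes.c_longlong),
    ]

class Inner(ctypes.Structure):
    # The same trailing fields, grouped as they would be inside `struct { ... }`.
    _fields_ = [
        ("flag", ctypes.c_char),
        ("counter", ctypes.c_longlong),
    ]

class Wrapped(ctypes.Structure):
    # Outer struct with the trailing fields wrapped in a nested struct.
    _fields_ = [
        ("state", ctypes.c_int),
        ("inner", Inner),
    ]

def offsets():
    # Return byte offsets of flag/counter in both layouts.
    flat = {"flag": Flat.flag.offset, "counter": Flat.counter.offset}
    base = Wrapped.inner.offset
    wrapped = {
        "flag": base + Inner.flag.offset,
        "counter": base + Inner.counter.offset,
    }
    return flat, wrapped

if __name__ == "__main__":
    flat, wrapped = offsets()
    print("flat   :", flat)
    print("wrapped:", wrapped)
```

On platforms where a 64-bit integer is 8-byte aligned, the nested struct starts on an 8-byte boundary, so `flag` and `counter` sit at different offsets in the two layouts. A reader that assumes the wrong layout would then fetch the wrong bytes, which would be consistent with the bogus zero readings discussed above.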

Regarding whether we should change the examination point for nr_running: I assume you refer to kernel/sched/sched.h:

/* CFS-related fields in a runqueue */
struct cfs_rq {
   ......
#ifdef CONFIG_FAIR_GROUP_SCHED
        struct rq *rq;  /* cpu runqueue to which this cfs_rq is attached */
......
}

Let me discuss with some scheduler experts to see which is the better sampling place. Thanks for bringing up the suggestions!

josefbacik commented 6 years ago

rq->nr_running is more what you want, since cgroups make cfs_rq->nr_running a lot more interesting.
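
The difference between the two counters can be sketched with a toy Python model. This is an illustration only: the classes `CfsRq` and `Rq` and their attributes are hypothetical stand-ins, and the model assumes that a cfs_rq counts the entities queued directly on it (a task group appears as a single entity), while the per-CPU rq counts every runnable task, including RT tasks.

```python
from dataclasses import dataclass, field

@dataclass
class CfsRq:
    tasks: int = 0                               # tasks queued directly on this cfs_rq
    groups: list = field(default_factory=list)   # child group cfs_rqs

    @property
    def nr_running(self):
        # Entities queued here: each task, plus one entity per child
        # group that has something runnable.
        return self.tasks + sum(1 for g in self.groups if g.h_nr_running > 0)

    @property
    def h_nr_running(self):
        # Hierarchical count: all tasks at this level and below.
        return self.tasks + sum(g.h_nr_running for g in self.groups)

@dataclass
class Rq:
    cfs: CfsRq
    rt_tasks: int = 0

    @property
    def nr_running(self):
        # All runnable tasks on this CPU: CFS (through the hierarchy) plus RT.
        return self.cfs.h_nr_running + self.rt_tasks

# One task in the root group, one cgroup holding four tasks, one RT task.
cgroup = CfsRq(tasks=4)
root = CfsRq(tasks=1, groups=[cgroup])
cpu = Rq(cfs=root, rt_tasks=1)

if __name__ == "__main__":
    print("root cfs_rq nr_running:", root.nr_running)
    print("rq nr_running:", cpu.nr_running)
```

In this model the root cfs_rq reports 2 (one task plus one group entity) while the rq reports 6, so sampling only the root cfs_rq would under-report the load once tasks are placed in cgroups.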