iovisor / bcc

BCC - Tools for BPF-based Linux IO analysis, networking, monitoring, and more
Apache License 2.0

Kernel stack traces #4870

Closed Anjali05 closed 7 months ago

Anjali05 commented 8 months ago

Hi,

I am working on analyzing the kernel stack traces that lead to a lock acquisition. The goal is to dynamically profile a workload and see which locks are acquired during its execution. I have a modified klockstat script that outputs lock names along with lock addresses and stack traces. The stack traces for some locks are quite accurate, but for others the output seems inaccurate or incomplete. Is there any reason that would lead to wrong stack traces? Here is an example of such a trace.

"locks": [ "freezer_mutex" ], "kernel_stack": [ "mutex_lock_nested+0x5 [kernel]", "cgroup_post_fork+0x151 [kernel]", "copy_process+0x15af [kernel]", "kernel_clone+0x9d [kernel]", "do_sys_clone+0x66 [kernel]", "x64_sys_clone+0x25 [kernel]", "do_syscall_64+0x38 [kernel]", "entry_SYSCALL_64_after_hwframe+0x63 [kernel]", "" ], "addresses": [ "0xffffffff81d40055", "0xffffffff811a6011", "0xffffffff810aa53f", "0xffffffff810aac4d", "0xffffffff810ab156", "0xffffffff810ab1a5", "0xffffffff81d2d148", "0xffffffff81e0009b" ] },

I have also tried manually searching for the missing function calls (if any) made by the last function in the trace (cgroup_post_fork in this example) before mutex_lock_nested, but in vain.

Any help/insight into this would be greatly appreciated! Thanks!

yonghong-song commented 8 months ago

Some functions may be missing due to inlining. Maybe you want to check that?

Anjali05 commented 8 months ago

@yonghong-song I understand inlining might do that, and it is happening for some traces. I have tried following the function call graph from the last function (cgroup_post_fork) for this particular example, but it does not lead to freezer_mutex. Is it possible that stack traces are somehow overwritten when two stacks have the same id? Say lock A has its stack stored in some bucket of the hashmap, and then another lock B hashes to the same bucket and overwrites the stack. Could it be that when my script reads the stack for lock A, it actually gets the stack for lock B and not A? I might be wrong, but I want to understand more about how get_stackid works. Thanks in advance!

yonghong-song commented 8 months ago

I checked the code. The reuse of stack ids is disabled, so a stack should not be overwritten by another one. But currently for kfunc the skip is 3, and there were some bug fixes related to skip: https://github.com/torvalds/linux/commit/ee2a098851bfbe8bcdd964c0121f4246f00ff41e. I am not sure whether that is relevant or not. Maybe you want to try the following change to see whether you get a meaningful stack trace?

diff --git a/tools/klockstat.py b/tools/klockstat.py
index 9599bfd3..b3652948 100755
--- a/tools/klockstat.py
+++ b/tools/klockstat.py
@@ -362,7 +362,7 @@ KRETFUNC_PROBE(mutex_lock, void *lock, int ret)

 KFUNC_PROBE(mutex_lock, void *lock)
 {
-    return do_mutex_lock_enter(ctx, 3);
+    return do_mutex_lock_enter(ctx, 0);
 }

 """
@@ -380,7 +380,7 @@ KRETFUNC_PROBE(mutex_lock_nested, void *lock, int ret)

 KFUNC_PROBE(mutex_lock_nested, void *lock)
 {
-    return do_mutex_lock_enter(ctx, 3);
+    return do_mutex_lock_enter(ctx, 0);
 }
Anjali05 commented 8 months ago

@yonghong-song Yeah, I have tried that, but it just adds three extra frames which are not useful:

  "locks": [
      "freezer_mutex"
  ],
  "kernel_stack": [
      "bpf_prog_6deef7357e7b4530+0xa030 [bpf]",
      "bpf_prog_6deef7357e7b4530+0xa030 [bpf]",
      "coretemp_cpu_online.cold+0x52d4 [coretemp]",
      "mutex_lock_nested+0x5 [kernel]",
      "cgroup_post_fork+0x151 [kernel]",
      "copy_process+0x15af [kernel]",
      "kernel_clone+0x9d [kernel]",
      "__do_sys_clone+0x66 [kernel]",
      "__x64_sys_clone+0x25 [kernel]",
      "do_syscall_64+0x38 [kernel]",
      "entry_SYSCALL_64_after_hwframe+0x63 [kernel]",
      ""
  ],
  "addresses": [
      "0xffffffffa000e850",
      "0xffffffffa000e850",
      "0xffffffffa0633052",
      "0xffffffff81d40055",
      "0xffffffff811a6011",
      "0xffffffff810aa53f",
      "0xffffffff810aac4d",
      "0xffffffff810ab156",
      "0xffffffff810ab1a5",
      "0xffffffff81d2d148",
      "0xffffffff81e0009b"
  ]

I will try to debug more. Apart from inlining, are there any other compiler optimizations you are aware of that limit the stack traces? I know Linux is a large code base with various compile-time optimizations; I just want to check whether I am missing something obvious here.

yonghong-song commented 8 months ago

@yonghong-song Yeah, I have tried that, but it just adds three extra frames which are not useful:

Okay, just drop the change. Yes, it is not useful.

  "locks": [
      "freezer_mutex"
  ],
  "kernel_stack": [
      "bpf_prog_6deef7357e7b4530+0xa030 [bpf]",
      "bpf_prog_6deef7357e7b4530+0xa030 [bpf]",
      "coretemp_cpu_online.cold+0x52d4 [coretemp]",
      "mutex_lock_nested+0x5 [kernel]",
      "cgroup_post_fork+0x151 [kernel]",
      "copy_process+0x15af [kernel]",
      "kernel_clone+0x9d [kernel]",
      "__do_sys_clone+0x66 [kernel]",
      "__x64_sys_clone+0x25 [kernel]",
      "do_syscall_64+0x38 [kernel]",
      "entry_SYSCALL_64_after_hwframe+0x63 [kernel]",
      ""
  ],
  "addresses": [
      "0xffffffffa000e850",
      "0xffffffffa000e850",
      "0xffffffffa0633052",
      "0xffffffff81d40055",
      "0xffffffff811a6011",
      "0xffffffff810aa53f",
      "0xffffffff810aac4d",
      "0xffffffff810ab156",
      "0xffffffff810ab1a5",
      "0xffffffff81d2d148",
      "0xffffffff81e0009b"
  ]

I will try to debug more. Apart from inlining, are there any other compiler optimizations you are aware of that limit the stack traces? I know Linux is a large code base with various compile-time optimizations; I just want to check whether I am missing something obvious here.

Not sure. Typically the kernel has two modes for stack unwinding, CONFIG_UNWINDER_ORC and CONFIG_UNWINDER_FRAME_POINTER. They should have the same functionality, but in case you are using CONFIG_UNWINDER_ORC, you might want to try CONFIG_UNWINDER_FRAME_POINTER.

Anjali05 commented 8 months ago

@yonghong-song I am using CONFIG_UNWINDER_FRAME_POINTER. I will try building the kernel with the -O2 optimization flag passed to GCC [update: I realized that the kernel is built with -O2 by default].

Anjali05 commented 8 months ago

@yonghong-song I did a little debugging with gdb, and here is what I got (gdb output attached as a screenshot):

So there is an indirect call from cgroup_post_fork to freezer_fork that grabs the freezer_mutex lock. Do you have any thoughts on why freezer_fork doesn't show up in the traces that eBPF generates? It's not making much sense to me. I appreciate any help you can provide. Thanks!

yonghong-song commented 8 months ago

I also tried sudo klockstat.py -s 20 and checked a few stacks, e.g.:

  b'entry_SYSCALL_64_after_hwframe+0x72'
                 b'btrfs_sync_log+0x637'        765      1        765        765
                b'btrfs_sync_file+0x427'
            b'btrfs_do_write_iter+0x31f'
                b'__x64_sys_write+0x26c'
                   b'do_syscall_64+0x45'

So we have __x64_sys_write -> btrfs_do_write_iter as an indirect call. But in reality it should be __x64_sys_write -> btrfs_file_write_iter -> btrfs_do_write_iter. Since btrfs_file_write_iter -> btrfs_do_write_iter is a tail call, the missing frame is expected, so we are fine here.

I collected some other traces with trace.py:

3135181 3135274 ThriftIO8       __tcp_transmit_skb 
        __tcp_transmit_skb+0x1 [kernel]
        tcp_rcv_established+0xf8a [kernel]
        tcp_v6_do_rcv+0xfc [kernel]
        tcp_v6_rcv+0xf0e [kernel]
        ip6_input+0x2e5 [kernel]
        ipv6_rcv+0x4d [kernel]
        process_backlog+0xf7 [kernel]
        net_rx_action+0x12d [kernel]
        __do_softirq+0xd3 [kernel]
        __local_bh_enable_ip+0xb0 [kernel]
        __dev_queue_xmit+0x26c [kernel]
        ip6_output+0x5f0 [kernel]
        ip6_xmit+0x3ad [kernel]
        inet6_csk_xmit+0x224 [kernel]
        __tcp_transmit_skb+0x76e [kernel]
        __tcp_push_pending_frames+0x2da [kernel]
        tcp_sendmsg_locked+0xd8b [kernel]
        tcp_sendmsg+0x2d [kernel]
        __x64_sys_sendmsg+0x2c5 [kernel]
        do_syscall_64+0x45 [kernel]
        entry_SYSCALL_64_after_hwframe+0x72 [kernel]

which has a few callback functions, and the stack looks good to me. BTW, my system is using CONFIG_UNWINDER_ORC.

If you can create a reproducible case, for example, triggering a mutex lock in freezer_fork, I can help to do some investigation.

yonghong-song commented 8 months ago

Okay, I did some analysis of freezer_fork() with a 6.4 kernel:

ffffffff811caac0 <freezer_fork>:        
; {
ffffffff811caac0: f3 0f 1e fa           endbr64
ffffffff811caac4: 0f 1f 44 00 00        nopl    (%rax,%rax)
ffffffff811caac9: 53                    pushq   %rbx
;       return task_css_check(task, subsys_id, true) ==
ffffffff811caaca: 48 8b 87 70 0a 00 00  movq    0xa70(%rdi), %rax
ffffffff811caad1: 48 8b 40 30           movq    0x30(%rax), %rax
ffffffff811caad5: 48 3b 05 5c 88 5a 02  cmpq    0x25a885c(%rip), %rax   # 0xffffffff83773338 <init_css_set+0x30>
;       if (task_css_is_root(task, freezer_cgrp_id))
ffffffff811caadc: 75 06                 jne     0xffffffff811caae4 <freezer_fork+0x24>
; }
ffffffff811caade: 5b                    popq    %rbx
ffffffff811caadf: e9 9c cb d8 00        jmp     0xffffffff81f57680 <__x86_return_thunk>
ffffffff811caae4: 48 89 fb              movq    %rdi, %rbx
;       mutex_lock(&freezer_mutex);     
ffffffff811caae7: 48 c7 c7 10 52 77 83  movq    $-0x7c88adf0, %rdi      # imm = 0x83775210
ffffffff811caaee: e8 ed 56 d8 00        callq   0xffffffff81f501e0 <mutex_lock>
;       return task_css_check(task, subsys_id, false);
ffffffff811caaf3: 48 8b 83 70 0a 00 00  movq    0xa70(%rbx), %rax
ffffffff811caafa: 48 8b 40 30           movq    0x30(%rax), %rax
;       if (freezer->state & CGROUP_FREEZING)
ffffffff811caafe: f6 80 c8 00 00 00 06  testb   $0x6, 0xc8(%rax)
ffffffff811cab05: 74 08                 je      0xffffffff811cab0f <freezer_fork+0x4f>
;               freeze_task(task);
ffffffff811cab07: 48 89 df              movq    %rbx, %rdi
ffffffff811cab0a: e8 f1 90 31 00        callq   0xffffffff814e3c00 <freeze_task>
;       mutex_unlock(&freezer_mutex);   
ffffffff811cab0f: 48 c7 c7 10 52 77 83  movq    $-0x7c88adf0, %rdi      # imm = 0x83775210
ffffffff811cab16: 5b                    popq    %rbx
ffffffff811cab17: e9 24 57 d8 00        jmp     0xffffffff81f50240 <mutex_unlock>
ffffffff811cab1c: cc                    int3
ffffffff811cab1d: cc                    int3
ffffffff811cab1e: cc                    int3
ffffffff811cab1f: cc                    int3

In my code, I have mutex_lock and you have mutex_lock_nested. I guess you have CONFIG_DEBUG_LOCK_ALLOC enabled.

Anyway, from the code below:

ffffffff811caac0 <freezer_fork>:        
; {
ffffffff811caac0: f3 0f 1e fa           endbr64
ffffffff811caac4: 0f 1f 44 00 00        nopl    (%rax,%rax)
ffffffff811caac9: 53                    pushq   %rbx
;       return task_css_check(task, subsys_id, true) ==
ffffffff811caaca: 48 8b 87 70 0a 00 00  movq    0xa70(%rdi), %rax
ffffffff811caad1: 48 8b 40 30           movq    0x30(%rax), %rax
ffffffff811caad5: 48 3b 05 5c 88 5a 02  cmpq    0x25a885c(%rip), %rax   # 0xffffffff83773338 <init_css_set+0x30>
;       if (task_css_is_root(task, freezer_cgrp_id))
ffffffff811caadc: 75 06                 jne     0xffffffff811caae4 <freezer_fork+0x24>
; }     
ffffffff811caade: 5b                    popq    %rbx
ffffffff811caadf: e9 9c cb d8 00        jmp     0xffffffff81f57680 <__x86_return_thunk>
ffffffff811caae4: 48 89 fb              movq    %rdi, %rbx
;       mutex_lock(&freezer_mutex);
ffffffff811caae7: 48 c7 c7 10 52 77 83  movq    $-0x7c88adf0, %rdi      # imm = 0x83775210
ffffffff811caaee: e8 ed 56 d8 00        callq   0xffffffff81f501e0 <mutex_lock>

You can see that we have pushq %rbx and then popq %rbx before calling mutex_lock. This is essentially a tail call from freezer_fork to mutex_lock. That is why you only see mutex_lock and do not see freezer_fork.

Anjali05 commented 8 months ago

@yonghong-song Do you know of any way I can remove this optimization while building the kernel?

yonghong-song commented 7 months ago

The kernel needs to be rebuilt with additional flags like -fno-optimize-sibling-calls; basically, the compilation needs to disable tail-call (sibling-call) optimization. I didn't try the flag myself, though.
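For reference, one way to pass such a flag is through the kernel's KCFLAGS variable, which appends to the build's CFLAGS; this invocation is a sketch and, as noted above, untested:

```shell
# Rebuild the kernel with sibling-call optimization disabled so that
# tail-called frames (like freezer_fork above) remain on the stack.
make KCFLAGS=-fno-optimize-sibling-calls -j"$(nproc)"
```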

Anjali05 commented 7 months ago

@yonghong-song Thanks! I will give it a try.

Anjali05 commented 7 months ago

The kernel needs to be rebuilt with additional flags like -fno-optimize-sibling-calls; basically, the compilation needs to disable tail-call (sibling-call) optimization. I didn't try the flag myself, though.

@yonghong-song I did check this and it was already enabled when I built the kernel. Not sure what else to try.