oseiberts11 opened this issue 4 years ago (Open)
Which kernel are you using? maxactive may not be supported in earlier kernels. You can try /sys/kernel/debug/tracing/kprobe_events to see whether it is really supported or not.
For your second question and link, you are referring to the internal kernel implementation regarding "have code of your own at the entry". bpf k[ret]probe support should guarantee the kretprobe execution if one bpf program is not running stacked on top of another bpf program.
Maybe you can share your whole program here so people can help reproduce and investigate.
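Independent of bcc, one quick way to check is to try registering a kretprobe through the tracefs text interface using the `r<MAXACTIVE>:<EVENT> <SYMBOL>` syntax from Documentation/trace/kprobetrace (accepted since roughly kernel 4.12). A sketch, to be run as root; the event name `maxactive_test` is made up:

```shell
# Build a kretprobe registration with maxactive=32. If the kernel
# rejects the r<maxactive> form, the write below fails.
MAXACTIVE=32
EVENT=maxactive_test            # hypothetical event name
SYM=fuse_file_write_iter
PROBE="r${MAXACTIVE}:${EVENT} ${SYM}"

TRACEFS=/sys/kernel/debug/tracing
if [ -w "$TRACEFS/kprobe_events" ]; then
    echo "$PROBE" >> "$TRACEFS/kprobe_events" && echo supported
    echo "-:${EVENT}" >> "$TRACEFS/kprobe_events"   # remove the test probe
else
    echo "run as root with tracefs mounted" >&2
fi
```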
Hi @yonghong-song, thanks for your reply.
I first added the maxactive parameter in the hope that it would help. The kernel version is (from uname -a) Linux cbk130619 4.15.0-70-generic #79~16.04.1-Ubuntu SMP Tue Nov 12 14:01:10 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Going by the version number 4.15, this would indicate that the maxactive parameter should work, right? But on the other hand, I found some code in src/cc/libbpf.c which tries to check whether maxactive is supported: it checks the event names that appear in /sys/kernel/debug/tracing/%s_events. If I check those afterwards, the names with the maxactive value are not there. So that seems to indicate that this kernel doesn't support it.
But when I look at the data from /sys/kernel/debug/tracing/kprobe_profile that I show above, it tells me that worrying about maxactive is unneeded, since the counts of how often a kretprobe was missed are all 0.
What you say about "bpf program running not stack on top of another bpf program" does sound possibly interesting, though. Looking at /sys/kernel/debug/tracing/kprobe_profile and in /sys/kernel/debug/tracing/events/kprobes/ shows me many bcc probes from processes that have terminated already, but the probes still show up there. I suspect that these were instances of the same program that were interrupted and didn't clean up. I don't know which resources are still really present and affect the current probes.
Is there an easy way to clean up those probes?
As for showing the program, I'll ask around. Basically it is an extended version of bcc/tools/xfsslower.py and ext4slower.py, extended in the same way for fuse.
I am attaching my program: prometheus-ebpf-exporter.py.txt
Like I said, it is based on ext4slower.py etc., adding code for FUSE file systems. It also adds timing of async I/O requests (but the problem is only with the sync I/O requests at the moment), histograms, and a Prometheus exporter to show it all to the outside.
Here is a graph that shows how the exported value ebpf_fuse_req_sync_outstanding increases a lot at some point and doesn't return to 0. This value corresponds to activerequests element GAUGE_SYNC. It is modified by calling inc_gauge(GAUGE_SYNC) and dec_gauge(GAUGE_SYNC).
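The failure mode can be modeled in plain Python (a sketch; `GAUGE_SYNC`, `inc_gauge`, and `dec_gauge` are the names from the script above, but the implementation here is simplified): every entry without a matching return leaves the gauge permanently elevated.

```python
from collections import defaultdict

GAUGE_SYNC = 0          # index into the gauge table, as in the script above
gauges = defaultdict(int)

def inc_gauge(idx):     # called from the kprobe (function entry)
    gauges[idx] += 1

def dec_gauge(idx):     # called from the kretprobe (function return)
    gauges[idx] -= 1

# 5 writes enter, but one kretprobe never fires:
for _ in range(5):
    inc_gauge(GAUGE_SYNC)
for _ in range(4):
    dec_gauge(GAUGE_SYNC)

print(gauges[GAUGE_SYNC])   # 1 request appears "outstanding" forever
```

Because nothing ever decrements the leaked count, the exported gauge can only drift upward over time, which matches the graph.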
It could of course really be true that several hundred I/O requests from that time never completed, but it seems very unlikely. The problem occurs on several machines, but not all. I have not found any particular event that seems to be related, just that it seems to happen when there is more-than-average I/O going on. But even that doesn't always trigger it.
I cleaned up all extra reported kprobes, then I observed it over the weekend. The problem seemed at first not to occur again, but then it happened again on one host this morning, a difference of 70 invocations between the kprobe and the corresponding kretprobe:
```
r_fuse_file_write_iter_bcc_39634 1982721 0
p_fuse_file_write_iter_bcc_39634 1982791 0
```
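For what it's worth, the imbalance can be computed directly from kprobe_profile. A sketch using the two sample lines above; on a real system one would read the tracefs file instead:

```python
# Compute the kprobe/kretprobe hit difference from kprobe_profile-style
# lines (format: event-name  nhit  nmissed). Sample data copied from the
# counts above; normally read /sys/kernel/debug/tracing/kprobe_profile.
sample = """\
r_fuse_file_write_iter_bcc_39634 1982721 0
p_fuse_file_write_iter_bcc_39634 1982791 0
"""

hits = {}
for line in sample.splitlines():
    name, nhit, nmissed = line.split()
    hits[name] = int(nhit)

diff = (hits["p_fuse_file_write_iter_bcc_39634"]
        - hits["r_fuse_file_write_iter_bcc_39634"])
print(diff)  # 70 entries without a matching return, yet nmissed stays 0
```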
I used this little helper script to clean up:
```shell
pids=$(pgrep -d"|" -f /usr/local/sbin/prometheus-ebpf-exporter)
FILE=/sys/kernel/debug/tracing/kprobe_events
grep _bcc_ $FILE | egrep -v "_bcc_($pids) " | sed -e "s/^[rp]/-/" -e "s/ .*//" >> $FILE
```
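To illustrate what that pipeline produces, here it is applied to two made-up kprobe_events lines instead of the live file (PIDs and event names invented; only PID 100 is still alive, so the probe left behind by dead PID 300 is rewritten into a `-:<name>` removal command):

```shell
pids="100|200"   # hypothetical PIDs of still-running exporter processes
sample='p:kprobes/p_fuse_file_write_iter_bcc_100 fuse_file_write_iter
r:kprobes/r_fuse_file_write_iter_bcc_300 fuse_file_write_iter'

# Keep bcc-created probes, drop those owned by live PIDs, and turn the
# leftovers into "-:<name>" lines that delete the probe when written back.
OUT=$(printf '%s\n' "$sample" | grep _bcc_ | egrep -v "_bcc_($pids) " \
      | sed -e "s/^[rp]/-/" -e "s/ .*//")
echo "$OUT"
```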
The following is a hypothesis based on looking at the linux/arch/x86/kernel/kprobes/core.c file. For kprobes, https://github.com/torvalds/linux/blob/master/arch/x86/kernel/kprobes/core.c#L681-L699:
```c
	if (kprobe_running()) {
		if (reenter_kprobe(p, regs, kcb))
			return 1;
	} else {
		set_current_kprobe(p, regs, kcb);
		kcb->kprobe_status = KPROBE_HIT_ACTIVE;
		/*
		 * If we have no pre-handler or it returned 0, we
		 * continue with normal processing.  If we have a
		 * pre-handler and it returned non-zero, that means
		 * user handler setup registers to exit to another
		 * instruction, we must skip the single stepping.
		 */
		if (!p->pre_handler || !p->pre_handler(p, regs))
			setup_singlestep(p, regs, kcb, 0);
		else
			reset_current_kprobe();
		return 1;
	}
```
For kretprobe, https://github.com/torvalds/linux/blob/master/arch/x86/kernel/kprobes/core.c#L847-L872
```c
	correct_ret_addr = ri->ret_addr;
	hlist_for_each_entry_safe(ri, tmp, head, hlist) {
		if (ri->task != current)
			/* another task is sharing our hash bucket */
			continue;
		if (ri->fp != frame_pointer)
			continue;

		orig_ret_address = (unsigned long)ri->ret_addr;
		if (ri->rp && ri->rp->handler) {
			__this_cpu_write(current_kprobe, &ri->rp->kp);
			ri->ret_addr = correct_ret_addr;
			ri->rp->handler(ri, regs);
			__this_cpu_write(current_kprobe, &kretprobe_kprobe);
		}

		recycle_rp_inst(ri, &empty_rp);

		if (orig_ret_address != trampoline_address)
			/*
			 * This is the real return address. Any other
			 * instances associated with this task are for
			 * other calls deeper on the call stack
			 */
			break;
	}
```
There are a few more conditions that may cause the kretprobe to be skipped, and these conditions might be triggered in your case. Unfortunately, to really prove this, we would need to build a custom kernel.
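As a userspace illustration of the kretprobe loop above (a simplified sketch, not kernel code): a return instance's handler only fires when both the owning task and the saved frame pointer match, so for example a stale instance with a mismatched frame pointer is skipped silently, without any nmissed accounting.

```python
def run_ret_handlers(instances, current_task, frame_pointer, trampoline):
    """Model of the hlist walk in the kernel's trampoline handler."""
    fired = []
    for ri in instances:
        if ri["task"] != current_task:
            continue            # another task shares the hash bucket
        if ri["fp"] != frame_pointer:
            continue            # frame pointer mismatch: skipped silently
        if ri["handler"]:
            fired.append(ri["name"])
        if ri["ret_addr"] != trampoline:
            break               # real return address found, stop walking
    return fired

# One instance with a stale frame pointer, one that matches:
insts = [
    {"task": 1, "fp": 200, "handler": True, "name": "stale", "ret_addr": 1},
    {"task": 1, "fp": 100, "handler": True, "name": "live",  "ret_addr": 1},
]
fired = run_ret_handlers(insts, current_task=1, frame_pointer=100, trampoline=99)
print(fired)  # ['live'] -- the "stale" handler never ran
```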
@yonghong-song That is an interesting observation. Too bad that I also can't see a way to experiment with this easily. Of course, the problem doesn't happen reliably, so this would be very difficult to reproduce in a test environment.
I have a BCC / Python script that aims to measure the time taken by FUSE I/O requests. For this purpose it sets a kprobe and a kretprobe on (among others, but this is the one I have trouble with) fuse_file_write_iter. I set the kretprobe before the kprobe. However, it seems that the kretprobe is called significantly less often than the kprobe, which causes my code (which counts enters and exits) to think that some I/O requests are ongoing forever.
I see this from the counts in /sys/kernel/debug/tracing/kprobe_profile: as you can see, the number of caught returns is 338 lower than the number of enters, but nmissed is 0. So the fact that events are lost is lost too. Something is very unreliable here. What?
As a second question: I understand from reading about the implementation of kretprobes that a kretprobe is actually a special kprobe which causes the return from the function to trigger some code as well. You can even have code of your own at the entry, as I understand from
https://www.kernel.org/doc/Documentation/kprobes.txt
(section on the kretprobe entry-handler). But it seems that this facility is not available from python-bcc. It would be excellently suited to ensure that entries and exits from a function balance properly. By splitting that up into two probes, the guarantee is lost, and it is no doubt less efficient.