oseiberts11 opened this issue 4 years ago (Open)
Which kernel are you using? maxactive may not be supported in earlier kernels. You can try /sys/kernel/debug/tracing/kprobe_events to see whether it is really supported or not.
For your second question and link, you are referring to the internal kernel implementation regarding "have code of your own at the entry". bpf k[ret]probe support should guarantee the kretprobe execution if one bpf program is not running stacked on top of another bpf program.
Maybe you can share your whole program here so people can help reproduce and investigate.
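Independent of bcc, one quick way to check is to try registering a kretprobe through the tracefs text interface using the `r<MAXACTIVE>:<EVENT> <SYMBOL>` syntax from Documentation/trace/kprobetrace (accepted since roughly kernel 4.12). A sketch, to be run as root; the event name `maxactive_test` is made up:

```shell
# Build a kretprobe registration with maxactive=32. If the kernel
# rejects the r<maxactive> form, the write below fails.
MAXACTIVE=32
EVENT=maxactive_test            # hypothetical event name
SYM=fuse_file_write_iter
PROBE="r${MAXACTIVE}:${EVENT} ${SYM}"

TRACEFS=/sys/kernel/debug/tracing
if [ -w "$TRACEFS/kprobe_events" ]; then
    echo "$PROBE" >> "$TRACEFS/kprobe_events" && echo supported
    echo "-:${EVENT}" >> "$TRACEFS/kprobe_events"   # remove the test probe
else
    echo "run as root with tracefs mounted" >&2
fi
```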
Hi @yonghong-song, thanks for your reply.
I first added the maxactive parameter in the hope that it would help. The kernel version is (from uname -a) Linux cbk130619 4.15.0-70-generic #79~16.04.1-Ubuntu SMP Tue Nov 12 14:01:10 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Going by the version number 4.15, this would indicate that the maxactive parameter should work, right? But on the other hand, I found some code in src/cc/libbpf.c which tries to check whether maxactive is supported: it checks the event names that appear in /sys/kernel/debug/tracing/%s_events. If I check those afterwards, the names with the maxactive value are not there. So that seems to indicate that this kernel doesn't support it.
But when I look at the data from /sys/kernel/debug/tracing/kprobe_profile that I show above, it tells me that worrying about maxactive is unneeded, since the counts of how often a kretprobe was missed are all 0.
What you say about "bpf program running not stack on top of another bpf program" does sound possibly interesting, though. Looking at /sys/kernel/debug/tracing/kprobe_profile and in /sys/kernel/debug/tracing/events/kprobes/ shows me many bcc probes from processes that have terminated already, but the probes still show up there. I suspect that these were instances of the same program that were interrupted and didn't clean up. I don't know which resources are still really present and affect the current probes.
Is there an easy way to clean up those probes?
As for showing the program, I'll ask around. Basically it is an extended version of bcc/tools/xfsslower.py and ext4slower.py, extended in the same way for fuse.
I am attaching my program: prometheus-ebpf-exporter.py.txt
Like I said, it is based on ext4slower.py etc., adding code for FUSE file systems. It also adds timing of async I/O requests (but the problem is only with the sync I/O requests at the moment), histograms, and a Prometheus exporter to show it all to the outside.
Here is a graph that shows how the exported value ebpf_fuse_req_sync_outstanding increases a lot at some point and doesn't return to 0. This value corresponds to activerequests element GAUGE_SYNC. It is modified by calling inc_gauge(GAUGE_SYNC) and dec_gauge(GAUGE_SYNC).
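The failure mode can be modeled in plain Python (a sketch; `GAUGE_SYNC`, `inc_gauge`, and `dec_gauge` are the names from the script above, but the implementation here is simplified): every entry without a matching return leaves the gauge permanently elevated.

```python
from collections import defaultdict

GAUGE_SYNC = 0          # index into the gauge table, as in the script above
gauges = defaultdict(int)

def inc_gauge(idx):     # called from the kprobe (function entry)
    gauges[idx] += 1

def dec_gauge(idx):     # called from the kretprobe (function return)
    gauges[idx] -= 1

# 5 writes enter, but one kretprobe never fires:
for _ in range(5):
    inc_gauge(GAUGE_SYNC)
for _ in range(4):
    dec_gauge(GAUGE_SYNC)

print(gauges[GAUGE_SYNC])   # 1 request appears "outstanding" forever
```

Because nothing ever decrements the leaked count, the exported gauge can only drift upward over time, which matches the graph.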
It could of course really be true that several hundred I/O requests from that time never completed, but it seems very unlikely. The problem occurs on several machines, but not all. I have not found any particular event that seems to be related, just that it seems to happen when there is more-than-average I/O going on. But even that doesn't always trigger it.
I cleaned up all extra reported kprobes, then I observed it over the weekend. The problem seemed at first not to occur again, but then it happened again on one host this morning, a difference of 70 invocations between the kprobe and the corresponding kretprobe:
```
r_fuse_file_write_iter_bcc_39634 1982721 0
p_fuse_file_write_iter_bcc_39634 1982791 0
```
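For what it's worth, the imbalance can be computed directly from kprobe_profile. A sketch using the two sample lines above; on a real system one would read the tracefs file instead:

```python
# Compute the kprobe/kretprobe hit difference from kprobe_profile-style
# lines (format: event-name  nhit  nmissed). Sample data copied from the
# counts above; normally read /sys/kernel/debug/tracing/kprobe_profile.
sample = """\
r_fuse_file_write_iter_bcc_39634 1982721 0
p_fuse_file_write_iter_bcc_39634 1982791 0
"""

hits = {}
for line in sample.splitlines():
    name, nhit, nmissed = line.split()
    hits[name] = int(nhit)

diff = (hits["p_fuse_file_write_iter_bcc_39634"]
        - hits["r_fuse_file_write_iter_bcc_39634"])
print(diff)  # 70 entries without a matching return, yet nmissed stays 0
```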
I used this little helper script to clean up:
```shell
pids=$(pgrep -d"|" -f /usr/local/sbin/prometheus-ebpf-exporter)
FILE=/sys/kernel/debug/tracing/kprobe_events
grep _bcc_ $FILE | egrep -v "_bcc_($pids) " | sed -e "s/^[rp]/-/" -e "s/ .*//" >> $FILE
```
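To illustrate what that pipeline produces, here it is applied to two made-up kprobe_events lines instead of the live file (PIDs and event names invented; only PID 100 is still alive, so the probe left behind by dead PID 300 is rewritten into a `-:<name>` removal command):

```shell
pids="100|200"   # hypothetical PIDs of still-running exporter processes
sample='p:kprobes/p_fuse_file_write_iter_bcc_100 fuse_file_write_iter
r:kprobes/r_fuse_file_write_iter_bcc_300 fuse_file_write_iter'

# Keep bcc-created probes, drop those owned by live PIDs, and turn the
# leftovers into "-:<name>" lines that delete the probe when written back.
OUT=$(printf '%s\n' "$sample" | grep _bcc_ | egrep -v "_bcc_($pids) " \
      | sed -e "s/^[rp]/-/" -e "s/ .*//")
echo "$OUT"
```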
The following is a hypothesis based on looking at the linux/arch/x86/kernel/kprobes/core.c file. For kprobes, https://github.com/torvalds/linux/blob/master/arch/x86/kernel/kprobes/core.c#L681-L699:
```c
	if (kprobe_running()) {
		if (reenter_kprobe(p, regs, kcb))
			return 1;
	} else {
		set_current_kprobe(p, regs, kcb);
		kcb->kprobe_status = KPROBE_HIT_ACTIVE;
		/*
		 * If we have no pre-handler or it returned 0, we
		 * continue with normal processing.  If we have a
		 * pre-handler and it returned non-zero, that means
		 * user handler setup registers to exit to another
		 * instruction, we must skip the single stepping.
		 */
		if (!p->pre_handler || !p->pre_handler(p, regs))
			setup_singlestep(p, regs, kcb, 0);
		else
			reset_current_kprobe();
		return 1;
	}
```
For kretprobe, https://github.com/torvalds/linux/blob/master/arch/x86/kernel/kprobes/core.c#L847-L872
```c
	correct_ret_addr = ri->ret_addr;
	hlist_for_each_entry_safe(ri, tmp, head, hlist) {
		if (ri->task != current)
			/* another task is sharing our hash bucket */
			continue;
		if (ri->fp != frame_pointer)
			continue;

		orig_ret_address = (unsigned long)ri->ret_addr;
		if (ri->rp && ri->rp->handler) {
			__this_cpu_write(current_kprobe, &ri->rp->kp);
			ri->ret_addr = correct_ret_addr;
			ri->rp->handler(ri, regs);
			__this_cpu_write(current_kprobe, &kretprobe_kprobe);
		}

		recycle_rp_inst(ri, &empty_rp);

		if (orig_ret_address != trampoline_address)
			/*
			 * This is the real return address. Any other
			 * instances associated with this task are for
			 * other calls deeper on the call stack
			 */
			break;
	}
```
There are a few more conditions that may cause the kretprobe to be skipped, and these conditions might be triggered in your case. Unfortunately, to really prove this, we would need to build a custom kernel.
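As a userspace illustration of the kretprobe loop above (a simplified sketch, not kernel code): a return instance's handler only fires when both the owning task and the saved frame pointer match, so for example a stale instance with a mismatched frame pointer is skipped silently, without any nmissed accounting.

```python
def run_ret_handlers(instances, current_task, frame_pointer, trampoline):
    """Model of the hlist walk in the kernel's trampoline handler."""
    fired = []
    for ri in instances:
        if ri["task"] != current_task:
            continue            # another task shares the hash bucket
        if ri["fp"] != frame_pointer:
            continue            # frame pointer mismatch: skipped silently
        if ri["handler"]:
            fired.append(ri["name"])
        if ri["ret_addr"] != trampoline:
            break               # real return address found, stop walking
    return fired

# One instance with a stale frame pointer, one that matches:
insts = [
    {"task": 1, "fp": 200, "handler": True, "name": "stale", "ret_addr": 1},
    {"task": 1, "fp": 100, "handler": True, "name": "live",  "ret_addr": 1},
]
fired = run_ret_handlers(insts, current_task=1, frame_pointer=100, trampoline=99)
print(fired)  # ['live'] -- the "stale" handler never ran
```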
@yonghong-song That is an interesting observation. Too bad that I also can't see a way to experiment with this easily. Of course, the problem doesn't happen reliably, so this would be very difficult to reproduce in a test environment.
I have a BCC / Python script that aims to measure the time taken by FUSE I/O requests. For this purpose it sets a kprobe and a kretprobe on (among others, but this is the one I have trouble with) fuse_file_write_iter. I set the kretprobe before the kprobe. However, it seems that the kretprobe is called significantly less often than the kprobe, which causes my code (which counts enters and exits) to think that some I/O requests are ongoing forever.
I see this from the counts in /sys/kernel/debug/tracing/kprobe_profile: as you can see, the number of caught returns is 338 lower than the number of enters, but nmissed is 0. So the fact that events are lost is lost too. Something is very unreliable here. What?
As a second question: I understand from reading about the implementation of kretprobes that a kretprobe is actually a special kprobe which causes the return from the function to trigger some code as well. You can even have code of your own at the entry, as I understand from
https://www.kernel.org/doc/Documentation/kprobes.txt
(section on the kretprobe entry-handler). But it seems that this facility is not available from python-bcc. It would be excellently suited to ensure that entries and exits from a function balance properly. By splitting that up into two probes, the guarantee is lost, and it is no doubt less efficient.