Closed: brendangregg closed this issue 8 years ago.
On Fri, Jul 15, 2016 at 6:33 PM, Brendan Gregg notifications@github.com wrote:
```diff
+#define CREATE_TRACE_POINTS
+#include <trace/events/perf.h>
+
 typedef int (*remote_function_f)(void *);
 struct remote_function_call {
@@ -8036,6 +8039,8 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)
 	perf_sample_data_init(&data, 0, event->hw.last_period);
 	regs = get_irq_regs();

+	trace_perf_hrtimer(regs);
+
 	if (regs && !perf_exclude_event(event, regs)) {
 		if (!(event->attr.exclude_idle && is_idle_task(current)))
 			if (__perf_event_overflow(event, 1, &data, regs))
```
Ideally we'd have BPF sampling support, but I think this is a reasonable and useful workaround in the meantime. @ast, what do you think?
makes sense to me. Please run 'checkpatch' on the patch before submitting to make sure there are no spaces instead of tabs. Probably cc Arnaldo and Wang; I think they will be interested in it as well.

Here's the Python program for the profiler:
```python
# Use the -s option to skip the appropriate number of frames for your system.
# It defaults to 3. Eg, the above sample stack is really the idle thread, and
# so -s 11 can be used to skip the top 11 lines.
```
that kinda sux. how does perf deal with these extra frames? or does it not see them?

```python
# use perf tracepoint if it exists, else kprobe
if os.path.exists("/sys/kernel/debug/tracing/events/perf/perf_hrtimer"):
    bpf_text = bpf_text.replace('PERF_TRACE_EVENT',
        'TRACEPOINT_PROBE(perf, perf_hrtimer)')
else:
    bpf_text = bpf_text.replace('PERF_TRACE_EVENT',
        'int kprobe__perf_misc_flags(struct pt_regs *args)')
```
do you see the same extra frames with both the kprobe and tracepoint approach? The kprobe will be firing a lot more often, including during bpf_perf_event_output. Our recursion prevention check will do the right thing there, but it's unnecessary overhead, so the tracepoint is definitely cleaner.
Thanks, will run checkpatch (I'm reading Documentation/SubmittingPatches). Easy to mess up when switching between Python/C.
I'll check how perf deals with it. So it's a different number of frames between the tracepoint (11) and kprobe (14), but I think this will also vary based on the system, whether it's a Xen guest (see that sample stack) or not, etc.
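To make the frame-skipping behavior concrete, here is a minimal user-space sketch of what the -s option has to do: drop a fixed number of innermost frames before folding the stack. The `trim_frames` helper and the sample frames are illustrative, not code from the actual tool.

```python
# Sketch of -s style frame skipping: drop a fixed number of leading
# (innermost) frames, which are profiler/interrupt overhead, before
# folding the stack for output. Illustrative only.

def trim_frames(stack, skip=3):
    """Return the stack with the top `skip` frames removed.

    `stack` is ordered innermost-first, as kernel stack walks report it.
    If skip meets or exceeds the stack depth, return an empty list.
    """
    return stack[skip:] if skip < len(stack) else []

# Example: a stack whose top three frames are timer/profiler overhead.
stack = [
    "perf_swevent_hrtimer",
    "__hrtimer_run_queues",
    "hrtimer_interrupt",
    "vfs_write",
    "sys_write",
]
print(trim_frames(stack, 3))
```

The right skip count differs per system (tracepoint vs kprobe, Xen guest vs bare metal), which is why it has to stay a tunable rather than a constant.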
The code in check_stack() seems to scan the stack trace for an ip match:
```c
	/* Skip over the overhead of the stack tracer itself */
	for (i = 0; i < stack_trace_max.nr_entries; i++) {
		if (stack_dump_trace[i] == ip)
			break;
	}
```
However, that doesn't work. I fetched the ip using the regs argument of my perf:perf_hrtimer tracepoint, and printed it after the kernel stack. Some code, and example output:
```c
struct pt_regs regs = {};
bpf_probe_read(&regs, sizeof(regs), args->regs);
key.ip = PT_REGS_IP(&regs);
```
```
ffffffff81175094 perf_swevent_hrtimer
ffffffff810e6bc4 __hrtimer_run_queues
ffffffff810e7338 hrtimer_interrupt
ffffffff81022f29 xen_timer_interrupt
ffffffff810d2b82 handle_irq_event_percpu
ffffffff810d651a handle_percpu_irq
ffffffff810d2192 generic_handle_irq
ffffffff814a5797 evtchn_2l_handle_events
ffffffff814a2eb3 __xen_evtchn_do_upcall
ffffffff814a4da0 xen_evtchn_do_upcall
ffffffff817cdf0c xen_hvm_callback_vector
ffffffff8136d5f8 apparmor_file_permission
ffffffff81330fdd security_file_permission
ffffffff8120c25e rw_verify_area
ffffffff8120c477 vfs_write
ffffffff8120d7f6 sys_write
ffffffff817cc036 entry_SYSCALL_64_fastpath
IP: ffffffff8136d4c6 common_file_perm
```
The ip is common_file_perm(), but that isn't in the stack trace. Its parent, apparmor_file_permission(), is. I see the same pattern in many other examples.
It looks like the interrupt isn't pushing a stack frame. So how does perf even work by scanning for the ip in the stack, if it's not in the stack? Maybe check_stack() is the wrong code, or perf has fixed up the stack before that point (using saved regs and a saved stack depth?).
If I look at the return address on the stack instead (RBP+8), I do find it in the stack trace:
```c
struct pt_regs regs = {};
bpf_probe_read(&regs, sizeof(regs), args->regs);
bpf_probe_read(&key.ip, sizeof(key.ip), (void *)(regs.bp + 8));
```
```
ffffffff81175094 perf_swevent_hrtimer
ffffffff810e6bc4 __hrtimer_run_queues
ffffffff810e7338 hrtimer_interrupt
ffffffff81022f29 xen_timer_interrupt
ffffffff810d2b82 handle_irq_event_percpu
ffffffff810d651a handle_percpu_irq
ffffffff810d2192 generic_handle_irq
ffffffff814a5797 evtchn_2l_handle_events
ffffffff814a2eb3 __xen_evtchn_do_upcall
ffffffff814a4da0 xen_evtchn_do_upcall
ffffffff817cdf0c xen_hvm_callback_vector
ffffffff8136d5f8 apparmor_file_permission
ffffffff81330fdd security_file_permission
ffffffff8120c25e rw_verify_area
ffffffff8120c477 vfs_write
ffffffff8120d7f6 sys_write
ffffffff817cc036 entry_SYSCALL_64_fastpath
IP: ffffffff8136d5f8 apparmor_file_permission
```
Ok, so that's in the stack, and this seems reliable. I suppose I should enhance it to put both ip and rbp+8 in the map key: the former as the first frame, and the latter for skipping interrupt framework frames.
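For illustration, the user-space side of that aggregation could look like the sketch below: the captured ip becomes the first (innermost) frame of the folded stack. The `fold` helper and the sample data are hypothetical, not from the actual tool.

```python
# Sketch of aggregating samples keyed on the captured ip plus the stack:
# the ip is prepended as the innermost frame, so the leaf function shows
# up even when the interrupt didn't push a frame for it. All names and
# the sample data here are illustrative.
from collections import Counter

def fold(ip_sym, stack_syms):
    """Fold one sample into semicolon-separated form, ip first."""
    return ";".join([ip_sym] + stack_syms)

counts = Counter()
samples = [
    ("common_file_perm", ["apparmor_file_permission", "vfs_write", "sys_write"]),
    ("common_file_perm", ["apparmor_file_permission", "vfs_write", "sys_write"]),
]
for ip_sym, stack in samples:
    counts[fold(ip_sym, stack)] += 1

for folded, n in counts.items():
    print(folded, n)
```

Keying on both values in the BPF map would move this collapse into the kernel, so user space only sees unique (ip, stack) pairs with counts.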
So there'd be at least two ways forward:
Still would like to know why ip isn't in the stack...
Would it make sense to enable the perf event by calling perf_event_open
instead of invoking perf directly? Could be a bit less brittle. I think I have an example of doing this from Python in tracepoint.py.
Also, what is the overhead of running perf in this mode with no output file? Are there still events written to a cyclic buffer and then flushed to /dev/null?
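A minimal ctypes sketch of the perf_event_open route is below. The struct covers only the original perf_event_attr layout (PERF_ATTR_SIZE_VER0, 64 bytes), the syscall number 298 is x86-64-specific, and the flag bit positions are transcribed from memory of the uapi header, so treat all of these as assumptions to verify against <linux/perf_event.h>; this is not the code from tracepoint.py.

```python
# Sketch: build a perf_event_attr for a sampling software clock event,
# as an alternative to shelling out to perf. Struct layout is the
# original 64-byte PERF_ATTR_SIZE_VER0 ABI; constants and bit positions
# are assumptions to check against <linux/perf_event.h>.
import ctypes

PERF_TYPE_SOFTWARE = 1
PERF_COUNT_SW_CPU_CLOCK = 0
SYS_perf_event_open = 298  # x86-64 only

class PerfEventAttr(ctypes.Structure):
    _fields_ = [
        ("type", ctypes.c_uint32),
        ("size", ctypes.c_uint32),
        ("config", ctypes.c_uint64),
        ("sample_period", ctypes.c_uint64),  # union with sample_freq
        ("sample_type", ctypes.c_uint64),
        ("read_format", ctypes.c_uint64),
        ("flags", ctypes.c_uint64),          # bitfields: disabled, freq, ...
        ("wakeup_events", ctypes.c_uint32),  # union with wakeup_watermark
        ("bp_type", ctypes.c_uint32),
        ("bp_addr", ctypes.c_uint64),        # union with config1
    ]

def make_attr(freq=49):
    attr = PerfEventAttr()
    attr.type = PERF_TYPE_SOFTWARE
    attr.size = ctypes.sizeof(PerfEventAttr)
    attr.config = PERF_COUNT_SW_CPU_CLOCK
    attr.sample_period = freq  # read as a frequency when the freq bit is set
    attr.flags = (1 << 0) | (1 << 10)  # disabled=1, freq=1 (assumed bits)
    return attr

attr = make_attr()
print(ctypes.sizeof(attr))
# To actually open it (root, or perf_event_paranoid permitting):
#   libc = ctypes.CDLL(None, use_errno=True)
#   fd = libc.syscall(SYS_perf_event_open, ctypes.byref(attr),
#                     -1, 0, -1, 0)  # pid=-1, cpu=0, group_fd=-1, flags=0
```

One attr like this would be opened per CPU, then enabled with PERF_EVENT_IOC_ENABLE, with no mmap of a ring buffer needed when you only want the timer to fire.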
check_stack() and bits in trace_stack.c are used by ftrace infra to look for large stacks. It's gated by CONFIG_STACK_TRACER and can be enabled by /sys/kernel/debug/tracing/current_tracer. It's very specific to ftrace. perf_event logic doesn't call into it.
Also, completely agree with @goldshtn. Doing a manual perf_event_open is much faster and more stable: perf flags and defaults may change, whereas perf_event_open is a stable kernel ABI.
I've switched to perf_event_open (thanks @goldshtn for the tracepoint.py code), and it should be more efficient now. Since I'm not setting up the perf ring buffers on the event file descriptors via mmap(), nor reading from them, no perf event data is read into user space or written to the file system, as there was with the perf command.
Here's before (perf cmd):
```
bcc/tools# ./funccount.py 'perf*'
Tracing 54 functions for "perf*"... Hit Ctrl-C to end.
^C
ADDR             FUNC                          COUNT
ffffffff81181870 perf_event_max_stack_handler      2
ffffffff811812f0 perf_mmap_to_page                 8
ffffffff81180790 perf_output_begin                12
ffffffff811801d0 perf_mmap_free_page            1041
ffffffff81180150 perf_mmap_alloc_page           1176
ffffffff811803d0 perf_output_begin_forward      3387
ffffffff81007ab0 perf_misc_flags                3391
ffffffff81007a40 perf_instruction_pointer       3392
ffffffff81180a30 perf_output_end                3397
ffffffff811800e0 perf_output_put_handle         3401
ffffffff81180340 perf_output_copy              20393
Detaching...
```
Here's after (perf_event_open):
```
bcc/tools# ./funccount.py 'perf*'
Tracing 54 functions for "perf*"... Hit Ctrl-C to end.
^C
ADDR             FUNC                          COUNT
ffffffff81180790 perf_output_begin                 5
ffffffff811801d0 perf_mmap_free_page               9
ffffffff81180150 perf_mmap_alloc_page            144
ffffffff81007ab0 perf_misc_flags                3587
ffffffff811803d0 perf_output_begin_forward      3587
Detaching...
```
(You can see why I was using perf_misc_flags as the fallback.)
I think this all works because perf_mmap() hasn't been called, so event->rb is not populated, and we exit early out of __perf_output_begin(). I'm glad that the PERF_EVENT_IOC_ENABLE call is successful, despite no ring buffer setup.
So this has ended up much more efficient than I was expecting. Great for us.
Here's a prototype CPU profiler:
Arguments taken from offcputime:
It works by running perf to initialize a software timer, and then using BPF to instrument the events.
If a perf:perf_hrtimer tracepoint exists, it uses that, otherwise it defaults to kprobe of perf_misc_flags(). Here's the code for that tracepoint:
It could be made a little more efficient by adding a perf option to not emit sample records, but fire its timers anyway. perf does have --no-samples, but that avoids the timer as well.
Here's the Python program for the profiler: