:) I was hoping it wouldn't come to this, but that's a reasonable 'workaround'. Enabling the tracepoint via perf and adding a kprobe to perf_trace_* will probably be a tiny bit better, since the code that enables tracepoints via perf_event_open will be reused when tracepoint+bpf is ready, but going via ftrace is also ok-ish. Both are ugly hacks, of course, and will be slower than a normal kprobe at the beginning of a function, since trace_event_raw_* and perf_trace_* are 'notrace', so the kprobe will do a full 'int 3'. Also note that they will not have access to the struct of the trace_event, but rather only the 'proto' part of it.
OK, suppose I go through perf - would it be possible to have perf discard the events and not write them out to perf.data? I'm looking to eliminate the inefficiency in doing analysis and filtering in user space, otherwise it isn't really worth pursuing...
if perf_event_open is done for the task, then perf_trace* will be called, but it will not produce any output, since pid != pid_of_our_python_process. Another approach is to add a filter to it that always returns false. It's a bit slower, since the alloc + copy of arguments and registers will happen before the event is discarded. The main overhead is the int3 anyway.
OK, so just to make sure I understand the recommended approach: turn on the tracepoints using perf record with a filter that always returns false, so that no data actually goes into a file; place a kprobe on one of the methods that process the perf event, such as perf_tp_event, and try to capture data out of the provided record. Makes sense?
Also adding @brendangregg if you have any thoughts.
Turn the tracepoint on with the perf_event_open() syscall, add a filter via the PERF_EVENT_IOC_SET_FILTER ioctl, and add a kprobe to the corresponding perf_trace_* function. There are ~400 of them, one for each tracepoint, like perf_trace_block_bio_complete. There will be no trace record, only the function args in ctx->di, si, ... Adding a kprobe to perf_tp_event is not really usable, since all events will go through it and the record will be different and hard to parse.
I was kind of hoping to use perf and not have to replicate the whole "turn on the tracepoint and set a filter" functionality :-) That might be a bit out of scope for something like this, which is by definition temporary until BPF supports tracepoints... What do you think?
you mean to execute the perf binary?! That's more complex, slower, and more fragile than doing a syscall.
Well I'm willing to try. While I have your attention, do you know if there's a simple example of turning on a tracepoint and adding a filter?
bcc/src/cc/libbpf.c for opening an event. Doing the ioctl is trivial; just pass a string to it: ioctl(fd, PERF_EVENT_IOC_SET_FILTER, filter);
All right, I think I have something reasonable at this point -- https://github.com/goldshtn/bcc/tree/tpoint. There are a few small issues to figure out, but the main one is this: some tracepoints don't have a corresponding perf_trace_ function, so I don't know where to place the kprobe.
Here's an example that works nicely:
# ./argdist.py -H 't:block:block_rq_complete(void *dummy, void *dummy2, void *dummy3, int nr_bytes):int:nr_bytes'
[05:36:59]
nr_bytes : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 1 |****************************************|
But there are many examples of tracepoints for which I can't find the corresponding perf_trace_ function. For example:
power:cpu_idle -- the tracepoint exists, but there's no perf_trace_cpu_idle function on my system, or in fact anything in /proc/kallsyms matching perf_trace_.*idle
All syscalls:* tracepoints -- there's just perf_trace_sys_enter and perf_trace_sys_exit, but no specific function for each syscall
Lastly, the perf_trace_ functions' first argument is not declared in the TP_PROTO macro. Do you know what it is? I can try exploring the trace macro gore, but if you happen to know...
I'd appreciate any pointers that would help make this even more useful.
/cc @brendangregg @4ast
I've looked through the patches in your tree, but it was too hard for me to see what's going on. My only recommendation is to do it in the generic bcc python module instead of in argdist. Other than that, clean it up and send a pull req. Don't bother parsing the first arg. It's a pointer to a raw 'data' buffer; the tracepoint code stores its args in there and then the perf side copies this buffer into userspace.
@4ast Thanks! I'll add it to the bcc module. About my other question, do you know why some tracepoints don't have corresponding perf_trace_ functions? Is there any workaround?
syscall* tracepoints are slow and not recommended. It's faster to kprobe into sys_foo. The perf_trace* for power:cpu_idle is actually done by perf_trace_cpu:
$ sudo cat /sys/kernel/debug/tracing/events/power/cpu_frequency
print fmt: "state=%lu cpu_id=%lu"
which is the same format and perf_trace_cpu() helper function used by power:cpu_idle and power:cpu_frequency. I don't know how to easily tell which perf_trace* corresponds to a tracepoint name.
Great work! And I didn't realize perf_trace_* was such a strong convention, but I see it now. I should have seen it sooner -- it's from the macros (TRACE_EVENT etc).
So I don't get how we get to nr_bytes.
# cat /sys/kernel/debug/tracing/events/block/block_rq_complete/format
name: block_rq_complete
ID: 777
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:dev_t dev; offset:8; size:4; signed:0;
field:sector_t sector; offset:16; size:8; signed:0;
field:unsigned int nr_sector; offset:24; size:4; signed:0;
field:int errors; offset:28; size:4; signed:1;
field:char rwbs[8]; offset:32; size:8; signed:1;
field:__data_loc char[] cmd; offset:40; size:4; signed:1;
So the args should be: dev, sector, nr_sector, ... right? So eliding arg0 (as @4ast said), the one-liner should be:
./argdist.py -H 't:block:block_rq_complete(void *dummy2, void *dummy3, int nr_sector)
I guess it's clear from your output anyway, since we aren't doing block device I/O of "8 -> 15" bytes :)
As for @4ast's last Q, how we map tracepoint names to perf_trace_* functions, I think it can all come from the events pathnames:
/sys/kernel/debug/tracing/events/block/block_rq_complete
I think we can parse that to give us the tracepoint name (last two components -> block:block_rq_complete), and also the function name (last component prefixed with perf_trace_). Although they don't all exist as those functions (e.g., one kernel I'm on is missing perf_trace_block_bio_backmerge), so we'd have to check /sys/kernel/debug/tracing/available_filter_functions.
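For illustration, a minimal sketch of that derivation; the /proc/kallsyms check and the exact paths are assumptions for the example, not code from the tpoint branch:
#!/usr/bin/env python
# Minimal sketch: derive "category:event" and a candidate perf_trace_* symbol
# from an events directory path, then check whether that symbol actually exists.
# The kallsyms lookup and the paths are illustrative assumptions.

def tracepoint_from_path(path):
    # ".../events/block/block_rq_complete" ->
    #   ("block:block_rq_complete", "perf_trace_block_rq_complete")
    parts = path.rstrip("/").split("/")
    category, event = parts[-2], parts[-1]
    return "%s:%s" % (category, event), "perf_trace_" + event

def symbol_exists(sym):
    # /proc/kallsyms lines look like "ffffffff81234567 t perf_trace_block_rq_complete"
    with open("/proc/kallsyms") as f:
        return any(line.split()[2] == sym for line in f)

tp, func = tracepoint_from_path(
    "/sys/kernel/debug/tracing/events/block/block_rq_complete")
print(tp, func, symbol_exists(func))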
RE syscall tracing: we probably should have a docs directory, and a simple syscalls.md document to explain how to trace syscalls, and the caveats of kprobes (stability).
I used to know the difference between sys_read and SyS_read, but I forget right now, and it's not easy to duckduckgo for (aka google for).
I suppose bcc could map syscall tracepoints to the sys kprobes, and have a simple lookup table for that. This would be optional -- you could trace the kprobe directly, or try a syscall tracepoint mapping. The problem is maintenance of the lookup table, especially if we need one bcc repo to support different kernel versions. I'm reminded of SystemTap's tapsets, that get mired in kernel version tests. It might not be so bad if we just do it for syscalls and nothing else.
... block:block_bio_backmerge got inlined, hence no perf_trace_...
@brendangregg re: available_filter_functions is for ftrace only; it's not applicable to kprobes or tracepoints. Most of the time the last two dirs of /sys/kernel/debug/tracing/events/block/block_bio_backmerge will map to perf_trace_aa_bb, but not always. Like in this case the function is perf_trace_block_bio_merge. Inside include/trace/events/block.h there is the line "DEFINE_EVENT(block_bio_merge, block_bio_backmerge", which means that the tracepoints block:block_bio_backmerge (and block:block_bio_frontmerge) are served by the same perf_trace_block_bio_merge() handler. I don't know how to derive it from anything other than the source code. I'm not worried about the mapping of syscalls. I don't think anyone cares about collecting stats for all syscalls, since new ones are constantly added. There is a standard set of syscalls like sys_open and their prototypes are quite stable. Most of the time adding the sys_ prefix will do the trick.
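Since that DEFINE_EVENT relationship lives only in the source, one could scrape it; a rough sketch, assuming a local kernel source tree at an illustrative path (the regex is crude and will miss multi-line or unusual definition styles):
#!/usr/bin/env python
# Rough sketch: scan include/trace/events/*.h for DEFINE_EVENT(class, name ...)
# to learn which event class (and thus which perf_trace_<class> handler) serves
# each tracepoint.  KSRC and the regex are assumptions for this illustration.
import glob
import re

KSRC = "/usr/src/linux"  # assumption: path to kernel sources
define_event = re.compile(r"DEFINE_EVENT\s*\(\s*(\w+)\s*,\s*(\w+)")

handlers = {}
for header in glob.glob(KSRC + "/include/trace/events/*.h"):
    with open(header) as f:
        for cls, name in define_event.findall(f.read()):
            handlers[name] = "perf_trace_" + cls

# e.g. handlers.get("block_bio_backmerge") -> "perf_trace_block_bio_merge"
print(handlers.get("block_bio_backmerge"))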
Anyway, I'm working on proper support for tracepoints. The goal is to let bpf programs take the 2nd argument into the bpf prog as a special struct. Like for block:block_rq_complete the 2nd arg will be struct S { dev_t dev; sector_t sector; unsigned int nr_sector; int errors; char rwbs[8]; }; and bpf_prog(struct pt_regs *ctx, struct S *tp) will be able to access it directly as 'tp->nr_sector'.
Technically it's possible to hack it already (without kernel changes). The first argument (that I said not to use) will be filled with such a struct, but the kprobe should be not on the first instruction but close to the last, and ctx->di will be pointing to a filled-in struct like the above. We'd need to bpf_probe_read() it first. Why would it work? Because this is how the perf_trace_* functions are defined. See include/trace/perf.h:70
@brendangregg Re nr_bytes vs. nr_sectors, it's weird, but the TP_PROTO macro mentions nr_bytes. See here: http://lxr.free-electrons.com/source/include/trace/events/block.h#L145
TRACE_EVENT(block_rq_complete,
TP_PROTO(struct request_queue *q, struct request *rq,
unsigned int nr_bytes),
...
I don't think it's feasible to come up with the perf_trace_ function name by parsing the source code, so I'll just settle for what I'm currently doing and warn that it might fail for some tracepoints. Indeed, savvy users can simply figure out the perf_trace_ function they need and place a kprobe there directly :-)
@4ast I will experiment with a kretprobe on the perf_trace_ functions and see if I have access to the formatted entry struct. That would be best, because that's what the tracepoint consumers are going to expect. (As an aside, is BPF tracepoint support expected in 4.6, 4.7, something later?)
Hello
I did a test looking at nr_bytes vs sector as it didn’t make sense to me how we could get nr_bytes when it is used to work out the sector value. So maybe it didn’t matter what was passed in to the script.
root@henky:~# dd if=/dev/zero of=./aaaa oflag=direct bs=8192 count=10
10+0 records in
10+0 records out
81920 bytes (82 kB) copied, 0.00516186 s, 15.9 MB/s
root@henky:/var/tmp/march# ./argdist.py -H 't:block:block_rq_complete(void *dummy, void *dummy2, void *dummy3, int allan):int:allan'
[19:51:10]
allan : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 1 |**|
[19:51:11]
allan : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 1 |******|
[19:51:12]
allan : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 2 |******|
[19:51:13]
allan : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 2 |*** |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 0 | |
1024 -> 2047 : 0 | |
2048 -> 4095 : 0 | |
4096 -> 8191 : 0 | |
8192 -> 16383 : 10 |****|
with nr_sector
root@henky:/var/tmp/march# ./argdist.py -H 't:block:block_rq_complete(void *dummy, void *dummy2, void *dummy3, int nr_sector):int:nr_sector'
[20:05:32]
[20:05:33]
nr_sector : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 1 |**|
[20:05:35]
nr_sector : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 1 |*** |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 0 | |
1024 -> 2047 : 0 | |
2048 -> 4095 : 0 | |
4096 -> 8191 : 0 | |
8192 -> 16383 : 10 |*|
[20:05:36]
nr_sector : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 2 |*** |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 0 | |
1024 -> 2047 : 0 | |
2048 -> 4095 : 0 | |
4096 -> 8191 : 0 | |
8192 -> 16383 : 10 ||
I might be missing something, but did you simply run the same command with 'allan' and 'nr_sector' as the parameter names? If so, there will be no difference because the parameter names you give the traced method don't matter at all, and have nothing to do with the names used in the trace structure.
@4ast I've tried hacking together an example that reads from the entry struct when exiting the perf_trace_ function. I record ctx->di on entry, and then bpf_probe_read from it on exit and populate the entry struct. I might be doing something wrong, but I'm getting nonsensical values. To reproduce, put the following in a .py file and run it; in the background, run perf record -e 'block:block_rq_complete'.
#!/usr/bin/env python
from bcc import BPF
from time import sleep
source = """
#include <linux/ptrace.h>
#include <linux/blkdev.h>
struct entry {
    dev_t dev;
    sector_t sector;
    unsigned int nr_sector;
    int errors;
    char rwbs[8];
};
BPF_HASH(history, u64, struct entry);
BPF_HASH(curr, u64, u64);
int enter(struct pt_regs *ctx)
{
    u64 tid = bpf_get_current_pid_tgid();
    u64 val = ctx->di;
    curr.update(&tid, &val);
    return 0;
}
int probe(struct pt_regs *ctx)
{
    u64 tid = bpf_get_current_pid_tgid();
    u64 *enter_di = curr.lookup(&tid);
    if (enter_di == 0)
        return 0;
    u64 key = bpf_ktime_get_ns();
    struct entry e = {};
    bpf_probe_read(&e, sizeof(e), (void *)*enter_di);
    history.update(&key, &e);
    return 0;
}
"""
bpf = BPF(text=source)
bpf.attach_kprobe(event="perf_trace_block_rq_complete", fn_name="enter")
bpf.attach_kretprobe(event="perf_trace_block_rq_complete", fn_name="probe")
while True:
    sleep(1)
    print("****")
    for k, v in bpf.get_table("history").items():
        print(k.value, v.rwbs, v.nr_sector, v.sector)
@4ast @drzaeus77 Because this would also be my first PR to the bcc module, I wanted to know what you think about the desired API. How about BPF.attach_tracepoint, which takes the tracepoint category and event (e.g., "block", "block_rq_complete") and the probe function name, and does the perf_event_open magic? What should it do on failure (when it can't resolve perf_trace_nnn for the specified tracepoint)?
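For illustration only, the kind of call site being proposed; the signature and error behavior here are hypothetical, not an API bcc has at this point:
# Hypothetical usage of the proposed BPF.attach_tracepoint -- illustrative only.
from bcc import BPF

prog = """
int probe(struct pt_regs *ctx) { return 0; }
"""
b = BPF(text=prog)
try:
    # category + event, plus the BPF function to attach to perf_trace_block_rq_complete
    b.attach_tracepoint("block", "block_rq_complete", fn_name="probe")
except Exception as e:
    print("couldn't attach (no matching perf_trace_* function?): %s" % e)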
@goldshtn sorry, my mistake about the first arg. Remembering ctx->di isn't going to work; I misread the kernel code. Trying to figure out another way...
ok, the following changes were needed:
struct entry {
    u64 dontuse;
    dev_t dev;
    sector_t sector;
    unsigned int nr_sector;
    int errors;
    char rwbs[8];
};
...
bpf.attach_kprobe(event="tracing_generic_entry_update", fn_name="enter")
and perf record -a -e 'block:block_rq_complete' output:
(8653982714137064L, 'WS', 0L, 978592712L)
(8653981359998531L, 'W', 16L, 1121216128L)
(8653982714130641L, 'FWS', 0L, 18446744073709551615L)
(8653981364275177L, 'W', 24L, 822182352L)
Also, I would suggest using an array type instead of a hash to record ctx->di, with key==0. It will reduce overhead a lot.
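For reference, a minimal sketch of that change in the BPF text, using bcc's BPF_TABLE("array", ...) with a single slot (whether one shared slot is safe across threads is exactly the question raised in the next comment):
# Sketch only: a 1-element array instead of a hash for stashing ctx->di between
# the entry kprobe and the return kprobe.  A single shared slot assumes
# concurrent firings are not a concern (see the question below).
source = """
#include <linux/ptrace.h>

BPF_TABLE("array", u32, u64, curr, 1);

int enter(struct pt_regs *ctx)
{
    u32 zero = 0;
    u64 *slot = curr.lookup(&zero);
    if (slot)
        *slot = ctx->di;
    return 0;
}
"""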
@4ast Cool, will keep experimenting. Thanks for the help. BTW, not sure how an array with key=0 would work -- there could be multiple threads calling the trace methods, so it doesn't seem safe to record ctx->di with just one key for all these threads. Am I wrong?
@4ast OK, so I set up a kprobe on tracing_generic_entry_update, a kretprobe on perf_trace_nnn, and I enable the tracepoint just like I did before (with perf_event_open). The tracing_generic_entry_update function is never called, but the perf_trace_nnn function is.
When I enable the tracepoint with perf record, everything works fine, even if I specify an always-false filter (I use perf record -e 'net:net_dev_xmit' --filter 'common_pid==-17' -a). But it doesn't work when I enable the tracepoint myself.
The code is still on my tpoint branch. Here are the test programs I use outside of the full argdist script. The first one enables a tracepoint using perf_event_open (this is the same as what I do in argdist, but from Python through ctypes):
#include <string.h>
#include <linux/perf_event.h>
#include <stdio.h>
#include <unistd.h>
#include <errno.h>
#include <sys/ioctl.h>
#include <linux/unistd.h>
int main()
{
    struct perf_event_attr attr = {};
    int pid = 0 /* current */, cpu = -1, group_fd = -1;
    int pfd;

    printf("__NR_perf_event_open = %d\n", __NR_perf_event_open);
    printf("PERF_TYPE_TRACEPOINT = %d\n", PERF_TYPE_TRACEPOINT);
    printf("PERF_SAMPLE_RAW = %d\n", PERF_SAMPLE_RAW);
    printf("PERF_FLAG_FD_CLOEXEC = %d\n", PERF_FLAG_FD_CLOEXEC);
    printf("PERF_EVENT_IOC_SET_FILTER = %d\n", PERF_EVENT_IOC_SET_FILTER);
    printf("PERF_EVENT_IOC_ENABLE = %d\n", PERF_EVENT_IOC_ENABLE);

    attr.config = 1116; /* net:net_dev_xmit */
    attr.type = PERF_TYPE_TRACEPOINT;
    attr.sample_type = PERF_SAMPLE_RAW;
    attr.sample_period = 1;
    attr.wakeup_events = 1;
    pfd = syscall(__NR_perf_event_open, &attr, pid, cpu, group_fd,
                  PERF_FLAG_FD_CLOEXEC);
    if (pfd < 0) {
        fprintf(stderr, "perf_event_open failed: %s\n",
                strerror(errno));
        return 1;
    }
    if (ioctl(pfd, PERF_EVENT_IOC_SET_FILTER, "common_pid == -17") < 0) {
        fprintf(stderr, "ioctl to set event filter failed\n");
        return 1;
    }
    if (ioctl(pfd, PERF_EVENT_IOC_ENABLE, 0) < 0) {
        fprintf(stderr, "ioctl to enable event failed\n");
        return 1;
    }

    printf("Hit ENTER to quit.\n");
    getchar();
    return 0;
}
And here's the script, following your recommendations:
#!/usr/bin/env python
from bcc import BPF
from time import sleep
source = """
#include <linux/ptrace.h>
#include <linux/blkdev.h>
struct entry2 {
    u64 __DONT_USE__;
    void *skbaddr;
    unsigned int len;
    int rc;
};
BPF_HASH(history, u64, struct entry2);
BPF_HASH(curr, u64, u64);
int enter(struct pt_regs *ctx)
{
    u64 tid = bpf_get_current_pid_tgid();
    u64 val = ctx->di;
    curr.update(&tid, &val);
    return 0;
}
int probe(struct pt_regs *ctx)
{
    u64 tid = bpf_get_current_pid_tgid();
    u64 *enter_di = curr.lookup(&tid);
    if (enter_di == 0)
        return 0;
    u64 key = bpf_ktime_get_ns();
    struct entry2 e = {};
    bpf_probe_read(&e, sizeof(e), (void *)*enter_di);
    history.update(&key, &e);
    return 0;
}
"""
bpf = BPF(text=source)
bpf.attach_kprobe(event="tracing_generic_entry_update", fn_name="enter")
bpf.attach_kretprobe(event="perf_trace_net_dev_xmit", fn_name="probe")
print(BPF.open_kprobes())
while True:
    sleep(1)
    print("****")
    for k, v in bpf.get_table("history").items():
        print(k.value, v.skbaddr, v.len, v.rc)
The name BPF.attach_tracepoint sounds good, and if it returns an error status on whether it matched or not, the program can deal with it appropriately: print a warning, or exit, or ignore it.
If this is only matching some tracepoints, then we might want a "tlist" tool that lists traceable tracepoints. Later on, when proper tracepoint support exists, tlist should just list them all.
Yeah, now we just need to figure out how to get the entry struct and I think the experience can be pretty good. The tlist tool could print the tracepoint details and format, and then argdist and trace would let you access the parameters from the format fields directly.
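As a rough illustration, such a listing could already be scraped from debugfs; the tool itself is hypothetical and the path below is just the usual mount point:
#!/usr/bin/env python
# Rough illustration of a "tlist"-style listing: walk the debugfs events tree
# and print each tracepoint with its format fields.  The tool doesn't exist
# yet; the path is the usual debugfs mount point.
import os

EVENTS = "/sys/kernel/debug/tracing/events"

for category in sorted(os.listdir(EVENTS)):
    catdir = os.path.join(EVENTS, category)
    if not os.path.isdir(catdir):
        continue
    for event in sorted(os.listdir(catdir)):
        fmt = os.path.join(catdir, event, "format")
        if not os.path.isfile(fmt):
            continue
        print("%s:%s" % (category, event))
        with open(fmt) as f:
            for line in f:
                if line.lstrip().startswith("field:"):
                    print("    " + line.strip())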
I don't think I can or should retrofit this magic into the BPF module though. No existing methods there actually introduce their own BPF program, and the tracepoint work would require it (the program that reads ctx->di with the entry address and the program that reads it in a kretprobe when it's already filled).
I spent some time digging through the TRACE_* macros in the kernel, and I'm still optimistic that there's a way to extract the types automatically. I'll keep looking.
I have everything set up to generate the structure from the debugfs format file. I just need a reliable way of getting a pointer to the structure that's filled by the perf_trace_ functions. Is that what you are referring to in "extract the types automatically"?
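A rough sketch of that format-file-to-struct step (illustration only, not the code from the tpoint branch): the common_* fields collapse into the leading u64 placeholder, as in the struct above, and __data_loc fields are skipped for simplicity.
#!/usr/bin/env python
# Sketch: emit a C struct from a tracepoint format file.  One u64 placeholder
# covers the 8 bytes of common_* fields; __data_loc fields are ignored here.

def struct_from_format(path, name):
    lines = ["struct %s {" % name, "    u64 __dont_use__;"]
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line.startswith("field:"):
                continue
            decl = line.split(";")[0][len("field:"):].strip()
            if "common_" in decl or "__data_loc" in decl:
                continue
            lines.append("    %s;" % decl)
    lines.append("};")
    return "\n".join(lines)

print(struct_from_format(
    "/sys/kernel/debug/tracing/events/block/block_rq_complete/format",
    "entry"))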
I was hoping to #include and/or use some of the rewriter functionality to glean the struct definition directly from kernel headers, rather than rely on debugfs, though the end result should be the same. Besides there being a bunch of nested macros, it's not magic how the structs are defined, and it should be possible.
I don't mind getting the structure from debugfs (especially since I already wrote the code). It's getting a pointer to the filled struct that's an issue. It's being filled by the assign part in the macros. Any suggestion for where to put a probe to get the structure pointer would help a lot.
It looks to me like the call to perf_trace_buf_submit near the end of perf_trace_##call, if probed, will have the filled struct entry in arg0. I haven't been following this discussion closely enough to know if that submit will occur, or if the way you have the tracepoint set up will skip over the submit bits. I guess I'll have to try your code myself.
Yeah, that would be a good place to try. I will do it tomorrow. Unfortunately I don't see a way to figure out the tracepoint id when looking only at this function, so I will still need a probe in the specific perf_trace function followed by a probe in perf_tp_event (which the inlinable perf_trace_buf_submit calls).
A probe in perf_tp_event is not great, since it will be firing for all tracepoints. Specifying a 'common_pid == -1' filter will be sort-of ignored from our hack's point of view, since the tracepoint data will be stored in the buffer and the kretprobe will be called and can see the data. See perf_trace_##call() in include/trace/perf.h for details on how it works. Currently I cannot come up with a reasonable explanation for why perf_trace_xx is called but tracing_generic_entry_update is not.
The weird thing is that it is called when I use perf to enable the tracepoint, even with the filter. Maybe it's worth looking more into what perf is doing. Maybe we need to set up the mmap buffer in order to reach our desired code path.
Can you also explain why perf_tp_event would be called for all tracepoints? Do you mean tracepoints not enabled by our code, so we would be probing unnecessarily?
yes. It's easier to look at perf_trace_##call() in include/trace/perf.h to see what it's doing and why a kprobe at perf_tp_event() is not going to work. imo the code is self-explanatory.
I looked at the code, indeed. I think I can record a temporary marker into a BPF array in the perf_trace_nnn function, and then check for that marker in perf_trace_buf_submit and clear it. That way we are still probing all tracepoints but not doing the actual reporting work for tracepoints we don't own. How's that sound?
I don't see how that would work. The whole point of tracepoints is to see the args, so the main code of the program needs to operate after the { assign; } part. What I was proposing is to get a pointer from perf_trace_buf_prepare->tracing_generic_entry_update and process that pointer in a kRETprobe of perf_trace_xxx. Does it make more sense now after looking at the perf_trace_xxx code?
In the version of the function I'm looking at (http://lxr.free-electrons.com/source/include/trace/perf.h#L70), the submit function comes after the assign part. What am I missing?
yes, we're looking at the same stuff. It's called for all tracepoints, so how would a single bpf program deal with all the variants of the struct passed as the first arg into perf_trace_buf_submit?
(So the perf_trace_nnn function records a marker that indicates which tracepoint it is, and then when submit is called I can tell if it's my tracepoint and do the main part of the program.)
Oh, I would have a single BPF program that calls into child functions for each tracepoint. Each child function would recognize its own struct.
if recorded tp id == 1106 process tp struct for 1106 else ...
I realize it would get us close to the BPF program size limit. But the tracing entry update function isn't called for some reason, so I figured it would be worth a shot.
how is that going to work with multiple scripts doing it at the same time?
It is not. But neither would placing a probe in tracing_generic_entry_update. I think pretty much all the tools place probes in functions and can't be run simultaneously.
I am not proposing multiple probes in a single function in a single tool invocation. Argdist would have to generate a single BPF program that handles all the tracepoints the user asks for.
I suspect that if you see tracing_generic_entry_update is not called then perf_trace_buf_submit won't be called either, so you probably need to debug it regardless. No other ideas so far...
indeed, perf_trace_buf_submit is not called either. I suppose it means I'm hitting the early return condition in the perf_trace_nnn function:
head = this_cpu_ptr(event_call->perf_events); \
if (__builtin_constant_p(!__task) && !__task && \
hlist_empty(head)) \
return;
Not sure what this perf_events list should be, but could it be something perf record initializes that just enabling the tracepoint with perf_event_open doesn't initialize?
I figured it out after strace-ing perf record .... I was providing pid=0, cpu=-1 in my call to perf_event_open, which means monitor only the current process on any CPU. I suspect this causes the task-related check above to fail early in perf_trace_nnn. Now I'm using pid=-1, cpu=0, which means monitor any process on CPU 0, and I see the perf_trace_buf_submit function being called, and I can get the entry structure successfully.
What I wonder, though, is whether I now need to call perf_event_open for each CPU. It would seem that way (e.g. from this thread), but if you can confirm for sure, it would be great. It doesn't seem that perf record ... -a does it -- here's the relevant excerpt from strace:
perf_event_open(0x7ffdb4f4f200, -1, 0, -1, PERF_FLAG_FD_CLOEXEC) = 4
close(4) = 0
perf_event_open(0x33c5e48, -1, 0, -1, PERF_FLAG_FD_CLOEXEC) = -1 ENOENT
perf_event_open(0x33c5e48, -1, 0, -1, PERF_FLAG_FD_CLOEXEC) = -1 ENOENT
perf_event_open(0x33c5e48, -1, 0, -1, PERF_FLAG_FD_CLOEXEC) = 4
close(4) = 0
perf_event_open(0x33c5e48, -1, 0, -1, PERF_FLAG_FD_CLOEXEC) = 4
close(4) = 0
perf_event_open(0x33b67d8, -1, 0, -1, PERF_FLAG_FD_CLOEXEC) = 4
Sure enough, there are four calls to perf_event_open that succeed and return the same fd (4), but it's only the perf_event_attr parameter that is sometimes different; the pid and cpu parameters are always -1 and 0.
/cc @4ast
not sure what's wrong with your strace... did you run it under root? here is what I see:
perf_event_open(0x886f40, -1, 0, -1, 0x8 /* PERF_FLAG_??? */) = 4
perf_event_open(0x886f40, -1, 1, -1, 0x8 /* PERF_FLAG_??? */) = 5
perf_event_open(0x886f40, -1, 2, -1, 0x8 /* PERF_FLAG_??? */) = 6
perf_event_open(0x886f40, -1, 3, -1, 0x8 /* PERF_FLAG_??? */) = 8
perf_event_open(0x886f40, -1, 4, -1, 0x8 /* PERF_FLAG_??? */) = 9
perf_event_open(0x886f40, -1, 5, -1, 0x8 /* PERF_FLAG_??? */) = 10
perf_event_open(0x886f40, -1, 6, -1, 0x8 /* PERF_FLAG_??? */) = 11
perf_event_open(0x886f40, -1, 7, -1, 0x8 /* PERF_FLAG_??? */) = 12
it has to be called for all cpus.
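So on the Python side the fix is one perf_event_open per CPU with pid=-1; a minimal sketch, where open_tracepoint_on_cpu() is a hypothetical stand-in for the perf_event_open + ioctl sequence from the C test program above:
# Sketch of the per-CPU enable loop.  open_tracepoint_on_cpu(event_id, cpu) is
# a hypothetical wrapper around the perf_event_open + PERF_EVENT_IOC_SET_FILTER
# + PERF_EVENT_IOC_ENABLE sequence shown in the C program above, called with
# pid=-1 and the given cpu; it is not an existing helper.
import multiprocessing

event_id = 1116  # net:net_dev_xmit id from debugfs, as in the C example
fds = [open_tracepoint_on_cpu(event_id, cpu)
       for cpu in range(multiprocessing.cpu_count())]
# keep the fds open for as long as the kprobes should keep firing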
I understand that adding kernel tracepoint support in BPF is going to take a while. In the meantime, would it be possible to put together a tool that relies on kprobes? What I had in mind was to enable the tracepoint using ftrace and then place a kprobe in the trace_event_raw_event_* function or in trace_event_buffer_commit, or some similar location. The kprobe would have access to the entry structure defined by TRACE_EVENT.
Is that a direction worth pursuing? Would it be possible (for efficiency) to instruct the ftrace infrastructure to discard the event and not even put it in the ftrace buffer?