filetop.py: map.lookup_or_init() is wrong?

201341 commented 5 years ago

I use the filetop.py to show reads and writes by file with process. But find something strange hanpens. the result is as follow

TID    COMM             READS  WRITES R_Kb    W_Kb    T FILE
4236   collectd         158    0      623     0       R cmdline
4236   collectd         8      0      31      0       R cmdline
4236   collectd         6      0      24      0       R cmdline
4236   collectd         4      0      16      0       R cmdline
4236   collectd         4      0      15      0       R cmdline
4236   collectd         11     0      11      0       R status
4232   collectd         10     0      10      0       R mounts
4236   collectd         8      0      8       0       R stat
4236   collectd         2      0      8       0       R cmdline
4235   collectd         2      0      8       0       R udev.conf
4236   collectd         2      0      8       0       R cmdline
14469  pmcd_wait        2      0      8       0       R hosts
4236   collectd         2      0      8       0       R cmdline
4236   collectd         2      0      8       0       R cmdline
4236   collectd         2      0      7       0       R cmdline
4236   collectd         2      0      7       0       R cmdline
4236   collectd         2      0      7       0       R cmdline

The line from one to five has same process id, comm, and file. why they print on different lines? Is the api map.lookup_or_init() wrong or some reasons？

palmtenor commented 5 years ago

One possibility is that the comm or name strings has garbage left after \0 terminator, but BPF hash table will still treat them as distinct keys.

yonghong-song commented 5 years ago

The hacking below showed @palmtenor's point above is right.

diff --git a/tools/filetop.py b/tools/filetop.py
index ccc5a107..619b2897 100755
--- a/tools/filetop.py
+++ b/tools/filetop.py
@@ -107,8 +107,26 @@ static int do_entry(struct pt_regs *ctx, struct file *file,
     // store counts and sizes by pid & file
     struct info_t info = {.pid = pid};
     bpf_get_current_comm(&info.comm, sizeof(info.comm));
+    int i;
+    int start = -1;
+    #pragma unroll
+    for (i = 0; i < TASK_COMM_LEN; i++) {
+      if (start > 0)
+         info.comm[i] = 0;
+      else if (info.comm[i] == 0)
+         start = 1;
+    }
+
     info.name_len = d_name.len;
     bpf_probe_read(&info.name, sizeof(info.name), d_name.name);
+    start = -1;
+    #pragma unroll
+    for (i = 0; i < DNAME_INLINE_LEN; i++) {
+      if (start > 0)
+         info.name[i] = 0;
+      else if (info.name[i] == 0)
+         start = 1;
+    }
     if (S_ISREG(mode)) {
         info.type = 'R';
     } else if (S_ISSOCK(mode)) {

We may need a better kernel or bcc interface to deal with this nicely. Any suggestions?

palmtenor commented 5 years ago

I used to use bpf_probe_read's error behavior to serve as a poorman's bzero...

yonghong-song commented 5 years ago

@palmtenor I forgot how did you exactly do this? if bzero the whole thing, __builtin_memset should work. Here, we want to partial bzero.

palmtenor commented 5 years ago

Yes, it's just that __builtin_memset consumes too many instructions for larger buffer. And I think I was actually using bpf_perf_event_read_value and provide it with an invalid struct size, and the helper would just zero out the buffer...

201341 commented 5 years ago

Thanks for your repley. Is there any suggestion to deal with these? and I find another question, when I use command cat to read a small file whose size is 9 bit. And the filetop show the r_kb is 128KB as follow

TID    COMM             READS  WRITES R_Kb    W_Kb    T FILE
21437  cat              2      0      128     0       R testv.txt
21438  cat              2      0      128     0       R testv.txt

[root@node ~]# stat testv.txt 
  File: ‘testv.txt’
  Size: 9               Blocks: 8          IO Block: 4096   regular file
Device: fd00h/64768d    Inode: 70603259    Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Context: unconfined_u:object_r:admin_home_t:s0
Access: 2019-05-20 14:48:13.231039857 +0800
Modify: 2019-05-20 14:48:02.460952507 +0800
Change: 2019-05-20 14:48:02.460952507 +0800

yonghong-song commented 5 years ago

For the original question, using bpf_probe_read_str() helper can fix the issue.

-bash-4.4$ git diff
diff --git a/tools/filetop.py b/tools/filetop.py
index ccc5a107..002d0747 100755
--- a/tools/filetop.py
+++ b/tools/filetop.py
@@ -108,7 +108,7 @@ static int do_entry(struct pt_regs *ctx, struct file *file,
     struct info_t info = {.pid = pid};
     bpf_get_current_comm(&info.comm, sizeof(info.comm));
     info.name_len = d_name.len;
-    bpf_probe_read(&info.name, sizeof(info.name), d_name.name);
+    bpf_probe_read_str(&info.name, sizeof(info.name), d_name.name);
     if (S_ISREG(mode)) {
         info.type = 'R';
     } else if (S_ISSOCK(mode)) {
-bash-4.4$

But bpf_probe_read_str is not available in earlier kernel versions. But you can check /proc/kallsyms for bpf_probe_read_str for its availability.

Do you mind to submit a pull request for this?

yonghong-song commented 5 years ago

For the other issue related to r_kb is 128KB. The issue is due to read system call. For read system call, ssize_t read(int fd, void *buf, size_t count), the count may not be the actual count, it is the buffer size to read, using strace cat testv.txt you probably will find the count is 64KB which is actually the buffer size.

To get accurate read size, need to use kretprobe instead of kprobe for vfs_read. Maybe during kprobe, create a hash table with key = task_struct, and during kretprobe, based on the same task struct to fill in read_size? Since you are studying this, could you explore this idea further?

201341 commented 5 years ago

Thanks for your comment. It helps me a lot, I will pull request for this later.

iovisor / bcc

filetop.py: map.lookup_or_init() is wrong? #2360