iovisor / bcc

BCC - Tools for BPF-based Linux IO analysis, networking, monitoring, and more
Apache License 2.0

bcc hangs when probing malloc #2046

Open trainman419 opened 5 years ago

trainman419 commented 5 years ago

I'm trying to get some better information about how much time my programs are spending in malloc, how frequently they call malloc and free, and how big their allocations are.

I have a small bcc script that installs BPF uprobes on malloc and free, to collect and summarize this data, and write it to disk as frequently as possible.

When I run stress -m 200 --vm-bytes 1024 (or the program I'm trying to measure), my bcc script hangs while trying to read or write the BPF tables:

^C
Traceback (most recent call last):
  File "./bpf_malloc.py", line 84, in main
    alloc_stats['malloc'][tid.value] = malloc_calls[tid]
KeyboardInterrupt

My full script is:

#!/usr/bin/env python

from __future__ import print_function

import bcc
import json
import time
import traceback

BPF_PROG="""
struct malloc_info {
    u64 total_duration;
    u64 total_size;
    u64 count;
};

BPF_HASH(malloc_start, pid_t, u64);
BPF_HASH(malloc_calls, pid_t, struct malloc_info);
BPF_HASH(free_calls, pid_t, u64);

int probe_malloc(struct pt_regs *ctx) {
    u64 time = bpf_ktime_get_ns();
    size_t size = PT_REGS_PARM1(ctx);
    u32 tgid = (bpf_get_current_pid_tgid() >> 32);
    struct malloc_info malloc_zero = {
        .total_duration = 0,
        .total_size = 0,
        .count = 0
    };

    malloc_start.update(&tgid, &time);
    struct malloc_info *malloc = malloc_calls.lookup_or_init(&tgid, &malloc_zero);
    malloc->total_size += size;
    malloc->count++;
    return 0;
}

int probe_malloc_ret(struct pt_regs *ctx) {
    u64 time = bpf_ktime_get_ns();
    u32 tgid = (bpf_get_current_pid_tgid() >> 32);
    u64 *start = malloc_start.lookup(&tgid);
    if(start) {
        struct malloc_info malloc_zero = {
            .total_duration = 0,
            .total_size = 0,
            .count = 0
        };
        struct malloc_info *malloc = malloc_calls.lookup_or_init(&tgid, &malloc_zero);
        malloc->total_duration += (time - *start);
    }
    return 0;
}

int probe_free(struct pt_regs *ctx) {
    u32 tgid = (bpf_get_current_pid_tgid() >> 32);
    free_calls.increment(tgid);
    return 0;
}
"""

def main():
    b = bcc.BPF(text=BPF_PROG)

    # Attach probes for malloc
    b.attach_uprobe(name="c", sym="malloc", fn_name='probe_malloc')
    b.attach_uretprobe(name="c", sym="malloc", fn_name='probe_malloc_ret')
    b.attach_uprobe(name="c", sym="free", fn_name='probe_free')

    alloc_file = open('allocs.json', 'w')
    alloc_file.write('[\n')

    malloc_calls = b.get_table('malloc_calls')
    free_calls = b.get_table('free_calls')

    start = time.time()
    i = 0
    try:
        while True:
            alloc_stats = {'t': time.time(), 'malloc': {}, 'free': {}}
            for tid in malloc_calls:
                alloc_stats['malloc'][tid.value] = malloc_calls[tid]
                # Zero out the counter the same way that `malloc_calls.zero()` would
                # this attempts to avoid a race condition with the BPF program
                malloc_calls[tid] = malloc_calls.Leaf()

            for tid in free_calls:
                alloc_stats['free'][tid.value] = free_calls[tid].value
                free_calls[tid] = free_calls.Leaf()

            for tid in alloc_stats['malloc']:
                alloc_stats['malloc'][tid] = { field: getattr(alloc_stats['malloc'][tid], field) for field, _ in alloc_stats['malloc'][tid]._fields_ }

            if i != 0:
                alloc_file.write(',\n')
            alloc_file.write(json.dumps(alloc_stats))

            # Sleep for remaining time in interval.
            start += 0.1
            remaining_time = start - time.time()
            if remaining_time > 0:
                time.sleep(remaining_time)
            print("Slack time {:.3f} seconds".format(remaining_time))

            i += 1
    except KeyboardInterrupt:
        traceback.print_exc()

    alloc_file.write(']\n')
    alloc_file.close()

if __name__ == '__main__':
    main()
yonghong-song commented 5 years ago

I can reproduce the issue. I'm not sure what the root cause is yet. Dumping the map with the kernel bpftool seems okay, so this is most likely a bcc issue. Still investigating.

yonghong-song commented 5 years ago

Found the root cause. In the code above, you update the table in the middle of traversing it:

            for tid in malloc_calls:
                alloc_stats['malloc'][tid.value] = malloc_calls[tid]
                # Zero out the counter the same way that `malloc_calls.zero()` would
                # this attempts to avoid a race condition with the BPF program
                malloc_calls[tid] = malloc_calls.Leaf()

This turns into a bpf syscall sequence like:

bpf: get_next_key
bpf: update_element
bpf: get_next_key
bpf: update_element
...

The problem is that update_element may change an element's position in the hash table: it deletes the old element, creates a new one, and inserts it at the head of the bucket's linked list, so the new element will be picked up instead of the old one (which is RCU-protected and not yet released). As a result, the get_next_key/update_element sequence above can loop forever whenever the current and next keys fall in the same bucket.
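
To make the failure mode concrete, here is a rough sketch of the traversal loop that runs in user space when the script iterates over a table and rewrites each entry. This is simplified pseudocode, not the actual bcc internals; get_next_key() and update_element() stand in for the BPF_MAP_GET_NEXT_KEY and BPF_MAP_UPDATE_ELEM bpf(2) commands:

# Simplified sketch (not real bcc code) of the per-element read/zero loop.
key = None
while True:
    key = get_next_key(map_fd, key)    # BPF_MAP_GET_NEXT_KEY: the key after `key`
    if key is None:                    # no more entries, traversal done
        break
    # (the value for this key is read here via BPF_MAP_LOOKUP_ELEM)
    update_element(map_fd, key, zero)  # BPF_MAP_UPDATE_ELEM: deletes the old element
                                       # and re-inserts it at the head of its bucket,
                                       # so get_next_key() can keep walking the same
                                       # bucket forever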

To fix the problem, you can clear the table after the traversal is done, as below:

            alloc_stats = {'t': time.time(), 'malloc': {}, 'free': {}}
            for tid in malloc_calls:
                alloc_stats['malloc'][tid.value] = malloc_calls[tid]
            malloc_calls.clear()

            for tid in free_calls:
                alloc_stats['free'][tid.value] = free_calls[tid].value
            free_calls.clear()
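
One caveat with this pattern, as a side note: the read loop and the clear() are still two separate passes over the map, so anything the BPF program records in between is silently dropped. This is the kind of gap an atomic dump-and-clear would close. A minimal sketch of the window, using the tables from the script above:

# The read and the clear are two separate operations, so updates that land
# in between are lost.
snapshot = {tid.value: malloc_calls[tid] for tid in malloc_calls}
# <-- a probe_malloc() that fires right here updates malloc_calls, but...
malloc_calls.clear()
# ...clear() wipes that update before it is ever read.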
palmtenor commented 5 years ago

This emphasizes the need for a more atomic whole-map dump/clear operation on the kernel side :)) cc @4ast

trainman419 commented 5 years ago

Thanks! I've confirmed that with those changes, my program no longer hangs (but it does get a bit slow).

I've been thinking I wanted an atomic map swap-and-clear operation for a number of the tools that I'm trying to write. I've been reading a lot of source code within the kernel for BPF so I'm getting pretty familiar with it; do you have any advice on how I might go about adding an operation like that?

yonghong-song commented 5 years ago

Commit

commit f1a2e44a3aeccb3ff18d3ccc0b0203e70b95bd92
Author: Mauricio Vasquez B <mauricio.vasquez@polito.it>
Date:   Thu Oct 18 15:16:25 2018 +0200

    bpf: add queue and stack maps

    Queue/stack maps implement a FIFO/LIFO data storage for ebpf programs.
    These maps support peek, pop and push operations that are exposed to eBPF
    programs through the new bpf_map[peek/pop/push] helpers.  Those operations
    are exposed to userspace applications through the already existing
    syscalls in the following way:
...

This added support for new map operations:

+       int (*map_push_elem)(struct bpf_map *map, void *value, u64 flags);
+       int (*map_pop_elem)(struct bpf_map *map, void *value);
+       int (*map_peek_elem)(struct bpf_map *map, void *value);

In your case, you could implement map_pop_elem for hash maps so that the kernel returns the first element in the hash table and removes it in the same operation. This is not as good as an atomic whole-map dump/clear, but at least it is one syscall instead of two.
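
For illustration only: if a pop-style (lookup-and-delete) operation were exposed for hash maps through bcc (it is not today; the commit above only wires it up for the new queue/stack map types), the per-interval drain in the script could look roughly like the sketch below. pop_element() is an invented placeholder, not an existing bcc or kernel API:

# Hypothetical sketch: drain a table one element per syscall, assuming a
# pop-style operation existed for hash maps. pop_element() is a placeholder.
def drain(table):
    stats = {}
    while True:
        entry = pop_element(table)   # would return (key, value) and delete it in one syscall
        if entry is None:            # table drained
            break
        key, value = entry
        stats[key.value] = value
    return stats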

If you really want to do swap-and-clear, you may need to implement a new map op in the kernel. But you need a strong use case for it.

trainman419 commented 5 years ago

Ok, that seems easy enough to add. Which kernel mailing list should I submit my patches to?

My use case for a swap-and-clear is pretty simple:

  • Capture the state of a map without losing data between the read and clear operations.
  • Read out the full state of the map with a single syscall, so that I can dump the full state at about 10Hz (or maybe 100Hz, if it works well enough).

For background, I'm working on a real-time system, so I need high-frequency profiling data to expose the instantaneous performance bottlenecks. Different parts of my system run at rates between 10Hz and 50Hz, so 100Hz would be the Nyquist sampling rate, but I could probably "read the tea leaves" from data captured less frequently.

Do you think that's a strong enough justification for a swap-and-clear, or do you think that I can achieve the rates that I'm looking for with a pop-and-clear operation?

yonghong-song commented 5 years ago

On Mon, Nov 19, 2018 at 6:39 PM Austin notifications@github.com wrote:

> Ok, that seems easy enough to add. Which kernel mailing list should I submit my patches to?

Please follow instructions under https://github.com/torvalds/linux/blob/master/Documentation/bpf/bpf_devel_QA.rst#q-to-which-mailing-list-do-i-need-to-submit-my-bpf-patches

> My use case for a swap-and-clear is pretty simple:
>
>   • Capture the state of a map without losing data between the read and clear operations.

Just for one read-and-clear, the "pop" map method could be implemented so that the value of the first map element is given to user space and that element is deleted in the kernel. But if map updates are quite frequent in your system, you may be stuck in "pop" longer than expected, and this may not be what you want.

>   • Read out the full state of the map with a single syscall, so that I can dump the full state at about 10Hz (or maybe 100Hz, if it works well enough).

Reading the full state of the map with a single syscall is the most accurate way to capture it. However, since the number of entries could be huge, copying everything in one call may not be viable. One possible approach is to allocate the map memory in user space, have the kernel write the data into it, and let user space read it asynchronously; user space can then get the map data pretty easily. Note that the map data may still change while user space traverses it. This scheme would need a pretty big kernel change.

> For background, I'm working on a real-time system, so I need high-frequency profiling data to expose the instantaneous performance bottlenecks. Different parts of my system run at rates between 10Hz and 50Hz, so 100Hz would be the Nyquist sampling rate, but I could probably "read the tea leaves" from data captured less frequently.

I guess you need an operation like "give me all map values and remove all of them in the kernel". I suggest you pick one implementation (e.g., a new bpf subcommand), implement it, and post it to netdev to start the conversation.

> Do you think that's a strong enough justification for a swap-and-clear, or do you think that I can achieve the rates that I'm looking for with a pop-and-clear operation?
