The Road to eBPF (2023 edition) - GitHub Issues

lzh2nix commented 1 year ago

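Introduction

I had read bits and pieces about eBPF before, but never studied it in depth. Starting in Q2 2023 I plan to focus on a single technical topic per quarter (to avoid spreading myself thin and ending up with nothing to show for it), spending three months getting to know one technology properly.

For Q3 the focus is eBPF, with three KRs:

[ ] KR1 Finish reading "Learning eBPF" and write up study notes
[ ] KR2 Three PRs to eBPF-related projects
[ ] KR3 Read "BPF Performance Tools" (Chinese edition: bpf之巅) and write up study notes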

Content

lzh2nix commented 1 year ago

Learning eBPF Preface && CH1(2023.07.01)

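You have probably seen the book's author, Liz Rice, speak at various CNCF and eBPF conferences. She sits on the CNCF governing board, is TOC emeritus chair, and is Chief Open Source Officer at Isovalent, one of the main companies behind eBPF. The book is fairly easy to read and works well as an introduction to eBPF.

eBPF has probably been the hottest part of the kernel in recent years. New eBPF-based projects appear pretty much every year, mainly around networking, security and observability.

BPF timeline:

1993: Steven McCanne and Van Jacobson published "The BSD Packet Filter: A New Architecture for User-level Packet Capture", which made network packet capture possible
2014: eBPF first landed in the kernel, in version 3.18
2015: eBPF programs could be attached to kprobes
2016: Brendan Gregg built many eBPF-based performance analysis tools ("superpowers for Linux"); the same year the Cilium project was announced, which made eBPF widely known
2017: Facebook built Katran, a layer 4 load balancer, on top of eBPF
2018: eBPF became a separate subsystem within the kernel
2020: eBPF programs could be attached to LSM (Linux Security Module) hooks, which made all kinds of eBPF-based security projects possible

The main reason eBPF has become so widely used is that it simplifies kernel "development": the kernel can be extended without modifying the Linux kernel itself.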


lzh2nix commented 1 year ago

CH2(2023.7.4)

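Every coding exercise starts with "hello world", and eBPF is no exception. The examples in the book are all in Python; out of personal preference I'm using Go here, for two main reasons:

Re-implementing the book's examples in Go should give me a deeper understanding (and let me distill my own set of eBPF utility functions)
Apart from bcc, most of the other projects are written in Go, so Go is a better fit for real project work

I use cilium/ebpf as the underlying library. An eBPF program has two parts: the Go part (the user-space program) and the C part (the eBPF program itself). A simple example follows: every time the execve system call happens, the eBPF program prints "hello world":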

#include "../headers/vmlinux-arm64.h"
#include "../headers/bpf/bpf_helpers.h"

char __license[] SEC("license") = "Dual MIT/GPL";

SEC("kprobe/sys_execve")
int kprobe_execve() {
  bpf_printk("hello world\n");
  return 0;
}

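Then go generate runs bpf2go (-cc $BPF_CLANG -cflags $BPF_CFLAGS bpf hello.c -- -I../headers), which compiles hello.c into eBPF bytecode and generates the Go bindings used by the user-space program (in this example the Go side only loads the eBPF program into the kernel).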

package main

import (
    "log"
    "os"
    "os/signal"
    "time"

    "github.com/cilium/ebpf/link"
    "github.com/cilium/ebpf/rlimit"
)

// $BPF_CLANG and $BPF_CFLAGS are set by the Makefile.
//go:generate go run github.com/cilium/ebpf/cmd/bpf2go -cc $BPF_CLANG -cflags $BPF_CFLAGS bpf hello.c -- -I../headers

const mapKey uint32 = 0

func main() {

    // Name of the kernel function to trace.
    fn := "sys_execve"

    // Allow the current process to lock memory for eBPF resources.
    if err := rlimit.RemoveMemlock(); err != nil {
        log.Fatal(err)
    }

    // Load pre-compiled programs and maps into the kernel.
    objs := bpfObjects{}
    if err := loadBpfObjects(&objs, nil); err != nil {
        log.Fatalf("loading objects: %v", err)
    }
    defer objs.Close()

    // Open a Kprobe at the entry point of the kernel function and attach the
    // pre-compiled program. Each time the kernel function is entered, the
    // program prints "hello world" to the kernel trace pipe.
    kp, err := link.Kprobe(fn, objs.KprobeExecve, nil)
    if err != nil {
        log.Fatalf("opening kprobe: %s", err)
    }
    defer kp.Close()

    // This example has no map to poll, so the ticker is unused; the later
    // counter example reads its map once per tick.
    ticker := time.NewTicker(1 * time.Second)
    defer ticker.Stop()

    log.Println("Waiting for events..")
    sig := make(chan os.Signal, 1)
    signal.Notify(sig, os.Interrupt)
    <-sig
}

The code above has the same effect as the Python version in the book: hello world is printed every time the system call happens.


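In the program above the BPF side prints straight to /sys/kernel/tracing/trace_pipe. In real applications we usually want the hello program to interact with main.go, which is what the various map types are for (full list: https://docs.kernel.org/bpf/maps.html): they are the bridge between the user-space program and the eBPF program. The three main use cases for eBPF maps:

The user-space program pushes configuration down to the eBPF program
Multiple eBPF programs communicate with each other through a map
The eBPF program writes results into a map for the user-space application to consume

The two examples below show how eBPF maps are used; for the other map types, just consult the documentation when you need them.

BPF_MAP_TYPE_HASH

Count how many times each process calls execve: the BPF side writes into the map, and user space reads the values back.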

// +build ignore

#include "../headers/vmlinux-arm64.h"
#include "../headers/bpf/bpf_helpers.h"

char __license[] SEC("license") = "Dual MIT/GPL";

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __type(key, __u32);
    __type(value, __u64);
    __uint(max_entries, 1024);
} counter_map SEC(".maps");

SEC("kprobe/sys_execve")
int kprobe_execve() {
    u32 pid;
    u64 initval = 1, *valp;
    pid = bpf_get_current_pid_tgid() >>32;
    valp = bpf_map_lookup_elem(&counter_map, &pid);
    if (!valp) {
        bpf_map_update_elem(&counter_map, &pid, &initval, BPF_ANY);
        return 0;
    }
    __sync_fetch_and_add(valp, 1);
    return 0;
}

Key part of the Go code:

    go func() {
        for range ticker.C {
            var k uint32
            var v uint64
            iter := objs.CounterMap.Iterate()
            for iter.Next(&k, &v) {
                fmt.Printf("pid(%d) call %s %d times\n", k, fn, v)
            }
        }
    }()
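
To make the kernel-side logic easier to follow, here is a minimal user-space model of it in Go. This is only a sketch: a plain Go map stands in for counter_map, and the names splitPidTgid and recordExecve are made up for illustration. The one fact it relies on is that bpf_get_current_pid_tgid() returns the tgid (what user space calls the PID) in the upper 32 bits and the thread ID in the lower 32 bits, which is why the C code shifts right by 32.

```go
package main

import "fmt"

// splitPidTgid mirrors the bit layout returned by bpf_get_current_pid_tgid():
// upper 32 bits = tgid (the process ID seen from user space),
// lower 32 bits = kernel pid (the thread ID).
func splitPidTgid(v uint64) (tgid, tid uint32) {
	return uint32(v >> 32), uint32(v)
}

// recordExecve models the handler body: look the key up, insert 1 when it is
// missing, otherwise increment. A plain Go map stands in for counter_map.
func recordExecve(counter map[uint32]uint64, pidTgid uint64) {
	tgid, _ := splitPidTgid(pidTgid)
	if _, ok := counter[tgid]; !ok {
		counter[tgid] = 1
		return
	}
	counter[tgid]++
}

func main() {
	counter := map[uint32]uint64{}
	// Hypothetical samples: process 1234 calls execve from two different
	// threads, process 42 calls it once.
	samples := []uint64{1234<<32 | 1234, 1234<<32 | 1240, 42<<32 | 42}
	for _, v := range samples {
		recordExecve(counter, v)
	}
	for pid, n := range counter {
		fmt.Printf("pid(%d) called execve %d times\n", pid, n)
	}
}
```

Because the key is the tgid, calls made from different threads of the same process end up in the same counter.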

BPF_MAP_TYPE_PERF_EVENT_ARRAY

In the previous example we did map[pid]++ on every call. An alternative is to send a raw event straight to user space and parse it there.

// +build ignore

#include "../headers/vmlinux-arm64.h"
#include "../headers/bpf/bpf_helpers.h"

char __license[] SEC("license") = "Dual MIT/GPL";

struct data_t {
  u32 pid;
  u32 uid;
  char command[16];
};
struct {
  __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
} events SEC(".maps");

const struct data_t *unused __attribute__((unused));

SEC("kprobe/sys_execve")
int kprobe_execve(struct pt_regs *ctx) {
  struct data_t data = {};
  data.pid = bpf_get_current_pid_tgid() >> 32;
  data.uid = bpf_get_current_uid_gid() & 0xFFFFFFFF;
  bpf_get_current_comm(&data.command, sizeof(data.command));
  bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &data, sizeof(data));
  return 0;
}

User-space code:

type data_t struct {
    Pid     uint32
    Uid     uint32
    Command [16]byte
}
    var event data_t
    for {
        record, err := rd.Read()
        if err != nil {
            if errors.Is(err, perf.ErrClosed) {
                return
            }
            log.Printf("reading from perf event reader: %s", err)
            continue
        }

        if record.LostSamples != 0 {
            log.Printf("perf event ring buffer full, dropped %d samples", record.LostSamples)
            continue
        }
        // Parse the perf event entry into a data_t structure.
        if err := binary.Read(bytes.NewBuffer(record.RawSample), binary.LittleEndian, &event); err != nil {
            log.Printf("parsing perf event: %s", err)
            continue
        }
        // Command is a fixed 16-byte buffer; trim the trailing NUL padding before printing.
        log.Printf("event.pid = %d , uid = %d, cmd = %s\n", event.Pid, event.Uid, bytes.TrimRight(event.Command[:], "\x00"))
    }
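
The decoding step in the user-space code above can be tried without a kernel. The sketch below hand-builds a 24-byte sample in place of record.RawSample and decodes it the same way; the names dataT, decodeEvent and commandString are made up for illustration, and it assumes a little-endian machine, like the code above.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// dataT mirrors struct data_t from the eBPF side: two u32 fields followed by
// a fixed 16-byte command buffer (24 bytes in total, no padding).
type dataT struct {
	Pid     uint32
	Uid     uint32
	Command [16]byte
}

// decodeEvent parses one raw sample into a dataT. It fails when the sample
// is shorter than the struct.
func decodeEvent(raw []byte) (dataT, error) {
	var ev dataT
	err := binary.Read(bytes.NewReader(raw), binary.LittleEndian, &ev)
	return ev, err
}

// commandString cuts the command buffer at the first NUL byte, since the
// kernel pads the rest of the buffer with zeros.
func commandString(ev dataT) string {
	if i := bytes.IndexByte(ev.Command[:], 0); i >= 0 {
		return string(ev.Command[:i])
	}
	return string(ev.Command[:])
}

func main() {
	// Hand-built sample standing in for record.RawSample:
	// pid=4242, uid=1000, command="ls".
	raw := make([]byte, 24)
	binary.LittleEndian.PutUint32(raw[0:4], 4242)
	binary.LittleEndian.PutUint32(raw[4:8], 1000)
	copy(raw[8:], "ls")

	ev, err := decodeEvent(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("pid=%d uid=%d cmd=%q\n", ev.Pid, ev.Uid, commandString(ev))
}
```

The Go struct has to match the C struct field by field, in the same order and with the same sizes, otherwise binary.Read silently produces wrong values.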

Other map types will be covered further when they come up later.


lzh2nix commented 1 year ago

CH3 Anatomy of an eBPF program(2023.7.10)

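This chapter walks through how an eBPF program works, using one eBPF program plus bpftool:

            C code -----> bytecode -----> machine code

From compiling to the various dumps, it feels like putting the eBPF program under a microscope.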
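
To get a feel for the "bytecode" stage, here is a small Go sketch that decodes raw eBPF instructions. It is not taken from the book's code: it assumes the 8-byte layout of struct bpf_insn from the kernel UAPI headers (opcode, two 4-bit register fields, a 16-bit offset and a 32-bit immediate) on a little-endian machine, and the names insn and decodeInsn are made up for illustration.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// insn mirrors the kernel's struct bpf_insn (8 bytes per instruction).
type insn struct {
	Opcode uint8
	Dst    uint8 // destination register, low 4 bits of byte 1
	Src    uint8 // source register, high 4 bits of byte 1
	Off    int16 // signed offset, used by jumps and memory accesses
	Imm    int32 // signed immediate value
}

// decodeInsn splits one 8-byte instruction into its fields.
func decodeInsn(b [8]byte) insn {
	return insn{
		Opcode: b[0],
		Dst:    b[1] & 0x0f,
		Src:    b[1] >> 4,
		Off:    int16(binary.LittleEndian.Uint16(b[2:4])),
		Imm:    int32(binary.LittleEndian.Uint32(b[4:8])),
	}
}

func main() {
	// A two-instruction program: "r0 = 0; exit".
	prog := [][8]byte{
		{0xb7, 0x00, 0, 0, 0, 0, 0, 0}, // 0xb7: 64-bit move of an immediate
		{0x95, 0x00, 0, 0, 0, 0, 0, 0}, // 0x95: exit
	}
	for i, raw := range prog {
		in := decodeInsn(raw)
		fmt.Printf("%d: opcode=%#02x dst=r%d src=r%d off=%d imm=%d\n",
			i, in.Opcode, in.Dst, in.Src, in.Off, in.Imm)
	}
}
```

bpftool prog dump xlated with the opcodes option prints the same raw bytes next to each instruction, so the two can be compared by hand.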

Compiling to BPF bytecode with clang

hello.bpf.o: %.o: %.c
	clang \
	    -target bpf \
	    -I/usr/include/$(shell uname -m)-linux-gnu \
	    -g \
	    -O2 -c $< -o $@

Loading the BPF program manually: bpftool prog load hello.bpf.o /sys/fs/bpf/hello

Listing the loaded eBPF programs: bpftool prog list

For bpftool usage, see its man page:

Usage: bpftool [OPTIONS] OBJECT { COMMAND | help }
       bpftool batch file FILE
       bpftool version

       OBJECT := { prog | map | link | cgroup | perf | net | feature | btf | gen | struct_ops | iter }
       OPTIONS := { {-j|--json} [{-p|--pretty}] | {-d|--debug} |
                    {-V|--version} }

or this article: https://qmonnet.github.io/whirl-offload/2021/09/23/bpftool-features-thread/


lzh2nix commented 1 year ago

CH4 The bpf() system call(2023.07.11)

User space still talks to the kernel through system calls, and eBPF is no exception: it has its own dedicated system call, bpf:

       int bpf(int cmd, union bpf_attr *attr, unsigned int size);

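Main flow of an eBPF program:

Load the eBPF program into the kernel
Attach it to the chosen events
Read and write maps

All three steps actually go through the bpf() system call. The book traces an eBPF program in detail to observe every system call; here I do the same with the earlier BPF_MAP_TYPE_HASH example: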

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __type(key, __u32);
    __type(value, __u64);
    __uint(max_entries, 1024);
} counter_map SEC(".maps");

SEC("kprobe/sys_execve")
int kprobe_execve() {
    u32 pid;
    u64 initval = 1, *valp;
    pid = bpf_get_current_pid_tgid() >>32;
    valp = bpf_map_lookup_elem(&counter_map, &pid);
    if (!valp) {
        bpf_map_update_elem(&counter_map, &pid, &initval, BPF_ANY);
        return 0;
    }
    __sync_fetch_and_add(valp, 1);
    return 0;
}

Running it under strace (strace -f -e bpf counter) shows the following calls.

Loading BTF

bpf(BPF_BTF_LOAD, {btf="\237\353\1\0\30\0\0\0\0\0\0\0\20\0\0\0\20\0\0\0\1\0\0\0\0\0\0\0\0\0\0\1"..., btf_log_buf=NULL, btf_size=41, btf_log_size=0, btf_log_level=0}, 32) = 3
bpf(BPF_BTF_LOAD, {btf="\237\353\1\0\30\0\0\0\0\0\0\0\30\0\0\0\30\0\0\0\3\0\0\0\1\0\0\0\0\0\0\f"..., btf_log_buf=NULL, btf_size=51, btf_log_size=0, btf_log_level=0}, 32) = 3
bpf(BPF_BTF_LOAD, {btf="\237\353\1\0\30\0\0\0\0\0\0\0\30\0\0\0\30\0\0\0\3\0\0\0\1\0\0\0\1\0\0\f"..., btf_log_buf=NULL, btf_size=51, btf_log_size=0, btf_log_level=0}, 32) = 3
bpf(BPF_BTF_LOAD, {btf="\237\353\1\0\30\0\0\0\0\0\0\08\0\0\08\0\0\0-\0\0\0\1\0\0\0\0\0\0\10"..., btf_log_buf=NULL, btf_size=125, btf_log_size=0, btf_log_level=0}, 32) = 3

Creating counter_map

bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_HASH, key_size=4, value_size=8, max_entries=1024, map_flags=0, inner_map_fd=0, map_name="counter_map", map_ifindex=0, btf_fd=3, btf_key_type_id=1, btf_value_type_id=2, btf_vmlinux_value_type_id=0, map_extra=0}, 72) = 4

Loading the program

[pid 32749] bpf(BPF_PROG_LOAD, {prog_type=BPF_PROG_TYPE_KPROBE, insn_cnt=23, insns=0x40000dc000, license="Dual MIT/GPL", log_level=0, log_size=0, log_buf=NULL, kern_version=KERNEL_VERSION(6, 3, 8), prog_flags=0, prog_name="kprobe_execve", prog_ifindex=0, expected_attach_type=BPF_CGROUP_INET_INGRESS, prog_btf_fd=3, func_info_rec_size=8, func_info=0x400001f900, func_info_cnt=1, line_info_rec_size=16, line_info=0x40000b8100, line_info_cnt=12, attach_btf_id=0, attach_prog_fd=0, fd_array=NULL}, 144) = 8

Accessing the map from user space

bpf(BPF_MAP_LOOKUP_ELEM, {map_fd=4, key=0x4000015008, value=0x4000015010, flags=BPF_ANY}, 32) = 0
bpf(BPF_MAP_GET_NEXT_KEY, {map_fd=4, key=NULL, next_key=0x4000015008}, 24) = 0
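
The two commands in this trace are all a user-space program needs to walk a whole map: BPF_MAP_GET_NEXT_KEY with a NULL key returns the first key, each later call returns the key after the one passed in, and BPF_MAP_LOOKUP_ELEM fetches the value. The Go sketch below imitates that protocol with an in-memory stand-in. It makes no real bpf() calls, fakeMap and its methods are made up for illustration, and the sorted key order is a simplification (a real hash map returns keys in no particular order).

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// errNoEnt stands in for the ENOENT error the kernel returns.
var errNoEnt = errors.New("no such entry")

// fakeMap is an in-memory stand-in for a BPF hash map, not a kernel map.
type fakeMap map[uint32]uint64

// getNextKey imitates BPF_MAP_GET_NEXT_KEY: a nil key asks for the first
// key, otherwise the key following *key is returned. errNoEnt marks the end.
func (m fakeMap) getNextKey(key *uint32) (uint32, error) {
	keys := make([]uint32, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool { return keys[i] < keys[j] })
	for _, k := range keys {
		if key == nil || k > *key {
			return k, nil
		}
	}
	return 0, errNoEnt
}

// lookup imitates BPF_MAP_LOOKUP_ELEM.
func (m fakeMap) lookup(key uint32) (uint64, error) {
	v, ok := m[key]
	if !ok {
		return 0, errNoEnt
	}
	return v, nil
}

// dump walks the whole map: get the next key, look it up, repeat until the
// "kernel" reports that there are no more keys.
func dump(m fakeMap) map[uint32]uint64 {
	out := map[uint32]uint64{}
	var prev *uint32
	for {
		k, err := m.getNextKey(prev)
		if err != nil {
			return out
		}
		if v, err := m.lookup(k); err == nil {
			out[k] = v
		}
		prev = &k
	}
}

func main() {
	// Hypothetical contents: pid -> number of execve calls.
	m := fakeMap{4242: 3, 17: 1, 900: 12}
	fmt.Println(dump(m))
}
```

Each entry costs two system calls this way, which is one reason the kernel later gained batched lookup commands.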


lzh2nix commented 1 year ago

CH5 CO-RE, BTF, and libbpf(2023.07.14)

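We already saw BTF (BPF Type Format) in the previous chapter. Its main purpose is portability of BPF programs, in short: compile once, run everywhere. None of the examples so far needed to access kernel data structures, but since eBPF is the bridge between user programs and the kernel, accessing kernel data structures is inevitable. The kernel itself keeps evolving, and it is not practical to bundle the headers of every kernel version and pick a different struct definition per version. That is where CO-RE comes in.

BCC's answer to this problem is to ship an LLVM toolchain and compile on the target machine right before running, based on that machine's actual environment. That has the following problems:

The LLVM toolchain is large; installing it on every machine is not realistic
Compiling at run time is slow
Debugging Python and LLVM build errors on production machines is painful (and burns precious incident recovery time)

CO-RE was created to address these pain points.

Core components of CO-RE:

BTF (BPF Type Format): the data structures and function signatures the kernel exposes for BPF to use; it is what determines the layout, at compile time and at run time, of the structs a BPF program accesses. bpftool can generate human-readable code from BTF data. One thing to note: BTF support only arrived in the 5.4 kernel, and the kernel must be built with CONFIG_DEBUG_INFO_BTF=y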
```
bpftool btf dump file /sys/kernel/btf/vmlinux format c
```
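Compiler support: pass -g to clang so that it emits the information needed for BPF relocations
eBPF loader support: at load time the loader adjusts the eBPF bytecode for the running kernel version (based on the relocation information recorded at compile time)

A quick look at how it works. At compile time we include a vmlinux.h file that contains the concrete struct definitions (taking file_system_type as an example):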

struct file_system_type {
        const char *name;
        int fs_flags;
        int (*init_fs_context)(struct fs_context *);
        const struct fs_parameter_spec *parameters;
        struct dentry * (*mount)(struct file_system_type *, int, const char *, void *);
        void (*kill_sb)(struct super_block *);
        struct module *owner;
        struct file_system_type *next;
        struct hlist_head fs_supers;
        struct lock_class_key s_lock_key;
        struct lock_class_key s_umount_key;
        struct lock_class_key s_vfs_rename_key;
        struct lock_class_key s_writers_key[3];
        struct lock_class_key i_lock_key;
        struct lock_class_key i_mutex_key;
        struct lock_class_key invalidate_lock_key;
        struct lock_class_key i_mutex_dir_key;
};

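When the BPF program is compiled, the names of the fields it accesses are recorded as well (type debug info); on the machine where it actually runs, the kernel's BTF is then used to find each field's offset at run time: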

[189] STRUCT 'file_system_type' size=72 vlen=17
        'name' type_id=3 bits_offset=0
        'fs_flags' type_id=11 bits_offset=64
        'init_fs_context' type_id=1113 bits_offset=128
        'parameters' type_id=1115 bits_offset=192
        'mount' type_id=1117 bits_offset=256
        'kill_sb' type_id=1092 bits_offset=320
        'owner' type_id=207 bits_offset=384
        'next' type_id=952 bits_offset=448
        'fs_supers' type_id=174 bits_offset=512
        's_lock_key' type_id=201 bits_offset=576
        's_umount_key' type_id=201 bits_offset=576
        's_vfs_rename_key' type_id=201 bits_offset=576
        's_writers_key' type_id=1118 bits_offset=576
        'i_lock_key' type_id=201 bits_offset=576
        'i_mutex_key' type_id=201 bits_offset=576
        'invalidate_lock_key' type_id=201 bits_offset=576
        'i_mutex_dir_key' type_id=201 bits_offset=576

So when we access file_system_type->i_mutex_dir_key, the loader knows the field's actual offset inside the struct.
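
The lookup itself is simple enough to sketch in Go. The first table below copies a few members from the file_system_type dump above; the second one is a hypothetical layout of some other kernel, invented only to show that the same field name can resolve to a different offset. The names field and byteOffset are made up for illustration, and a real loader of course works on the binary BTF data rather than on a slice like this.

```go
package main

import "fmt"

// field is one member line from a BTF struct dump.
type field struct {
	Name       string
	BitsOffset uint32
}

// byteOffset looks a member up by name and converts its bit offset to a
// byte offset. The second result is false when the member does not exist.
func byteOffset(fields []field, name string) (uint32, bool) {
	for _, f := range fields {
		if f.Name == name {
			return f.BitsOffset / 8, true
		}
	}
	return 0, false
}

func main() {
	// Members copied from the file_system_type dump above.
	running := []field{
		{"name", 0}, {"fs_flags", 64}, {"owner", 384},
		{"next", 448}, {"i_mutex_dir_key", 576},
	}
	// Hypothetical layout of another kernel where one 8-byte member in
	// front of "owner" does not exist (values invented for illustration).
	other := []field{
		{"name", 0}, {"fs_flags", 64}, {"owner", 320}, {"next", 384},
	}

	for _, name := range []string{"owner", "next"} {
		a, _ := byteOffset(running, name)
		b, _ := byteOffset(other, name)
		fmt.Printf("%s: byte offset %d here, %d on the other kernel\n", name, a, b)
	}
}
```

The program only names the field; the number that ends up in the load instruction is filled in per kernel, which is the whole point of CO-RE.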

Functions work the same way. The definition in vmlinux.h:

typedef u64 (*btf_bpf_trace_printk)(char *, u32, u64, u64, u64);

The definition in BTF:

[8188] TYPEDEF 'btf_bpf_trace_printk' type_id=8189
[8189] PTR '(anon)' type_id=8190
[8190] FUNC_PROTO '(anon)' ret_type_id=60 vlen=5
        '(anon)' type_id=16
        '(anon)' type_id=59
        '(anon)' type_id=60
        '(anon)' type_id=60
        '(anon)' type_id=60

The concrete parameter types are found by following the type_id references one after another.
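
Following the references can be sketched as a small chain walk in Go. The table holds only the three records from the dump above; the parameter and return type IDs (16, 59, 60) are left out because their definitions are not shown here. btfType and resolve are made-up names, and real BTF records carry more information than this simplified struct.

```go
package main

import "fmt"

// btfType is a heavily simplified BTF record: a kind, a name and the ID of
// the type it refers to (0 when this sketch does not follow it any further).
type btfType struct {
	Kind string
	Name string
	Ref  uint32
}

// resolve follows type_id references starting at id and returns the kinds
// it visits, in order. It stops on an unknown ID, on 0, or on a cycle.
func resolve(types map[uint32]btfType, id uint32) []string {
	var chain []string
	seen := map[uint32]bool{}
	for id != 0 && !seen[id] {
		t, ok := types[id]
		if !ok {
			break
		}
		seen[id] = true
		chain = append(chain, t.Kind)
		id = t.Ref
	}
	return chain
}

func main() {
	// The three records from the dump above.
	types := map[uint32]btfType{
		8188: {"TYPEDEF", "btf_bpf_trace_printk", 8189},
		8189: {"PTR", "(anon)", 8190},
		8190: {"FUNC_PROTO", "(anon)", 0}, // parameters are not modeled
	}
	fmt.Println(resolve(types, 8188))
}
```

So btf_bpf_trace_printk reads as: a typedef of a pointer to a function prototype, exactly like the C declaration.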

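The rest of the chapter is much the same as the Go code above, so I won't show it here. The BPF_CORE_READ() macro is really handy, though: without it you have to fetch pointers one level at a time, which is awkward to write.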


lzh2nix commented 1 year ago

CH6 The eBPF verifier(2023.7.22)

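The verifier's job is to check your eBPF program when it is loaded and make sure it is safe, so that it cannot damage the kernel.

In practice it checks your code for all kinds of problems by evaluating it rather than executing it, much like a static analysis tool does for Python code. It checks:

Whether helper functions are used correctly
Whether the arguments passed to helper functions are valid
How pointers are dereferenced
How ctx is used
Whether loops can run forever
Whether the return code is valid
Whether opcodes are valid
Invalid instructions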


lzh2nix commented 1 year ago

CH7 eBPF Program and attachment Types(2023.8.30)

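eBPF programs fall into two broad categories: tracing and networking

Tracing programs come in the following types:

kprobe/kretprobe: almost every kernel function can be probed, except those listed in /sys/kernel/debug/kprobes/blacklist
tracepoint/raw tracepoint
fentry/fexit (x86 since 5.5, ARM since 6.0)
perf events
uprobe/uretprobe

Networking-related programs

The basic examples can all be found in the cilium/ebpf repository: https://github.com/cilium/ebpf/tree/main/examples.

References for writing eBPF code: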


lzh2nix / articles

The Road to eBPF (2023 edition) #169
