osandov / drgn

Programmable debugger

Question: Is drgn suitable for kernel monitor? #102

Closed ericjoy1 closed 3 years ago

ericjoy1 commented 3 years ago

Hi all, I intend to use drgn for kernel monitoring, because many resource statistics, such as the per-CPU runqueues, all processes and threads, and all NUMA nodes, are already kept in kernel memory.

So I don't need eBPF programs to trace these activities and save them again in eBPF maps; I think I can just use drgn scripts to read them periodically from /proc/kcore.

But when running a drgn ps script, I see a great many pread syscalls. So my question is: is drgn suitable for this kind of periodic kernel monitoring in terms of performance and CPU cost? Or is eBPF more lightweight than drgn for this kind of resource monitoring?

Thanks.

osandov commented 3 years ago

The answer in your case will depend on the specifics. In general, BPF is more lightweight than drgn because BPF has the privilege of running directly in the kernel. drgn runs in userspace and has to access kernel memory through /proc/kcore; most variable accesses have to do a read syscall as a result.

drgn scripts can be optimized to read from /proc/kcore fewer times (e.g., with judicious use of Object.read_()). And, depending on what exactly you're collecting and how often, it's possible that running a drgn script periodically is still better than adding the overhead of a BPF program to whatever event it is attached to.
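To illustrate why batching reads helps, here is a minimal sketch that does not use drgn at all. It assumes a hypothetical fixed-layout record (three 64-bit counters) stored in a temporary file standing in for /proc/kcore, and counts `pread` calls for a member-by-member read versus one read of the whole record. The names `CountingReader`, `read_field_by_field`, and `read_whole_record` are made up for this example. In drgn itself, calling `read_()` on a structure object should play the role of the single bulk read, since it returns a value object whose members can then be accessed from the local copy.

```python
import os
import struct
import tempfile

# Hypothetical record standing in for a kernel struct:
# three little-endian 64-bit counters.
RECORD = struct.Struct("<QQQ")
FIELD = struct.Struct("<Q")


class CountingReader:
    """Wraps a file descriptor and counts pread() calls."""

    def __init__(self, fd):
        self.fd = fd
        self.calls = 0

    def pread(self, size, offset):
        self.calls += 1
        return os.pread(self.fd, size, offset)


def read_field_by_field(reader, offset):
    # One pread per member, like lazily accessing each member in turn.
    return tuple(
        FIELD.unpack(reader.pread(FIELD.size, offset + i * FIELD.size))[0]
        for i in range(3)
    )


def read_whole_record(reader, offset):
    # One pread for the whole record; members are decoded from the local copy.
    return RECORD.unpack(reader.pread(RECORD.size, offset))


def demo():
    with tempfile.TemporaryFile() as f:
        f.write(RECORD.pack(1, 2, 3))
        f.flush()
        lazy = CountingReader(f.fileno())
        bulk = CountingReader(f.fileno())
        lazy_values = read_field_by_field(lazy, 0)
        bulk_values = read_whole_record(bulk, 0)
        return lazy_values, lazy.calls, bulk_values, bulk.calls


if __name__ == "__main__":
    print(demo())
```

Both approaches return the same values, but the bulk read issues one syscall where the member-by-member read issues three. The gap widens with larger structures and with loops over many objects.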

If you can share more details, I might be able to give more specific tips, but like I said it really depends on what exactly you're doing and you might be better off testing both alternatives.
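For the userspace side of such a comparison, a rough timing harness might look like the sketch below. It assumes the sampling work can be wrapped in a Python callable. `measure_sampler` and `fake_sample` are hypothetical names, and `fake_sample` is only a placeholder for a real drgn script body. `time.process_time()` counts both user and system CPU time for the process, so syscall cost from /proc/kcore reads would be included. It does not capture the overhead a BPF program adds to the event it is attached to, which has to be measured separately.

```python
import time


def measure_sampler(sample, iterations=100):
    """Call sample() repeatedly; return average wall and CPU seconds per call."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    for _ in range(iterations):
        sample()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return {
        "wall_per_call": wall / iterations,
        "cpu_per_call": cpu / iterations,
    }


def fake_sample():
    # Placeholder workload; replace with the real sampling function.
    return sum(range(1000))


if __name__ == "__main__":
    print(measure_sampler(fake_sample, iterations=1000))
```

Multiplying `cpu_per_call` by the intended sampling frequency gives a first estimate of the steady-state CPU cost of the monitor.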

ericjoy1 commented 3 years ago

> The answer in your case will depend on the specifics. In general, BPF is more lightweight than drgn because BPF has the privilege of running directly in the kernel. drgn runs in userspace and has to access kernel memory through /proc/kcore; most variable accesses have to do a read syscall as a result.

Thanks for the explanation. So the main cost of drgn is the syscalls and memory copies. I wonder whether kcore could support an mmap interface for reading kernel memory? From the kernel source code, I see no mmap support in kcore. Also, I sometimes encounter "address not mapped" problems when using drgn, which I think may be an unavoidable result of using kcore, because the pages get unmapped in the meantime?

> drgn scripts can be optimized to read from /proc/kcore fewer times (e.g., with judicious use of Object.read_()). And, depending on what exactly you're collecting and how often, it's possible that running a drgn script periodically is still better than adding the overhead of a BPF program to whatever event it is attached to.

Yes, agreed. I have encountered a few cases in which BPF programs and map updates caused high CPU usage and spinlock contention. And the cost of kprobes and kretprobes may be a little high.

> If you can share more details, I might be able to give more specific tips, but like I said it really depends on what exactly you're doing and you might be better off testing both alternatives.

Actually, we are trying to develop a kernel monitoring tool to help us find what is going wrong in the kernel and send out warnings. So the things to be sampled with drgn or traced with eBPF are of many kinds: CPU scheduler latency of some important processes, userspace and kernel memory usage of some memcgs, bio and request latency, and other more specific kernel statistics. Maybe I'd better do some testing like you said ;-)

osandov commented 3 years ago

Sorry for the late reply. I did still want to share some thoughts below.

> Thanks for the explanation, so the main cost of drgn is the syscalls and memory copy. I wonder if the kcore can support mmap interface for reading kernel memory? But from the kernel source code, I see no mmap support in kcore. And I could encounter "address not mapped" problems when using drgn, which I think maybe an unavoidable result of using kcore because the pages are unmapped meanwhile?

Right, /proc/kcore doesn't support mmap, and I'm not sure whether it could. The direct mapping region might be doable since that only changes if memory is hot-plugged/unplugged. vmalloc would be a big problem since it changes all of the time at runtime, and keeping the mmap'd mappings up to date after a vmalloc()/vfree() sounds expensive.

That being said, there are some opportunities to optimize /proc/kcore reads. @sdimitro recently got a big speedup for vmalloc in this commit. There is some low-hanging fruit for normal kernel reads as well. They currently do a page table check which is probably unnecessary because they use copy_from_kernel_nofault() later. They also do an extra copy through a bounce buffer to avoid triggering hardened usercopy. It might be possible to bypass the hardening check instead of doing the extra copy. I'll open a separate issue for optimizing /proc/kcore.

osandov commented 3 years ago

More details in #106.