osandov / drgn

Programmable debugger

DWARFless Debugging #176

Open brenns10 opened 2 years ago

brenns10 commented 2 years ago

Last updated: 2024-03-28

This issue tracks support for non-DWARF sources of debugging information: specifically for the Linux kernel, but hopefully including userspace as we go. I'm editing this initial issue comment as the project takes shape, so hopefully this can provide at-a-glance status information.

Overview

Drgn needs several kinds of information to understand and debug programs. It uses type information, symbols, "object finding", unwind info, file+line number mappings, and probably other kinds of info I missed. Currently, all of this information comes from DWARF debug information (except symbols, which are just from the ELF symbol table). Some of these kinds of information have extensible APIs (such as object finders and type finders) but others, such as symbol tables, aren't extensible.

This issue tracks the work necessary to support non-DWARF sources of debugging information. While DWARF is the major contender in this area, it has been criticized for being a bit "heavy" (large file sizes). As a result, the DWARF information is typically stripped from binaries in common Linux distributions, and commonly packaged in a separate package (e.g. "foo-debuginfo" for package "foo"). Sometimes, distributions offer debuginfod, which is a way to serve the relevant debuginfo files only when necessary, without the need to install the debuginfo packages. Drgn has support for this (if built from source, not installed from pip), and the support will be improving soon. Debuginfod is great, but sadly not an option for many.

In some cases, it may not be possible to install DWARF debuginfo: maybe it was never generated in the first place, maybe it was stripped and not placed in a debuginfo package or debuginfod server, or maybe there's no internet access, lack of disk space, or a strict policy against installing debuginfo packages (yes, really, I've had to deal with that). Then, you might want to use a more compact format, such as the Compact Type Format (CTF), or BPF Type Format (BTF).

The Linux kernel frequently comes with BTF data built in (depending on config). Some Linux distributions (Oracle Linux w/ UEK kernel) come with CTF data packaged in the normal kernel package. The Linux kernel also comes with a symbol table - kallsyms (again, depending on config). The Linux kernel also frequently uses either frame pointers, or a stack unwinding information format called ORC, for stack unwinding. If all these pieces could be combined, then most of the features of drgn could be usable without the DWARF information. That's what this issue is about.
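
To make the kallsyms piece concrete, here is a minimal sketch of reading the text form of the table, as a running kernel exposes it at /proc/kallsyms (one `ADDRESS TYPE NAME [MODULE]` entry per line). It only illustrates the shape of the data; it is not how the kallsyms_finder branch works, and `KallsymsEntry`, `parse_kallsyms`, `index_by_name` and the sample addresses are made up for the example.

```python
from typing import Dict, List, NamedTuple, Optional


class KallsymsEntry(NamedTuple):
    address: int
    kind: str               # single-letter symbol type, in the style of nm(1)
    name: str
    module: Optional[str]   # None for symbols in the core kernel image


def parse_kallsyms(text: str) -> List[KallsymsEntry]:
    """Parse text laid out like /proc/kallsyms: ADDRESS TYPE NAME [MODULE]."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        module = fields[3].strip("[]") if len(fields) > 3 else None
        entries.append(
            KallsymsEntry(int(fields[0], 16), fields[1], fields[2], module)
        )
    return entries


def index_by_name(entries: List[KallsymsEntry]) -> Dict[str, List[KallsymsEntry]]:
    """Group entries by name; names are not unique, so keep a list per name."""
    index: Dict[str, List[KallsymsEntry]] = {}
    for entry in entries:
        index.setdefault(entry.name, []).append(entry)
    return index


# Hypothetical sample data, not taken from a real kernel.
SAMPLE = """\
ffffffff81000000 T _text
ffffffff81a2b3c0 D slab_caches
ffffffffc0123000 t helper_fn\t[example_mod]
"""

entries = parse_kallsyms(SAMPLE)
by_name = index_by_name(entries)
print(hex(by_name["slab_caches"][0].address))
```

Note that this table carries names and addresses only. It has no type information, which is why a type format such as CTF or BTF is still needed alongside it.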

Objectives

The currently agreed upon end goal here is to get a pluggable symbol finder and vmlinux kallsyms implementation merged and available by default for Drgn. These are useful in and of themselves:

Once these are available, we will get a basic CTF implementation for the Linux kernel merged. The basic ground rules here are that it will be disabled at compile-time in the PyPI wheel distributions, and won't muddle into the internals of Drgn. Essentially, there should be some build-system related changes, and a file named ctf.c, and maybe some python wrappers, and that's it.

Here are some non-goals at the moment. They may be revisited in time.

  1. CTF support for userspace programs. Any program compiled with -gctf has a .ctf section. Compare that to the kernel implementations, which now just create a vmlinux.ctfa (ctfa = CTF Archive) file. As of now, the CTF implementation does support very simple userspace cores in order to add simple unit tests. However, proper support will need better integration with the drgn debug info system, so it won't be officially supported for the initial step.
  2. Automatic detection and use of CTF data. This would require tangling the CTF implementation into the core a bit more than we want to do initially. Also, it might conflict with some of the work on the Module API.
  3. Enabling CTF support in PyPI wheels.

Roadmap

  1. [x] First pull request: #316 Add VMCOREINFO to the Linux special objects. This is not controversial, it's just a nice piece of information to expose to Python helpers. The current design of my helpers does need this, but even if it didn't, it would be a useful piece.
  2. [x] Second pull request: #241 Pluggable Symbol finder API. Allows C and Python to register "symbol finders". Merged.
  3. [ ] Third pull request: #388 Add symbol finders for kallsyms (vmlinux & module)
  4. [ ] Fourth pull request: Adding CTF implementation. (see ctf branch).
  5. [ ] Possible fifth pull request: Support ORC unwinding without ELF files. (see ctf branch).
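
As a rough mental model of what "registering a symbol finder" in step 2 means, the sketch below chains several lookup callbacks behind one query function. It is a standalone toy and deliberately does not import drgn: `Sym`, `FinderChain` and `table_finder` are hypothetical names, and the real API in #241 may differ in both naming and signature.

```python
from typing import Callable, List, NamedTuple, Optional


class Sym(NamedTuple):
    name: str
    address: int
    size: int


# A finder takes an optional name, an optional address, and a flag asking
# for at most one result; it returns every symbol that matches.
Finder = Callable[[Optional[str], Optional[int], bool], List[Sym]]


class FinderChain:
    """Asks each registered finder in turn and merges the results."""

    def __init__(self) -> None:
        self._finders: List[Finder] = []

    def register(self, finder: Finder) -> None:
        self._finders.append(finder)

    def find(self, name=None, address=None, one=False) -> List[Sym]:
        results: List[Sym] = []
        for finder in self._finders:
            results.extend(finder(name, address, one))
            if one and results:
                return results[:1]  # stop at the first finder that answers
        return results


def table_finder(symbols: List[Sym]) -> Finder:
    """Build a finder backed by a plain in-memory list of symbols."""

    def finder(name, address, one):
        matches = [
            s for s in symbols
            if (name is None or s.name == name)
            and (address is None or s.address <= address < s.address + s.size)
        ]
        return matches[:1] if one else matches

    return finder


chain = FinderChain()
chain.register(table_finder([Sym("slab_caches", 0x1000, 16),
                             Sym("init_task", 0x2000, 64)]))
print(chain.find(address=0x1008, one=True))
```

The point of the indirection is that a kallsyms-backed finder (step 3) can be plugged in next to, or instead of, an ELF-backed one without the callers changing.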

Current branches

These are links to branches that contain my current work, and they do roughly correspond to different points in the roadmap above. They are stacked, each one building on the prior one. They are subject to being rebased and force pushed at any time.

  1. ~symbol_finder - this branch adds the pluggable symbol finder API.~
  2. kallsyms_finder - this branch adds the kallsyms implementation
  3. ctf - this branch adds the CTF implementation. It also has the necessary plumbing to use ORC for unwinding, without needing to read it from ELF debuginfo files.
  4. btf_2024 - this branch adds a small BTF implementation. Due to the (current) nature of BTF on Linux, type definitions are provided only for functions and some percpu variables. The branch has some workaround helpers for this, but it will be some separate work to add variable definitions into BTF in the upstream Linux kernel.

If you take the latest branch (ctf) on an Oracle Linux 9 machine using UEK, then you should be able to build it and install it against a local kernel without installing any debuginfo packages!

sudo dnf config-manager --enable ol9_developer_EPEL
sudo dnf config-manager --enable ol9_developer
sudo dnf config-manager --enable ol9_appstream
sudo dnf config-manager --enable ol9_codeready_builder

sudo dnf install -y make autoconf automake libtool gcc-c++ git \
                    python3-devel elfutils-devel binutils-devel \
                    bzip2-devel zlib-devel xz-devel \
                    libkdumpfile-devel

cd drgn
python setup.py build_ext -i
sudo python -m drgn
# tada

Future Work

Old Branches & Work

I have created a few prototype branches on older drgn versions. Only the ones mentioned above are actively maintained and developed. The ones below are older and no longer maintained. For the most part, the commits in these branches were used as the basis for more recent branches, so it's not like the work is lost. The below list is from oldest to newest.

brenns10 commented 1 year ago

Update today:

brenns10 commented 1 year ago

Updated today:

Some of the new things in there include:

brenns10 commented 1 year ago

Updated today:

A tentative roadmap based on discussions with Omar. Link to the first (small) pull request of several in the series.

brenns10 commented 1 year ago

Some more technical notes from the discussion so I remember them when I work on the relevant parts:

  1. Ownership of symbol names is tricky once you start allowing Python code to create symbols -- or any code with shorter lifetimes than the Program. My current solution is to add a flag (name_owned) to the symbol object, which indicates that ownership of the string is passed into the symbol, and it will be freed when the symbol is destroyed. However, we discussed an alternative, where the Python layer creates a set of interned strings, and each time a symbol is created, we check if the string exists already. If so, we just use the interned copy. If not, we add it to the set and use it. After a bit of a back and forth, we couldn't really decide on which approach is better. I offered to implement the alternative where we intern the strings so we have a point of comparison.
  2. The symbol finder API currently does not expose module information at all, but that would be extremely useful (or just necessary). When looking up symbols by address, we should unconditionally include the module information if not already present, since it should be a relatively cheap tree lookup. We can allow the symbol finder to take the module as an additional constraint. However, until the module API is exposed to Python, this will have to be omitted for the Python version of this API.
  3. We were frequently nerd sniped in our discussion, and one mostly unrelated project / idea that I mentioned was to allow the user to override the vmcoreinfo (or the information contained therein) used by drgn. This would allow Drgn to more gracefully handle cases where the info is wrong or missing. I'm interested in this problem but it's fully unrelated to the task at hand :)
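
The interning alternative from point 1 can be sketched in a few lines. This is only a Python model of the idea (one shared copy per distinct name, owned by a table that outlives the symbols); the real question is about string ownership in the C layer, and `NameTable` is a hypothetical name.

```python
class NameTable:
    """Keeps exactly one shared copy of each distinct symbol name."""

    def __init__(self) -> None:
        self._names = {}

    def intern(self, name: str) -> str:
        # If an equal string is already stored, return that stored copy;
        # otherwise store this one and return it.
        return self._names.setdefault(name, name)

    def __len__(self) -> int:
        return len(self._names)


table = NameTable()
first = "".join(["slab", "_caches"])   # two equal strings built separately
second = "".join(["slab", "_caches"])
shared_a = table.intern(first)
shared_b = table.intern(second)
print(shared_a is shared_b, len(table))
```

With this approach a symbol never owns its name, so no per-symbol flag such as name_owned is needed; the cost is one table lookup per symbol created.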

brenns10 commented 1 year ago

Note to self: update CTF branch for 30ecdd901ea3b748c74fd86cd8448ba59dc9a6ef

brenns10 commented 8 months ago

The ctf, symbol_finder, and kallsyms_finder branches are updated based on my latest rebase, which is on main, which is currently 3 commits ahead of 0.0.25, so this is more or less the 0.0.25 update.

With these changes, on one benchmark I have (which reads 100k elements from the dentry hash table), the CTF implementation now outperforms the DWARF:

brenns10 commented 7 months ago

I've updated the ctf branch with some new work that enables the use of ORC to unwind stacks when there are no ELF debuginfo files loaded.

As it is now, drgn reads ORC only from the ELF files it has already located. If there are no ELF debuginfo files, there's no ORC. A consequence of this is that the unwinder is tightly coupled with the drgn_module API, and as it is now, when there is no ELF debuginfo file, there is no corresponding drgn_module with which to associate ORC info.

So my changes add a binary tree mapping address ranges directly to ORC information, and this can be used regardless of what debuginfo is loaded. I'd expect that in the future as this goes upstream, there will be a better solution (e.g. the module API refactor) which eliminates the need for this.
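
The lookup this enables (given a program counter, find the ORC data covering it) can be sketched as below. Note the swap: the ctf branch uses a binary tree in C, while this toy uses a sorted list with `bisect`, which gives the same lookup behaviour for non-overlapping ranges. `RangeMap` and the example addresses are hypothetical.

```python
import bisect


class RangeMap:
    """Maps non-overlapping half-open ranges [start, end) to values."""

    def __init__(self) -> None:
        self._starts = []    # range start addresses, kept sorted
        self._entries = []   # (start, end, value), parallel to _starts

    def insert(self, start: int, end: int, value) -> None:
        if start >= end:
            raise ValueError("empty or inverted range")
        i = bisect.bisect_left(self._starts, start)
        # Reject overlap with the neighbour on either side.
        if i > 0 and self._entries[i - 1][1] > start:
            raise ValueError("range overlaps previous entry")
        if i < len(self._entries) and self._entries[i][0] < end:
            raise ValueError("range overlaps next entry")
        self._starts.insert(i, start)
        self._entries.insert(i, (start, end, value))

    def lookup(self, address: int):
        # Find the last range starting at or before the address, then
        # check that the address falls before that range's end.
        i = bisect.bisect_right(self._starts, address) - 1
        if i >= 0:
            start, end, value = self._entries[i]
            if address < end:
                return value
        return None


orc = RangeMap()
orc.insert(0x1000, 0x2000, "vmlinux ORC")
orc.insert(0x3000, 0x3800, "module ORC")
print(orc.lookup(0x1abc), orc.lookup(0x2800))
```

Because the map is keyed by address alone, it works whether or not any drgn_module exists for that range, which is the property the paragraph above relies on.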

brenns10 commented 5 months ago

Updated with the latest progress, #241 is merged and #388 is filed. Also updated some of the discussion related to ORC and userspace support (ORC support is now present in the ctf branch and userspace is partially supported in order to help with unit testing).

oshaked1 commented 3 months ago

Hi @brenns10, first of all I want to say I really appreciate your work on this set of features. I am working on a project that will use drgn as an automatic incident response tool on production servers and these features will be very useful for me. Specifically, I'm very interested in BTF support as most servers I will be encountering do not come with CTF.

Can you give a quick update on the status of BTF support? Will it be available anytime soon? Is there any way I can help with it?

Thanks a lot!

brenns10 commented 3 months ago

Hi @oshaked1 thanks, I'm glad this feature is interesting to you!

In terms of drgn development progress, BTF and CTF are both type formats, and so both of them require using kallsyms as a symbol table. Basically, steps 1-3 (as well as 5) on my roadmap are shared for both CTF and BTF -- the only real difference is which type format gets used. So once step 3 is done (#388), a usable BTF implementation could be written & merged independently of CTF. My very old BTF branch here will be a good place to start.

The major snag is that currently, the Linux kernel does not include BTF for kernel variables, just functions. So, while the BTF implementation could be updated and prepared for submission, it would not be useful until we get the Linux kernel (and its BTF generation program, pahole) to generate BTF for variables as well as functions. On the other hand CTF (which is not upstream yet, so it's another major snag) has that information already.

For my development plans, I'm prioritizing the CTF implementation first, but I do plan to work on BTF after that. If you or somebody else wanted to take up that work in order to see it merged faster, I'd be happy to help with design & review.

oshaked1 commented 2 months ago

Hi @brenns10, thanks for the reply and sorry for the delay. For my use case, I can live without having types for kernel variables, as long as I can specify them manually. Would it be useful for the project if I were to contribute a solution that doesn't address the variable type issue? It would still be quite useful for use cases where the types of the variables being accessed are known, e.g. scripts. I could also leave my branch public so it can be used or built upon when the Linux kernel tooling is ready.

Regarding implementation - I saw you mentioned switching to using libbtf. Did you mean libbpf? Also, how up to date is the code in your BTF branch as far as integration with the drgn framework goes? Have there been significant changes to the type finder system?

brenns10 commented 2 months ago

Would it be useful for the project if I were to contribute a solution that doesn't address the variable type issue?

The BTF type finder itself in drgn would look pretty much the same, regardless of whether Linux actually contained the types for the variables or not. So from that perspective, it would be useful! But without the Linux support, the code in drgn would not be terribly useful. Even though BTF currently contains many types, it only contains those referenced by the functions and percpu variables it covers. While this is a lot of types, it's not all types, so you wouldn't always be able to write scripts even if you specified the type for a variable.

Regardless, it would be helpful to have the BTF branch updated since we would love to be able to use BTF someday, and it's a step in the right direction.

Did you mean libbpf?

Yes, I meant libbpf, sorry! I saw that there are some BTF-related functions in it, and I haven't evaluated whether it is possible to use it for BTF.

Honestly, I think I could get the BTF branch up and running on a much more recent version of drgn over the course of a couple hours. There's not a ton of drgn-specific stuff that has changed; instead, it's mainly that my kallsyms support has improved a lot. It's perhaps a bit silly of me to suggest you update it, given that I have all the context, and I haven't done much in terms of documenting things! I will try to take the time to take a crack at it tomorrow, and report back on whether it was as straightforward as I hope it is.

brenns10 commented 2 months ago

Ok, I pushed a new branch btf_2024 which contains my updated BTF implementation. The work wasn't too difficult; the time went into:

  1. Rebasing the CTF branch on the latest main branch. This was necessary due to the new changes to the symbol finder API.
  2. Updating the BTF implementation with a few minor API differences. The only major API difference was that type finders now accept a bitfield of requested type kinds, rather than being called once for each requested type kind.
  3. Updating the BTF implementation to support the new BTF_KIND_ENUM64 for 64-bit enums. To be fair this is entirely untested, but it "should work".
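
For context on item 3: in the kernel's BTF format, an enum64 type is followed by one 12-byte record per enumerator (name offset, low 32 bits, high 32 bits), and the kind flag in the type's info word says whether values are signed. The sketch below decodes that layout in Python purely as an illustration. It is not the code in the btf_2024 branch, and the function names and sample bytes are made up.

```python
import struct

BTF_KIND_ENUM64 = 19  # kind number assigned to 64-bit enums in BTF


def decode_info(info: int) -> dict:
    """Split the 32-bit info word of a BTF type header into its fields."""
    return {
        "kind": (info >> 24) & 0x1F,   # bits 24-28
        "vlen": info & 0xFFFF,         # bits 0-15: number of enumerators
        "signed": bool(info >> 31),    # bit 31: kind flag (signed for enums)
    }


def decode_enum64_members(data: bytes, vlen: int, signed: bool,
                          byteorder: str = "<"):
    """Decode vlen records of (name_off, val_lo32, val_hi32)."""
    members = []
    for i in range(vlen):
        name_off, lo, hi = struct.unpack_from(byteorder + "III", data, i * 12)
        value = (hi << 32) | lo
        if signed and value >= 1 << 63:
            value -= 1 << 64  # reinterpret as two's complement
        members.append((name_off, value))
    return members


# Hypothetical little-endian sample: two enumerators, -1 and 1 << 32.
info = (1 << 31) | (BTF_KIND_ENUM64 << 24) | 2
raw = struct.pack("<IIIIII", 1, 0xFFFFFFFF, 0xFFFFFFFF, 5, 0, 1)
fields = decode_info(info)
print(decode_enum64_members(raw, fields["vlen"], fields["signed"]))
```

The name offsets index into the BTF string section, which this sketch leaves out.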

It's based directly on top of the CTF branch, so it can make use of the existing support for ORC & kallsyms. I'd recommend reading the commit messages of the two BTF-related commits at the top of this branch.

I tested it on a 5.15-based vmcore I had lying around, but you can also run it on the latest vmtest kernels. Here's an example:

python -m vmtest -k 6.9*
# within the virtual machine:
python -m drgn --no-default-symbols
>>> load_btf(prog)
>>>

Like I said, there's no general mapping from variable name to type, which means that for most variables, you'd need to explicitly specify the type. I added a var function as syntactic sugar for that.

>>> prog["slab_caches"]
Traceback (most recent call last):
  File "<console>", line 1, in <module>
KeyError: 'slab_caches'
>>> Object(prog, "struct list_head", address=prog.symbol("slab_caches").address)
(struct list_head){
        .next = (struct list_head *)0xffffa273022b8168,
        .prev = (struct list_head *)0xffffa27301042068,
}
>>> from drgn.helpers.linux.btf import var
>>> var(prog, "slab_caches", "struct list_head")
(struct list_head){
        .next = (struct list_head *)0xffffa273022b8168,
        .prev = (struct list_head *)0xffffa27301042068,
}

I did add a few core variable types into a special "hardcoded" object finder, because there are some kernel variables that drgn tries to access internally, and having the types handy prevents them from failing. But there's not really any guarantee that all the types you'd like to refer to will be present, unfortunately. However, it's enough to get drgn's built-in thread API to work, and thanks to the already present ORC plumbing, the stack tracing even works!

>>> for thread in prog.threads():
...     print(thread.object.comm.string_().decode())
...     print(thread.stack_trace())
...
init
#0  __schedule+0x4d0/0x512
#1  schedule+0x2a/0x41
#2  do_wait+0xcb/0xf5
#3  kernel_wait4+0xd8/0x131
#4  __do_sys_wait4+0x49/0x9e
#5  do_syscall_64+0x82/0xe0
#6  entry_SYSCALL_64_after_hwframe+0x76/0x7e
#7  0x7ff936820a7a
kthreadd
#0  __schedule+0x4d0/0x512
#1  schedule+0x2a/0x41
#2  kthreadd+0x72/0x11f
#3  ret_from_fork+0x20/0x35
#4  ret_from_fork_asm+0x1a/0x30
pool_workqueue_
#0  __schedule+0x4d0/0x512
#1  schedule+0x2a/0x41
#2  kthread_worker_fn+0x154/0x1b8
#3  kthread+0xe0/0xeb
#4  ret_from_fork+0x20/0x35
#5  ret_from_fork_asm+0x1a/0x30
kworker/R-rcu_g
#0  __schedule+0x4d0/0x512
#1  schedule+0x2a/0x41
#2  rescuer_thread+0x224/0x23e
#3  kthread+0xe0/0xeb
#4  ret_from_fork+0x20/0x35
#5  ret_from_fork_asm+0x1a/0x30
...

oshaked1 commented 2 months ago

Looks great! Thanks so much for your effort, definitely would have taken me much longer to figure this out. The var function is just what I was looking for. I'm having some trouble building your branch, I'm using scripts/build_dists.sh, but after all wheels are built I get the following error:

Processing /tmp/manylinux_wheels/drgn-0.0.26+unknown-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Installing collected packages: drgn
Successfully installed drgn-0.0.26+unknown
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
+ /opt/python/cp310-cp310/bin/drgn --version
Traceback (most recent call last):
  File "/opt/python/cp310-cp310/bin/drgn", line 5, in <module>
    from drgn.cli import _main
  File "/opt/_internal/cpython-3.10.14/lib/python3.10/site-packages/drgn/__init__.py", line 49, in <module>
    from _drgn import (
ImportError: /opt/_internal/cpython-3.10.14/lib/python3.10/site-packages/_drgn.cpython-310-x86_64-linux-gnu.so: undefined symbol: ctf_dict_open

I get the same error when using the built wheel outside the build container as well.

Before that I also got the following error:

../../libdrgn/btf.c: In function ‘drgn_btf_lookup_enumval’:
../../libdrgn/btf.c:822:2: error: a label can only be part of a statement and a declaration is not a statement
  822 |  struct drgn_qualified_type qt;
      |  ^~~~~~
../../libdrgn/btf.c:823:2: error: expected expression before ‘struct’
  823 |  struct drgn_error *err = drgn_btf_type_create(bf, tid, &qt);
      |  ^~~~~~
../../libdrgn/btf.c:824:6: error: ‘err’ undeclared (first use in this function)
  824 |  if (err)
      |      ^~~
../../libdrgn/btf.c:824:6: note: each undeclared identifier is reported only once for each function it appears in

Which I fixed by moving the variable declarations to the beginning of the function.

brenns10 commented 2 months ago

You can fix the undefined ctf_dict_open() issue by reverting commit 9639779db0fc004aa05499de612e444f12f6116c so that the build doesn't try to use CTF in the container. I haven't actually tried to do that in a while and I guess I need to revisit and debug that. You can also set CONFIGURE_FLAGS=--with-libctf=no for local builds.

Unfortunately libctf doesn't have pkg-config scripts, and every distro / system has slightly different linker flags, so it's a bit difficult to get it to work generally. Better to turn it off if you're just using BTF.

brenns10 commented 2 months ago

Force-pushed btf_2024 with fixes:

edit: and now I've fixed the test failures

oshaked1 commented 2 months ago

It works now, thanks!

brenns10 commented 2 months ago

Excellent! Your feedback has already been helpful, but I'd appreciate any more you have as you use it :)

oshaked1 commented 1 week ago

Hi just reporting a tiny issue - the load_btf() function fails when there are no modules (encountered this on WSL where there are no kernel modules):

  File "/usr/local/lib/python3.10/dist-packages/drgn/helpers/linux/btf.py", line 79, in load_btf
    module_finder = load_module_kallsyms(prog)
  File "/usr/local/lib/python3.10/dist-packages/drgn/helpers/linux/kallsyms.py", line 254, in load_module_kallsyms
    return SymbolIndex(all_symbols)
ValueError: symbol finder must contain at least one symbol

Everything works fine when ignoring the exception.

brenns10 commented 1 week ago

Great timing, you're in luck. That restriction is being removed in the upstreaming, #388. I fixed it Friday and it will be in the next version of the PR. I will rebase the CTF and BTF patch sets with those updates for the new revision as well.

oshaked1 commented 1 week ago

Awesome, I finally got around to integrating your BTF patch into my project so I may have some additional feedback in the upcoming days.

I found myself adding quite a few hardcoded types that are required by existing helpers so I will share them with you when I'm done.

brenns10 commented 1 week ago

That's great, I'm so glad you're finding this useful!

I'll be glad to incorporate any of the hard-coded types into the BTF branch. I'm still aiming to get the variable types included in the kernel BTF but improving the usability as it is now is a great temporary measure.

oshaked1 commented 1 week ago

Here are the hardcoded types I added that are needed for some existing helpers, with some logic for types that depend on specific configurations:

drgn.helpers.linux.btf.HARDCODED_TYPES["slab_kset"] = "struct kset *"
drgn.helpers.linux.btf.HARDCODED_TYPES["slab_caches"] = "struct list_head"
drgn.helpers.linux.btf.HARDCODED_TYPES["min_low_pfn"] = "unsigned long"
drgn.helpers.linux.btf.HARDCODED_TYPES["max_pfn"] = "unsigned long"
drgn.helpers.linux.btf.HARDCODED_TYPES["saved_command_line"] = "char *"
drgn.helpers.linux.btf.HARDCODED_TYPES["net_namespace_list"] = "struct list_head"

# We have CONFIG_SPARSEMEM_EXTREME, mem_section type is `struct mem_section **`
try:
    self.prog.function('sparse_index_alloc')
    drgn.helpers.linux.btf.HARDCODED_TYPES["mem_section"] = "struct mem_section **"
# We don't have CONFIG_SPARSEMEM_EXTREME, mem_section type is `struct mem_section[NR_SECTION_ROOTS][SECTIONS_PER_ROOT]`.
# We ignore the sizes as they are calculated in `linux_kernel_get_vmemmap_address`.
# TODO: make sure this works, this isn't tested.
except LookupError:
    drgn.helpers.linux.btf.HARDCODED_TYPES["mem_section"] = "struct mem_section[0][0]"

# Kernel >= 6.9, vmap_nodes is used
try:
    self.prog.symbol("vmap_nodes")
    drgn.helpers.linux.btf.HARDCODED_TYPES["vmap_nodes"] = "struct vmap_node *"
# Kernel < 6.9, vmap_area_list is used
except LookupError:
    drgn.helpers.linux.btf.HARDCODED_TYPES["vmap_area_list"] = "struct list_head"
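
The two try/except blocks above share one pattern: probe for a symbol or function, then pick an entry depending on whether the lookup raised LookupError. A small helper could factor that out. This is only a sketch; `choose_by_probe` is a hypothetical name, and the dict-based `lookup` below stands in for a real `prog.symbol` or `prog.function`.

```python
def choose_by_probe(lookup, probe_name, if_present, if_absent):
    """Return if_present when lookup(probe_name) succeeds, else if_absent.

    lookup is any callable that raises LookupError for unknown names.
    """
    try:
        lookup(probe_name)
    except LookupError:
        return if_absent
    return if_present


# Stand-in for a kernel that has vmap_nodes (hypothetical address).
known_symbols = {"vmap_nodes": 0xFFFFFFFF82000000}
lookup = known_symbols.__getitem__  # raises KeyError, a LookupError subclass

name, type_name = choose_by_probe(
    lookup,
    "vmap_nodes",
    if_present=("vmap_nodes", "struct vmap_node *"),
    if_absent=("vmap_area_list", "struct list_head"),
)
print(name, type_name)
```

The returned pair could then be stored with `HARDCODED_TYPES[name] = type_name`, keeping each configuration-dependent case to a single call.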

Incorporating variable types in the kernel BTF is a great idea, but it would still be valuable to have the essential hardcoded types in drgn for backward compatibility.

brenns10 commented 1 week ago

I will say that the hardcoded types may have some issues when we finally go to review & merge the BTF portion of the branch. Not sure that they will stand up to code review!

Worst case I'm sure we could put them into the contrib directory, but I just wanted to let you know that, while I'm happy to add these into the branch, they may not make the final cut.

oshaked1 commented 1 week ago

If it's a matter of just calling a function from the contrib directory then that's completely fine :)