iovisor / bcc

BCC - Tools for BPF-based Linux IO analysis, networking, monitoring, and more
Apache License 2.0
20.63k stars 3.89k forks source link

`bcc_elf_is_exe` should use `stat` instead of `access` to check execute permissions #5140

Open Xyene opened 2 weeks ago

Xyene commented 2 weeks ago

I think the implementation of bcc_elf_is_exe

https://github.com/iovisor/bcc/blob/d9eec93cfe46f1133aa8769f13b2550ca7edd08d/src/cc/bcc_elf.c#L1054-L1056

is subtly wrong, and should be using stat instead of access here.

For the single usage in bcc itself, the difference ~doesn't matter, but the current logic broke some usages downstream in bpftrace, until the bpftrace devs (accidentally) happened to remove the offending call in https://github.com/bpftrace/bpftrace/commit/19df2b1bf8fc018f7a0b4695473c36a5d8752980.

I did spend two hours tracking down what I think is a kernel bug (on top of a bcc bug) here, even if it's not super consequential, so I'm going to document it below to help the next person who Googles this.

Absent of the kernel bug, I think access might not semantically be what's desired here. From man 2 access:

This allows set-user-ID programs and capability-endowed programs
to easily determine the invoking user's authority.  In other
words, access() does not answer the "can I read/write/execute
this file?" question.  It answers a slightly different question:
"(assuming I'm a setuid binary) can the user who invoked me
read/write/execute this file?", which gives set-user-ID programs
the possibility to prevent malicious users from causing them to
read files which users shouldn't be able to read.

Now the kernel bug(?): I'm pretty sure that access on procfs is broken. Or at least, if working as intended, has undocumented and confusing semantics.

Assuming uid 5 is an unprivileged user, who is running a regular process with pid 223674, the following is true on my 6.1 kernel:

# Succeeds as root
>>> os.access('/proc/223674/exe', os.X_OK), os.getuid()
(True, 0)

# Fails as a regular user
>>> os.access('/proc/223674/exe', os.X_OK), os.getuid()
(False, 5)

There's nothing special about this file, and stat behaves as you'd expect it to:

$ stat /proc/245043/exe
  File: /proc/245043/exe -> /tmp/demos.exe
  Size: 0               Blocks: 0          IO Block: 1024   symbolic link
Device: 15h/21d Inode: 2052196     Links: 1
Access: (0777/lrwxrwxrwx)  Uid: (5/user)   Gid: (7/    tech)
Access: 2024-11-12 09:06:22.300926738 -0500
Modify: 2024-11-12 09:06:22.279926617 -0500
Change: 2024-11-12 09:06:22.279926617 -0500
 Birth: -

A trace of what access ends up executing before failing looks like:

image

In particular, in

https://github.com/torvalds/linux/blob/2d5404caa8c7bb5c4e0435f94b28834ae5456623/security/commoncap.c#L151-L153

ns_capable(..., CAP_SYS_PTRACE) falls through, so we EPERM. Except this is all pretty obviously wrong: we don't need CAP_SYS_PTRACE to dereference /proc/$pid/exe and execute it.

(This is where I stopped looking, because I discovered upstream bpftrace had already removed the call to bcc_elf_is_exe that was tickling this bug.)

yonghong-song commented 1 week ago

@Rtoax Could you help take a look?