mountsnoop should be able to resolve symlink/magic-links

lukts30 commented 2 years ago

I was trying to debug which mount calls podman+crun make during container creation. I tried mountsnoop, but it was not helpful because the mount calls involve /proc/self/fd/N:

3               23490  23490  4026534442 mount("cgroup2", "/proc/self/fd/9", "cgroup2", MS_NOSUID|MS_NODEV|MS_NOEXEC|MS_REC|MS_RELATIME, "") = 0

With strace:

17805 openat2(8, "sys/fs/cgroup", {flags=O_RDONLY|O_CLOEXEC|O_PATH, resolve=RESOLVE_IN_ROOT}, 24) = 9
17805 mount("cgroup2", "/proc/self/fd/9", "cgroup2", 0, NULL) = 0

I expect more tools to adopt the newer openat2 in the future, so it would be an improvement if mountsnoop could print the real path. However, I do not know whether there is a BPF helper for realpath.

chenhengqi commented 2 years ago

Replace self with the PID in the output and you can get the real path.
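A minimal sketch of that rewrite, in Python, for post-processing mountsnoop output. The helper name proc_fd_path is hypothetical and not part of bcc. The resulting path can only be resolved (for example with os.readlink) while the traced process still holds the descriptor.

```python
import re

# Matches the magic-link form that shows up verbatim in the mount target.
_SELF_FD = re.compile(r"^/proc/self/fd/(\d+)$")


def proc_fd_path(target: str, pid: int) -> str:
    """Rewrite /proc/self/fd/N to /proc/<pid>/fd/N.

    Hypothetical helper: any other mount target is returned unchanged.
    """
    match = _SELF_FD.match(target)
    if match is None:
        return target
    return f"/proc/{pid}/fd/{match.group(1)}"


if __name__ == "__main__":
    # PID and target taken from the mountsnoop line quoted in this issue.
    print(proc_fd_path("/proc/self/fd/9", 23490))
```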

lukts30 commented 2 years ago

Do you mean that I can do that from within an eBPF program? Substituting self with the PID and then calling realpath from userspace is too racy, since the file descriptor gets recycled immediately after the mount.

chenhengqi commented 2 years ago

What is your use case?

lukts30 commented 2 years ago

I was trying to understand which mount calls podman+crun make during container startup. As far as I can tell, it makes a few dozen openat2 calls and then passes each file descriptor to mount via its procfs path.

for path in paths do:
    fd=openat(path)
    mount(..., "/proc/self/fd/N", ..., ..., ...)
    close(fd)

If I tried a realpath(/proc/self/fd/9) from userspace it would most likely point to a different path since after the first iteration the kernel is free to reassign file descriptor 9.
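The reuse can be reproduced in a few lines of Python on Linux. This is only an illustration of the race, not something podman or crun does: the function name and the file names a and b are made up, and it relies on the kernel handing out the lowest free descriptor number.

```python
import os
import tempfile


def resolve_fd(fd: int) -> str:
    """Resolve an fd of the current process through its procfs magic link."""
    return os.readlink(f"/proc/self/fd/{fd}")


def demo_fd_reuse():
    """Open two different files one after the other and report what each fd resolved to."""
    with tempfile.TemporaryDirectory() as tmp:
        path_a = os.path.join(tmp, "a")
        path_b = os.path.join(tmp, "b")
        for path in (path_a, path_b):
            open(path, "w").close()  # create empty files

        fd_a = os.open(path_a, os.O_RDONLY)
        seen_a = resolve_fd(fd_a)
        os.close(fd_a)  # the number is now free again

        # The next open gets the lowest free number, i.e. the one just closed.
        fd_b = os.open(path_b, os.O_RDONLY)
        seen_b = resolve_fd(fd_b)
        os.close(fd_b)
        return fd_a, seen_a, fd_b, seen_b


if __name__ == "__main__":
    print(demo_fd_reuse())
```

A tracer that resolves the link only after the first close would see the second file under the same number.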

For this specific use case, the bpftrace one-liner below works, since I know that the file descriptor originates from an openat2 call. But there are ways to acquire an fd without ever passing a path to the kernel through an open call (SCM_RIGHTS over Unix domain sockets, or pidfd_getfd).

sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat2 { printf("open: %s %s ", comm, str(args->filename)); } tracepoint:syscalls:sys_exit_openat2 { printf("fd=%d\n", args->ret); } tracepoint:syscalls:sys_enter_mount { printf("%s mount %s\n", comm, str(args->dir_name)); }'

chenhengqi commented 2 years ago

OK, I see.

I think you can create a new tool which combines the functionalities of opensnoop and mountsnoop.

lukts30 commented 2 years ago

> I think you can create a new tool which combines the functionalities of opensnoop and mountsnoop.

But does opensnoop, or any other tool, handle externally acquired file descriptors (e.g. via pidfd_getfd)? It seems likely that one could still correlate them, but the fd number would differ between the processes.

Process A: fd=open(path), open returns 9, Process A sends only the number 9 to Process B. Process B: pidfd_getfd(A,9), pidfd_getfd returns 7 and then does something that involves /proc/self/fd/7.

So in this case A and B hold file descriptors with different numbers, but both refer to the same open file in the kernel. lxd/lxc use something like this to intercept syscalls from unprivileged processes and redo them in a supervisor process.

The only solution that covers all these scenarios would be something in eBPF that resolves symlinks.

chenhengqi commented 2 years ago

You can trace both open and pidfd_getfd, and use a BPF map to store the path.
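The bookkeeping such a map would need can be modelled in plain Python. This is a sketch of the idea only, not bcc code: the class and method names are hypothetical, the key is assumed to be (pid, fd) with pid meaning the thread-group ID, the target of pidfd_getfd is assumed to be already translated from a pidfd to a PID, and dup, fork and SCM_RIGHTS are not handled. The stored path is whatever string the open call passed, not a canonical path.

```python
import re
from typing import Dict, Tuple

# /proc/self/fd/N or /proc/<pid>/fd/N as a mount target.
_PROC_FD = re.compile(r"^/proc/(self|\d+)/fd/(\d+)$")


class FdPathTable:
    """Userspace model of a BPF hash map keyed by (pid, fd)."""

    def __init__(self) -> None:
        self._paths: Dict[Tuple[int, int], str] = {}

    def on_open(self, pid: int, fd: int, path: str) -> None:
        # Called on exit of open/openat/openat2; negative fd means failure.
        if fd >= 0:
            self._paths[(pid, fd)] = path

    def on_pidfd_getfd(self, pid: int, new_fd: int,
                       target_pid: int, target_fd: int) -> None:
        # Copy the entry of the process the descriptor was taken from.
        path = self._paths.get((target_pid, target_fd))
        if new_fd >= 0 and path is not None:
            self._paths[(pid, new_fd)] = path

    def on_close(self, pid: int, fd: int) -> None:
        # Drop the entry so a recycled number cannot resolve to a stale path.
        self._paths.pop((pid, fd), None)

    def resolve_mount_target(self, pid: int, target: str) -> str:
        # Return the recorded path, or the target unchanged if unknown.
        match = _PROC_FD.match(target)
        if match is None:
            return target
        owner = pid if match.group(1) == "self" else int(match.group(1))
        return self._paths.get((owner, int(match.group(2))), target)
```

With the example from this thread, process A records fd 9 on open, process B inherits the entry as fd 7 through pidfd_getfd, and a mount by B on /proc/self/fd/7 resolves to the path A opened.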

iovisor / bcc

mountsnoop should be able to resolve symlink/magic-links #3842