opencontainers / runc

CLI tool for spawning and running containers according to the OCI specification
https://www.opencontainers.org/
Apache License 2.0

Proposal: bindmount init's procfs #1224

Open cyphar opened 7 years ago

cyphar commented 7 years ago

So, I was playing around with different tricks you can do with persistent namespaces and I noticed that you can bindmount /proc/[pid] directories:

% mount --bind /proc/1 /tmp/init
% cat /tmp/init/stat
1 (systemd) S 0 1 1 0 -1 4210944 91669 332185088 31 11021 346 382 571372 98303 20 0 1 0 2 127012864 1248 18446744073709551615 1 1 0 0 0 0 671173123 4096 1260 0 0 0 17 0 0 0 581 0 0 0 0 0 0 0 0 0 0
% mount --bind /proc/self /tmp/self
% ls /tmp/self
ls: cannot open directory '/tmp/self': No such process

This means there are two things that we can solve with this:

  1. We can manage containers from a different PID namespace, because the /proc/self/stat pseudo-file will generate the correct PID for us. I'm not sure what happens if you try to access it from a different namespace though (probably ESRCH).

  2. We can remove the initProcessTime magic with a simple check to see if /run/runc/[ctr]/init gives us ESRCH. This is a fool-proof method because we're pinning the kernel struct task's procfs entry.
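As a rough illustration of item 2, the liveness check could look something like the Python sketch below. It is not runc code. The function name `init_alive` is made up for this example, and it assumes that /run/runc/[ctr]/init is a bind mount of the container init's /proc/[pid] directory and that accessing its stat file fails with ESRCH once the process is gone, as the `ls` output above suggests.

```python
import errno
import os


def init_alive(pinned_dir):
    """Return True if the process pinned at pinned_dir still exists.

    pinned_dir is assumed to be a bind mount of /proc/[pid]
    (hypothetical layout, e.g. /run/runc/[ctr]/init).
    """
    try:
        # Either the open or the read may fail once the task is gone,
        # so both stay inside the try block.
        with open(os.path.join(pinned_dir, "stat"), "rb") as f:
            f.read()
    except OSError as e:
        if e.errno == errno.ESRCH:
            # "No such process": the pinned procfs entry is dead.
            return False
        # Anything else (ENOENT, EACCES, ...) is not evidence of death.
        raise
    return True
```

Errors other than ESRCH, for example ENOENT when the bind mount was never set up, are re-raised rather than treated as "dead", since they say nothing about the process itself.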

There's also some other cool stuff we could do.

crosbymichael commented 7 years ago

  1. I don't think that will work because you won't be able to do any syscalls on the pids.

  2. That is a really good idea for saving process information.

cyphar commented 7 years ago

@crosbymichael

I don't think that will work because you won't be able to do any syscalls on the pids.

What do you mean? If you read the pid out from /proc/self/stat, the kernel will use pid_vnr to compute the "correct" PID in your current namespace (assuming that there is an ancestral relationship with the namespaces). I believe you get ESRCH if you try to read /proc/self/stat for a process which isn't mapped in your namespace (I'll have to check though).
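For reference, pulling the PID out of a stat file (pinned or not) might look like the following Python sketch. The helper name `parse_stat` is hypothetical. The field positions (pid first, comm in parentheses, starttime as field 22) follow proc(5). The code only parses text: whatever PID translation happens is done by the kernel when the file is read.

```python
def parse_stat(line):
    """Parse (pid, comm, starttime) from the contents of a /proc/[pid]/stat file."""
    # comm may itself contain spaces or ')' characters, so split on the
    # LAST closing parenthesis rather than on whitespace.
    head, _, tail = line.rpartition(")")
    pid_str, _, comm = head.partition(" (")
    fields = tail.split()
    # fields[0] is field 3 (state), so field N lives at fields[N - 3].
    starttime = int(fields[22 - 3])
    return int(pid_str), comm, starttime


if __name__ == "__main__":
    # Leading fields of the stat line shown earlier in this thread.
    sample = ("1 (systemd) S 0 1 1 0 -1 4210944 91669 332185088 31 11021 "
              "346 382 571372 98303 20 0 1 0 2 127012864")
    print(parse_stat(sample))  # (1, 'systemd', 2)
```

Splitting on the last `)` is a deliberate choice here: a naive whitespace split would misplace every later field for a process whose name contains a space.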

crosbymichael commented 7 years ago

@cyphar Ya, I guess if you are reading from the host and the container namespaces are children. I thought you were talking about mapping it the other way, where the child had mounts to parent namespaces.