hashicorp / nomad-driver-exec2

An official Nomad task driver plugin for sandboxing workloads using native Linux kernel features
Mozilla Public License 2.0
8 stars 0 forks source link

exit 126 error on plugin shim execution hard to debug #20

Open tgross opened 6 months ago

tgross commented 6 months ago

I've been testing the plugin out and haven't been able to execute any commands with busybox. I'm sure there's a configuration value or environment setup I'm missing, but this will also illustrate how challenging it is to debug issues. From the perspective of the job author, all I'm getting is:

Recent Events:
Time                       Type        Description
2024-05-10T14:36:41-04:00  Restarting  Task restarting in 16.579299116s
2024-05-10T14:36:41-04:00  Terminated  Exit Code: 126
2024-05-10T14:36:41-04:00  Started     Task started by client
2024-05-10T14:36:41-04:00  Task Setup  Building Task Directory
2024-05-10T14:36:41-04:00  Received    Task received by client

An example failing task config is:

config {
  command = "/usr/bin/busybox"
  args = ["nc", "-lp", "8001"]
}

The logs look like the following (allocID replaced by $alloc_id for legibility):

2024-05-10T13:55:26.318-0400 [TRACE] client.driver_mgr.nomad-driver-exec2: resources: driver=exec2 cpuset="" memory="1.048576e+08" memory_max=0 @module=exec2 compute=2040 timestamp=2024-05-10T13:55:26.318-0400 2024-05-10T13:55:26.318-0400 [INFO] client.driver_mgr.nomad-driver-exec2: exec2 runner: driver=exec2 @module=exec2 args=[nc -lp 8001] cmd=/usr/bin/busybox unveil_defaults=true unveil_paths=[rx:/usr/bin/busybox, rwxc:/run/nomad/dev/data/mounts/$alloc_id-http/local, rwxc:/run/nomad/dev/data/mounts/$alloc_id-http/alloc] timestamp=2024-05-10T13:55:26.318-0400 2024-05-10T13:55:26.319-0400 [TRACE] client.alloc_runner.task_runner: setting task state: alloc_id=$alloc_id task=http state=running 2024-05-10T13:55:26.319-0400 [INFO] client.alloc_runner.task_runner: Task event: alloc_id=$alloc_id task=http type=Started msg="Task started by client" failed=false ... 2024-05-10T13:55:26.319-0400 [TRACE] client.driver_mgr.nomad-driver-exec2: waiting on task: driver=exec2 @module=exec2 id=$alloc_id/http/f3e261f0 timestamp=2024-05-10T13:55:26.319-0400 2024-05-10T13:55:26.482-0400 [INFO] client.alloc_runner.task_runner: Task event: alloc_id=$alloc_id task=http type=Terminated msg="Exit Code: 126" failed=false

I used strace to follow the plugin process (ex. sudo strace -o /tmp/strace.log -f -p $pid -e trace=execve -v -s 256) and see the following relevant execve calls. The last one of which is where it's failing when we try to re-exec as the shim:

425822 execve("/usr/bin/nsenter", ["nsenter", "--no-fork", "--net=/var/run/netns/$alloc_id", "--", "unshare", "--ipc", "--pid", "--mount-proc", "--fork", "--kill-child=SIGKILL", "--setuid=86799", "--setgid=86799", "--", "/home/tim/ws/nomad/etc/plugins/nomad-driver-exec2", "exec2-shim", "true", "/run/nomad/dev/data/mounts/$alloc_id-http/alloc/logs/.http.stdout.fifo", "/run/nomad/dev/data/mounts/$alloc_id-http/alloc/logs/.http.stderr.fifo", "rwxc:/run/nomad/dev/data/mounts/$alloc_id-http/local", "rwxc:/run/nomad/dev/data/mounts/$alloc_id-http/alloc", "--", "/usr/bin/busybox", "nc", "-lp", "8001"], ["USER=nomad-86799", ... environment variables redacted ...]) = 0 ... 425822 execve("/usr/bin/unshare", ["unshare", "--ipc", "--pid", "--mount-proc", "--fork", "--kill-child=SIGKILL", "--setuid=86799", "--setgid=86799", "--", "/home/tim/ws/nomad/etc/plugins/nomad-driver-exec2", "exec2-shim", "true", "/run/nomad/dev/data/mounts/$alloc_id-http/alloc/logs/.http.stdout.fifo", "/run/nomad/dev/data/mounts/$alloc_id-http/alloc/logs/.http.stderr.fifo", "rwxc:/run/nomad/dev/data/mounts/$alloc_id-http/local", "rwxc:/run/nomad/dev/data/mounts/$alloc_id-http/alloc", "--", "/usr/bin/busybox", "nc", "-lp", "8001"], ["USER=nomad-86799", ... environment variables redacted ...]) = 0 ... 425856 execve("/home/tim/ws/nomad/etc/plugins/nomad-driver-exec2", ["/home/tim/ws/nomad/etc/plugins/nomad-driver-exec2", "exec2-shim", "true", "/run/nomad/dev/data/mounts/$alloc_id-http/alloc/logs/.http.stdout.fifo", "/run/nomad/dev/data/mounts/$alloc_id-http/alloc/logs/.http.stderr.fifo", "rwxc:/run/nomad/dev/data/mounts/$alloc_id-http/local", "rwxc:/run/nomad/dev/data/mounts/$alloc_id-http/alloc", "--", "/usr/bin/busybox", "nc", "-lp", "8001"], ["USER=nomad-86799", ... environment variables redacted ...]) = -1 EACCES (Permission denied) 425856 +++ exited with 126 +++

I can run commands like sleep or env via this plugin without trouble, so the basic environment seems to be fine. As expected, my kernel environment looks good:

$ uname -a
Linux continuity 5.15.0-106-generic #116-Ubuntu SMP Wed Apr 17 09:17:56 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

$ journalctl -kg landlock
May 10 08:21:02 continuity kernel: landlock: Up and running.

The plugin and busybox binaries are executable and the dynamic libraries that busybox needs should fall into the default unveil paths:

$ ls -lah ~/ws/nomad/etc/plugins/nomad-driver-exec2
-rwxrwxr-x 1 tim tim 25M May 10 14:39 /home/tim/ws/nomad/etc/plugins/nomad-driver-exec2

$ ls -lah /usr/bin/busybox
-rwxr-xr-x 1 root root 707K Feb  4  2022 /usr/bin/busybox

$ ldd $(which busybox)
        linux-vdso.so.1 (0x00007ffe505de000)
        libresolv.so.2 => /lib/x86_64-linux-gnu/libresolv.so.2 (0x00007f0824ae0000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f08248b7000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f0824bbd000)

My configuration is as follows:

plugin "nomad-driver-exec2" {
  config {
    unveil_defaults = true
    unveil_paths    = []
    unveil_by_task  = true
  }
}

Although I also tried explicitly granting access to each item needed:

plugin "nomad-driver-exec2" {
  config {
    unveil_defaults = true
    unveil_paths    = [
      "rx:/lib/x86_64-linux-gnu/libresolv.so.2",
      "rx:/lib/x86_64-linux-gnu/libc.so.6",
      "rx:/lib64/ld-linux-x86-64.so.2",
      "rx:/usr/bin/busybox",
    ]
    unveil_by_task  = true
  }
}
tgross commented 6 months ago

As I noted in our sidebar discussion, the issue here is that the ephemeral user didn't have access to my homedir, which is where I had the plugin binary for testing. Nothing to do here except maybe update docs.

schmichael commented 1 month ago

Reopening because there is a bug! The bug is that the driver does not capture any output from unshare and nsenter:

https://github.com/hashicorp/nomad-driver-exec2/blob/v0.1.0/pkg/shim/shim.go#L346-L347

The comment event says:

// nsenter and unshare do not log

But they do log on errors like this! If you run the command manually you'll see the unshare output to stderr:

$ sudo unshare --ipc --pid --mount-proc --fork --kill-child=SIGKILL --setuid=84118 --setgid=84118 -- /opt/nomad/plugins/nomad-driver-exec2 exec2-shim true /tmp/alloc_mounts/8a500d41-e2be-875e-d906-5117d0355a2b-exec2/alloc/logs/.exec2.stdout.fifo /tmp/alloc_mounts/8a500d41-e2be-875e-d906-5117d0355a2b-exec2/alloc/logs/.exec2.stderr.fifo rwxc:/tmp/alloc_mounts/8a500d41-e2be-875e-d906-5117d0355a2b-exec2/local rwxc:/tmp/alloc_mounts/8a500d41-e2be-875e-d906-5117d0355a2b-exec2/alloc rwxc:/tmp/alloc_mounts/8a500d41-e2be-875e-d906-5117d0355a2b-exec2/secrets rwxc:/tmp/alloc_mounts/8a500d41-e2be-875e-d906-5117d0355a2b-exec2/tmp -- /bin/bash -c "echo 'it worked'; sleep 10" || echo $?

unshare: failed to execute /opt/nomad/plugins/nomad-driver-exec2: Permission denied
126

This can be captured and sent to the task log fifo. PoC PR incoming.