google / gvisor

Application Kernel for Containers
https://gvisor.dev
Apache License 2.0

Runtime fails to mount /sys when --tpuproxy is provided #10795

Closed pawalt closed 1 month ago

pawalt commented 2 months ago

Description

I'm testing out TPU support with the runsc docker shim. When I use runsc normally, everything works fine, but when used with --tpuproxy, it fails to mount /sys. This is surprising to me because the mount is definitely there.

cc @thundergolfer

Steps to reproduce

I've configured docker to use my custom runsc script:

peyton@t1v-n-901fc2b8-w-0:~/tputesting$ cat /etc/docker/daemon.json
{
    "bip": "169.254.123.1/24",
    "runtimes": {
        "runsc": {
           "path": "/home/peyton/tputesting/runsc-wrapper.sh"
        }
    }
}
peyton@t1v-n-901fc2b8-w-0:~/tputesting$ cat runsc-wrapper.sh 
#!/bin/bash

exec /usr/local/bin/runsc --tpuproxy "$@"
peyton@t1v-n-901fc2b8-w-0:~/tputesting$ docker run --rm --runtime=runsc busybox echo "Hello from busybox"
docker: Error response from daemon: OCI runtime start failed: starting container: starting root container: starting sandbox: failed to setupFS: mounting submounts: mount submount "/sys": failed to mount "/sys" (type: sysfs): no such file or directory, opts: &{{true false false false} true {true  0xc000871710} false}: unknown.

This happens despite the mount existing:

peyton@t1v-n-901fc2b8-w-0:~/tputesting$ mount | grep sys
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755,inode64)
cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,name=systemd)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
efivarfs on /sys/firmware/efi/efivars type efivarfs (rw,nosuid,nodev,noexec,relatime)
none on /sys/fs/bpf type bpf (rw,nosuid,nodev,noexec,relatime,mode=700)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=28,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=24590)
debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime)
tracefs on /sys/kernel/tracing type tracefs (rw,nosuid,nodev,noexec,relatime)
fusectl on /sys/fs/fuse/connections type fusectl (rw,nosuid,nodev,noexec,relatime)
configfs on /sys/kernel/config type configfs (rw,nosuid,nodev,noexec,relatime)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,nosuid,nodev,noexec,relatime)

I'm using a v5lite tpu:

TPU type:              v5litepod-1
TPU software version:  tpu-vm-base

runsc version

peyton@t1v-n-901fc2b8-w-0:~/tputesting$ runsc -version
runsc version release-20240807.0
spec: 1.1.0-rc.1

docker version (if using docker)

peyton@t1v-n-901fc2b8-w-0:~/tputesting$ docker version
Client: Docker Engine - Community
 Version:           20.10.16
 API version:       1.41
 Go version:        go1.17.10
 Git commit:        aa7e414
 Built:             Thu May 12 09:17:23 2022
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.16
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.17.10
  Git commit:       f756502
  Built:            Thu May 12 09:15:28 2022
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.4
  GitCommit:        212e8b6fa2f44b9c21b2798135fc6fb7c53efc16
 runc:
  Version:          1.1.1
  GitCommit:        v1.1.1-0-g52de29d
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

uname

peyton@t1v-n-901fc2b8-w-0:~/tputesting$ uname -a
Linux t1v-n-901fc2b8-w-0 5.13.0-1027-gcp #32~20.04.1-Ubuntu SMP Thu May 26 10:53:08 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

runsc debug logs (if available)

I0819 18:02:32.801671   23344 main.go:201] **************** gVisor ****************
D0819 18:02:32.801687   23344 state_file.go:77] Load container, rootDir: "/var/run/docker/runtime-runc/moby", id: {SandboxID: ContainerID:a7ffa53188c9b460eb73a41146e8654bb60532ec9c88cbb8769055edf2003ed0}, opts: {Exact:false SkipCheck:false TryLock:false RootContainer:false}
D0819 18:02:32.802692   23344 sandbox.go:1891] ContainerRuntimeState, sandbox: "a7ffa53188c9b460eb73a41146e8654bb60532ec9c88cbb8769055edf2003ed0", cid: "a7ffa53188c9b460eb73a41146e8654bb60532ec9c88cbb8769055edf2003ed0"
D0819 18:02:32.802708   23344 sandbox.go:688] Connecting to sandbox "a7ffa53188c9b460eb73a41146e8654bb60532ec9c88cbb8769055edf2003ed0"
D0819 18:02:32.802767   23344 urpc.go:571] urpc: successfully marshalled 124 bytes.
D0819 18:02:32.803072   23344 urpc.go:614] urpc: unmarshal success.
D0819 18:02:32.803098   23344 sandbox.go:1896] ContainerRuntimeState, sandbox: "a7ffa53188c9b460eb73a41146e8654bb60532ec9c88cbb8769055edf2003ed0", cid: "a7ffa53188c9b460eb73a41146e8654bb60532ec9c88cbb8769055edf2003ed0", state: 1
W0819 18:02:32.803327   23344 specutils.go:123] AppArmor profile "docker-default" is being ignored
W0819 18:02:32.803340   23344 specutils.go:129] noNewPrivileges ignored. PR_SET_NO_NEW_PRIVS is assumed to always be set.
D0819 18:02:32.803350   23344 container.go:427] Start container, cid: a7ffa53188c9b460eb73a41146e8654bb60532ec9c88cbb8769055edf2003ed0
D0819 18:02:32.803364   23344 sandbox.go:394] Start root sandbox "a7ffa53188c9b460eb73a41146e8654bb60532ec9c88cbb8769055edf2003ed0", PID: 23277
D0819 18:02:32.803370   23344 sandbox.go:688] Connecting to sandbox "a7ffa53188c9b460eb73a41146e8654bb60532ec9c88cbb8769055edf2003ed0"
I0819 18:02:32.803393   23344 network.go:55] Setting up network
I0819 18:02:32.803452   23344 namespace.go:108] Applying namespace network at path "/proc/23277/ns/net"
D0819 18:02:32.803721   23344 network.go:300] Setting up network channels
D0819 18:02:32.803737   23344 network.go:303] Creating Channel 0
D0819 18:02:32.803755   23344 network.go:334] Setting up network, config: {FilePayload:{Files:[0xc000350e00]} LoopbackLinks:[{Name:lo Addresses:[127.0.0.1/8] Routes:[{Destination:{IP:127.0.0.0 Mask:ff000000} Gateway:<nil>}] GVisorGRO:false}] FDBasedLinks:[{Name:eth0 InterfaceIndex:0 MTU:1500 Addresses:[169.254.123.3/24] Routes:[{Destination:{IP:169.254.123.0 Mask:ffffff00} Gateway:<nil>}] GSOMaxSize:65536 GVisorGSOEnabled:false GVisorGRO:false TXChecksumOffload:false RXChecksumOffload:true LinkAddress:02:42:a9:fe:7b:03 QDisc:fifo Neighbors:[] NumChannels:1 ProcessorsPerChannel:0}] XDPLinks:[] Defaultv4Gateway:{Route:{Destination:{IP:0.0.0.0 Mask:00000000000000000000ffff00000000} Gateway:169.254.123.1} Name:eth0} Defaultv6Gateway:{Route:{Destination:{IP:<nil> Mask:<nil>} Gateway:<nil>} Name:} PCAP:false LogPackets:false NATBlob:false DisconnectOk:false}
D0819 18:02:32.803914   23344 urpc.go:571] urpc: successfully marshalled 946 bytes.
D0819 18:02:32.805693   23344 urpc.go:614] urpc: unmarshal success.
I0819 18:02:32.805706   23344 namespace.go:129] Restoring namespace network
D0819 18:02:32.805727   23344 urpc.go:571] urpc: successfully marshalled 112 bytes.
D0819 18:02:32.818442   23344 urpc.go:614] urpc: unmarshal success.
W0819 18:02:32.818469   23344 util.go:64] FATAL ERROR: starting container: starting root container: starting sandbox: failed to setupFS: mounting submounts: mount submount "/sys": failed to mount "/sys" (type: sysfs): no such file or directory, opts: &{{true false false false} true {true  0xc00071bad0} false}
manninglucas commented 2 months ago

Hi Peyton, thanks for reporting this bug. The error you're seeing is likely happening while building the internal gVisor sysfs, not because /sys doesn't exist on the host. When --tpuproxy is enabled, the sandbox builds a mirror of the host PCI directories located in sysfs. The userspace TPU driver relies on the presence of these files to get information about the TPU hardware (version, topology, etc.) running on the host. Can you show me what you get when you run ls -l /sys/bus/pci/devices in your VM?
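
For illustration only (this is not the sentry code), a minimal Go sketch of reading the host PCI device symlinks that such a mirror would be built from; it prints roughly the same information as the ls -l requested above:

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	const pciDevices = "/sys/bus/pci/devices"
	entries, err := os.ReadDir(pciDevices)
	if err != nil {
		// A missing or unreachable directory here is the kind of failure
		// that would surface as ENOENT while building the mirror.
		fmt.Fprintln(os.Stderr, "reading PCI devices:", err)
		os.Exit(1)
	}
	for _, e := range entries {
		// Each entry is a symlink like 0000:00:04.0 -> ../../../devices/pci0000:00/0000:00:04.0.
		target, err := os.Readlink(filepath.Join(pciDevices, e.Name()))
		if err != nil {
			target = "<unreadable>"
		}
		fmt.Printf("%s -> %s\n", e.Name(), target)
	}
}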

milantracy commented 2 months ago

IIRC, you can't run tpuproxy via exec /usr/local/bin/runsc --tpuproxy "$@".

A similar setup works for nvproxy because nvidia-container-runtime is directly compatible with the --gpus flag implemented by the docker CLI.

That hasn't been implemented for tpuproxy, so TPU devices are not accessible in your docker container.

pawalt commented 2 months ago

@manninglucas sure thing here it is:

peyton@t1v-n-901fc2b8-w-0:~/tputesting$ ls -l /sys/bus/pci/devices
total 0
lrwxrwxrwx 1 root root 0 Aug 19 16:27 0000:00:00.0 -> ../../../devices/pci0000:00/0000:00:00.0
lrwxrwxrwx 1 root root 0 Aug 19 16:27 0000:00:01.0 -> ../../../devices/pci0000:00/0000:00:01.0
lrwxrwxrwx 1 root root 0 Aug 19 16:27 0000:00:01.3 -> ../../../devices/pci0000:00/0000:00:01.3
lrwxrwxrwx 1 root root 0 Aug 19 16:27 0000:00:03.0 -> ../../../devices/pci0000:00/0000:00:03.0
lrwxrwxrwx 1 root root 0 Aug 19 16:27 0000:00:04.0 -> ../../../devices/pci0000:00/0000:00:04.0
lrwxrwxrwx 1 root root 0 Aug 19 16:27 0000:00:05.0 -> ../../../devices/pci0000:00/0000:00:05.0
lrwxrwxrwx 1 root root 0 Aug 19 16:27 0000:00:06.0 -> ../../../devices/pci0000:00/0000:00:06.0
lrwxrwxrwx 1 root root 0 Aug 19 16:27 0000:00:07.0 -> ../../../devices/pci0000:00/0000:00:07.0

@milantracy Does this mean we can't use --tpuproxy at all with the Docker shim, or is there just some other way I need to invoke it? And if it's not possible, I assume it should work OK if I invoke runsc raw?

milantracy commented 2 months ago

afaik, --tpuproxy doesn't work with the docker shim. cc: @manninglucas

I tried raw runsc in the TPU v5e VM, which worked fine for me. Let me know how it goes for you.

pawalt commented 2 months ago

@milantracy would you mind sharing the command you're using to start runsc? I'm still having no luck using runsc do:

#/bin/bash

sudo runsc --tpuproxy --root=/home/peyton/tputesting/runroot do --root=/home/peyton/tputesting/jax-rootfs -- env -u LD_PRELOAD /bin/bash
peyton@t1v-n-901fc2b8-w-0:~/tputesting$ ./start.sh 
starting container: starting root container: starting sandbox: failed to setupFS: mounting submounts: mount submount "/sys": failed to mount "/sys" (type: sysfs): no such file or directory, opts: &{{false false false false} false {true  0xc000620990} false}

EDIT: I'm also getting the same behavior with runsc run:

peyton@t1v-n-901fc2b8-w-0:~/tputesting$ cat start.sh 
#/bin/bash

sudo runsc --root=/home/peyton/tputesting/runroot --tpuproxy run --bundle=/home/peyton/tputesting my-jax-container
peyton@t1v-n-901fc2b8-w-0:~/tputesting$ ./start.sh 
running container: starting container: starting root container: starting sandbox: failed to setupFS: mounting submounts: mount submount "/sys": failed to mount "/sys" (type: sysfs): no such file or directory, opts: &{{true false false false} true {true  0xc0005a2420} false}

And my config.json:

{
    "ociVersion": "1.0.2",
    "process": {
        "terminal": true,
        "user": {
            "uid": 0,
            "gid": 0
        },
        "args": [
            "/bin/sh"
        ],
        "env": [
            "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
            "LANG=C.UTF-8",
            "PYTHONUNBUFFERED=1"
        ],
        "cwd": "/"
    },
    "root": {
        "path": "jax-rootfs",
        "readonly": false
    },
    "hostname": "jax-container",
    "mounts": [
        {
            "destination": "/proc",
            "type": "proc",
            "source": "proc"
        },
        {
            "destination": "/dev",
            "type": "tmpfs",
            "source": "tmpfs",
            "options": [
                "nosuid",
                "strictatime",
                "mode=755",
                "size=65536k"
            ]
        }
    ],
    "linux": {
        "namespaces": [
            {
                "type": "pid"
            },
            {
                "type": "network"
            },
            {
                "type": "ipc"
            },
            {
                "type": "uts"
            },
            {
                "type": "mount"
            }
        ]
    }
}
pawalt commented 2 months ago

I've managed to track down the error to this function: https://github.com/google/gvisor/blob/e0643b8ed582cc549272e7788860a5dd4636c06d/pkg/sentry/fsimpl/sys/pci.go#L223

Specifically, this function fails when passed /sys/devices and returns ENOENT, despite the directory definitely existing (a generic debugging sketch for this is at the end of this comment):

peyton@t1v-n-901fc2b8-w-0:~/gvisor$ ls -alh /sys
total 4.0K
dr-xr-xr-x  13 root root    0 Aug 20 16:23 .
drwxr-xr-x  19 root root 4.0K Aug 20 16:24 ..
drwxr-xr-x   2 root root    0 Aug 20 16:23 block
drwxr-xr-x  40 root root    0 Aug 20 16:23 bus
drwxr-xr-x  68 root root    0 Aug 20 16:23 class
drwxr-xr-x   4 root root    0 Aug 20 16:23 dev
drwxr-xr-x  15 root root    0 Aug 20 16:23 devices
drwxr-xr-x   6 root root    0 Aug 20 16:23 firmware
drwxr-xr-x   9 root root    0 Aug 20 16:23 fs
drwxr-xr-x   2 root root    0 Aug 20 19:22 hypervisor
drwxr-xr-x  16 root root    0 Aug 20 16:23 kernel
drwxr-xr-x 152 root root    0 Aug 20 16:23 module
drwxr-xr-x   3 root root    0 Aug 20 19:22 power

I'll continue to investigate why this directory can't be opened.
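
Not gVisor code, just a generic way to narrow down an ENOENT like this: stat every prefix of the path from the failing process's point of view and see which component is actually absent in that mount view.

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// statChain stats each prefix of path so a surprising ENOENT can be pinned
// to the component that is missing in the current mount namespace.
func statChain(path string) {
	parts := strings.Split(filepath.Clean(path), string(os.PathSeparator))
	cur := string(os.PathSeparator)
	for _, p := range parts {
		if p == "" {
			continue
		}
		cur = filepath.Join(cur, p)
		if _, err := os.Stat(cur); err != nil {
			fmt.Printf("%-20s MISSING: %v\n", cur, err)
			return
		}
		fmt.Printf("%-20s ok\n", cur)
	}
}

func main() {
	statChain("/sys/devices")
}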

pawalt commented 2 months ago

It seems to me that there's something wrong with the mount. When I look inside the sandbox's namespace, /sys does not exist, but I expect it to:

peyton@t1v-n-901fc2b8-w-0:~/gvisor$ sudo ls /proc/75785/root/
etc  proc

I'm not sure where to go from here - any pointers on what this should look like would be appreciated.
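
For anyone following along, a rough way (run as root) to see what a sandbox process's mount namespace actually contains; the PID is whatever runsc-sandbox reports, 75785 in the example above:

package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	pid := "75785" // hypothetical: replace with the runsc-sandbox PID

	// Entries visible at the process's root (equivalent to `ls /proc/<pid>/root/`).
	root := filepath.Join("/proc", pid, "root")
	if entries, err := os.ReadDir(root); err == nil {
		for _, e := range entries {
			fmt.Println("root entry:", e.Name())
		}
	} else {
		fmt.Fprintln(os.Stderr, err)
	}

	// Mounts as seen by that process's mount namespace.
	f, err := os.Open(filepath.Join("/proc", pid, "mountinfo"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fmt.Println("mount:", sc.Text())
	}
}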

milantracy commented 2 months ago

It has been a while since I last did this; I will share the runsc command with you later.

> @milantracy would you mind sharing the command you're using to start runsc? I'm still having no luck using runsc do: […]

Also, can you share with me what the /sys directory looks like in the container?

pawalt commented 2 months ago

@milantracy When I don't pass --tpuproxy, this is what it looks like:

peyton@t1v-n-901fc2b8-w-0:~/tputesting$ ./start.sh 
Child PID: 82314
Press Enter to continue...

# ls -alh /sys
total 0
drwxr-xr-x 12 root root  0 Aug 20 21:47 .
drwxrwxr-x  2 2004 2004 60 Aug 20 21:47 ..
drwxr-xr-x  2 root root  0 Aug 20 21:47 block
drwxr-xr-x  2 root root  0 Aug 20 21:47 bus
drwxr-xr-x  4 root root  0 Aug 20 21:47 class
drwxr-xr-x  2 root root  0 Aug 20 21:47 dev
drwxr-xr-x  4 root root  0 Aug 20 21:47 devices
drwxr-xr-x  2 root root  0 Aug 20 21:47 firmware
drwxr-xr-x  3 root root  0 Aug 20 21:47 fs
drwxr-xr-x  2 root root  0 Aug 20 21:47 kernel
drwxr-xr-x  2 root root  0 Aug 20 21:47 module
drwxr-xr-x  2 root root  0 Aug 20 21:47 power

When I do pass --tpuproxy, then /sys never gets mounted, so it doesn't exist.

manninglucas commented 2 months ago

When I spin up a cluster in GKE and run with tpuproxy, this is the sandbox spec that gets used. I would try to copy this spec wrt the mounts and devices sections specifically and see how that works. I see you don't have a /sys mount in your config.json. You may need to add a /sys mount specifically in the spec to get it working properly.

{
  "ociVersion": "1.1.0",
  "process": {
    "user": {
      "uid": 0,
      "gid": 0,
      "additionalGids": [
        0
      ]
    },
    "args": [
      "bash",
      "-c",
      "python -c 'import jax; print(\"TPU cores:\", jax.device_count())'"
    ],
    "env": [
      "PATH=/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
      "HOSTNAME=tpu-gvisor",
      "LANG=C.UTF-8",
      "GPG_KEY=A035C8C19219BA821ECEA86B64E628F8D684696D",
      "PYTHON_VERSION=3.10.14",
      "PYTHON_PIP_VERSION=23.0.1",
      "PYTHON_SETUPTOOLS_VERSION=65.5.1",
      "PYTHON_GET_PIP_URL=https://github.com/pypa/get-pip/raw/e03e1607ad60522cf34a92e834138eb89f57667c/public/get-pip.py",
      "PYTHON_GET_PIP_SHA256=ee09098395e42eb1f82ef4acb231a767a6ae85504a9cf9983223df0a7cbd35d7",
      "TPU_SKIP_MDS_QUERY=true",
      "TPU_TOPOLOGY=2x2x1",
      "ALT=false",
      "TPU_HOST_BOUNDS=1,1,1",
      "HOST_BOUNDS=1,1,1",
      "TPU_RUNTIME_METRICS_PORTS=8431,8432,8433,8434",
      "CHIPS_PER_HOST_BOUNDS=2,2,1",
      "TPU_CHIPS_PER_HOST_BOUNDS=2,2,1",
      "TPU_WORKER_ID=0",
      "TPU_WORKER_HOSTNAMES=localhost",
      "TPU_ACCELERATOR_TYPE=v5p-8",
      "WRAP=false,false,false",
      "TPU_TOPOLOGY_WRAP=false,false,false",
      "TPU_TOPOLOGY_ALT=false",
      "KUBERNETES_PORT_443_TCP_ADDR=34.118.224.1",
      "KUBERNETES_SERVICE_HOST=34.118.224.1",
      "KUBERNETES_SERVICE_PORT=443",
      "KUBERNETES_SERVICE_PORT_HTTPS=443",
      "KUBERNETES_PORT=tcp://34.118.224.1:443",
      "KUBERNETES_PORT_443_TCP=tcp://34.118.224.1:443",
      "KUBERNETES_PORT_443_TCP_PROTO=tcp",
      "KUBERNETES_PORT_443_TCP_PORT=443"
    ],
    "cwd": "/",
    "apparmorProfile": "cri-containerd.apparmor.d",
    "oomScoreAdj": 1000
  },
  "root": {
    "path": "/run/containerd/io.containerd.runtime.v2.task/k8s.io/27ee155751166dcc9569871355a0f90babbd37f94b11a8879f83757597e7da00/rootfs"
  },
  "mounts": [
    {
      "destination": "/proc",
      "type": "proc",
      "source": "/run/containerd/io.containerd.runtime.v2.task/k8s.io/27ee155751166dcc9569871355a0f90babbd37f94b11a8879f83757597e7da00/proc",
      "options": [
        "nosuid",
        "noexec",
        "nodev"
      ]
    },
        {
      "destination": "/dev",
      "type": "tmpfs",
      "source": "/run/containerd/io.containerd.runtime.v2.task/k8s.io/27ee155751166dcc9569871355a0f90babbd37f94b11a8879f83757597e7da00/tmpfs",
      "options": [
        "nosuid",
        "strictatime",
        "mode=755",
        "size=65536k"
      ]
    },
    {
      "destination": "/dev/pts",
      "type": "devpts",
      "source": "/run/containerd/io.containerd.runtime.v2.task/k8s.io/27ee155751166dcc9569871355a0f90babbd37f94b11a8879f83757597e7da00/devpts",
      "options": [
        "nosuid",
        "noexec",
        "newinstance",
        "ptmxmode=0666",
        "mode=0620",
        "gid=5"
      ]
    },
    {
      "destination": "/dev/mqueue",
      "type": "mqueue",
      "source": "/run/containerd/io.containerd.runtime.v2.task/k8s.io/27ee155751166dcc9569871355a0f90babbd37f94b11a8879f83757597e7da00/mqueue",
      "options": [
        "nosuid",
        "noexec",
        "nodev"
      ]
    },
    {
      "destination": "/sys",
      "type": "sysfs",
      "source": "/run/containerd/io.containerd.runtime.v2.task/k8s.io/27ee155751166dcc9569871355a0f90babbd37f94b11a8879f83757597e7da00/sysfs",
      "options": [
        "nosuid",
        "noexec",
        "nodev",
        "ro"
      ]
    },
    {
      "destination": "/sys/fs/cgroup",
      "type": "cgroup",
      "source": "/run/containerd/io.containerd.runtime.v2.task/k8s.io/27ee155751166dcc9569871355a0f90babbd37f94b11a8879f83757597e7da00/cgroup",
      "options": [
        "nosuid",
        "noexec",
        "nodev",
        "relatime",
        "ro"
      ]
    },
    {
      "destination": "/etc/hosts",
      "type": "bind",
      "source": "/var/lib/kubelet/pods/0c23b742-a930-45e1-80d3-2b358141671e/etc-hosts",
      "options": [
        "rbind",
        "rprivate",
        "rw"
      ]
    },
    {
      "destination": "/etc/hostname",
      "type": "bind",
      "source": "/var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/d711f9789021bb54a955a5c4155b796a79f508db3d9376fafc814ee91a0ce560/hostname",
      "options": [
        "rbind",
        "rprivate",
        "rw"
      ]
    },
    {
      "destination": "/etc/resolv.conf",
      "type": "bind",
      "source": "/var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/d711f9789021bb54a955a5c4155b796a79f508db3d9376fafc814ee91a0ce560/resolv.conf",
      "options": [
        "rbind",
        "rprivate",
        "rw"
      ]
    },
    {
      "destination": "/dev/shm",
      "type": "tmpfs",
      "source": "/run/containerd/io.containerd.grpc.v1.cri/sandboxes/d711f9789021bb54a955a5c4155b796a79f508db3d9376fafc814ee91a0ce560/shm",
      "options": [
        "rprivate",
        "rw"
      ]
    },
    {
      "destination": "/run/secrets/kubernetes.io/serviceaccount",
      "type": "bind",
      "source": "/var/lib/kubelet/pods/0c23b742-a930-45e1-80d3-2b358141671e/volumes/kubernetes.io~projected/kube-api-access-9fzvk",
      "options": [
        "rbind",
        "rprivate",
        "ro"
      ]
    }
  ],
  "annotations": {
    "dev.gvisor.flag.debug": "true",
    "dev.gvisor.flag.debug-log": "/tmp/runsc/",
    "dev.gvisor.flag.panic-log": "/tmp/runsc/panic.log",
    "dev.gvisor.flag.strace": "true",
    "dev.gvisor.internal.tpuproxy": "true",
    "io.kubernetes.cri.container-name": "tpu-gvisor",
    "io.kubernetes.cri.container-type": "container",
    "io.kubernetes.cri.image-name": "gcr.io/gvisor-presubmit/tpu/jax_x86_64:latest",
    "io.kubernetes.cri.sandbox-id": "d711f9789021bb54a955a5c4155b796a79f508db3d9376fafc814ee91a0ce560",
    "io.kubernetes.cri.sandbox-name": "tpu-gvisor",
    "io.kubernetes.cri.sandbox-namespace": "default",
    "io.kubernetes.cri.sandbox-uid": "0c23b742-a930-45e1-80d3-2b358141671e"
  },
  "linux": {
    "uidMappings": [
      {
             "containerID": 0,
        "hostID": 0,
        "size": 4294967295
      }
    ],
    "gidMappings": [
      {
        "containerID": 0,
        "hostID": 0,
        "size": 4294967295
      }
    ],
    "resources": {
      "memory": {},
      "cpu": {
        "shares": 2,
        "period": 100000
      },
      "unified": {
        "memory.oom.group": "1",
        "memory.swap.max": "0"
      }
    },
    "cgroupsPath": "kubepods-besteffort-pod0c23b742_a930_45e1_80d3_2b358141671e.slice:cri-containerd:27ee155751166dcc9569871355a0f90babbd37f94b11a8879f83757597e7da00",
    "namespaces": [
      {
        "type": "pid"
      },
      {
        "type": "ipc",
        "path": "/proc/8016/ns/ipc"
      },
      {
        "type": "uts",
        "path": "/proc/8016/ns/uts"
      },
      {
        "type": "mount"
      },
      {
        "type": "network",
        "path": "/proc/8016/ns/net"
      },
      {
        "type": "cgroup"
      },
      {
        "type": "user"
      }
    ],
    "devices": [
      {
        "path": "/dev/vfio/2",
        "type": "c",
        "major": 245,
        "minor": 1,
        "fileMode": 438,
        "uid": 0,
        "gid": 0
      },
      {
        "path": "/dev/vfio/3",
        "type": "c",
        "major": 245,
        "minor": 0,
        "fileMode": 438,
        "uid": 0,
        "gid": 0
      },
      {
        "path": "/dev/vfio/0",
        "type": "c",
        "major": 245,
        "minor": 3,
        "fileMode": 438,
        "uid": 0,
        "gid": 0
      },
      {
        "path": "/dev/vfio/1",
        "type": "c",
        "major": 245,
        "minor": 2,
        "fileMode": 438,
        "uid": 0,
        "gid": 0
      },
      {
        "path": "/dev/vfio/vfio",
        "type": "c",
        "major": 10,
        "minor": 196,
        "fileMode": 438,
        "uid": 0,
        "gid": 0
      }
    ]
  }
}
pawalt commented 2 months ago

@manninglucas thanks for the config! I've tried this with a /sys mount, and I'm still getting the same error:

{
    "ociVersion": "1.0.2",
    "process": {
        "terminal": true,
        "user": {
            "uid": 0,
            "gid": 0
        },
        "args": [
            "/bin/sh"
        ],
        "env": [
            "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
            "LANG=C.UTF-8",
            "PYTHONUNBUFFERED=1"
        ],
        "cwd": "/"
    },
    "root": {
        "path": "jax-rootfs",
        "readonly": false
    },
    "hostname": "jax-container",
    "mounts": [
        {
            "destination": "/proc",
            "type": "proc",
            "source": "proc"
        },
        {
            "destination": "/dev",
            "type": "tmpfs",
            "source": "tmpfs",
            "options": [
                "nosuid",
                "strictatime",
                "mode=755",
                "size=65536k"
            ]
        },
        {
            "destination": "/sys",
            "type": "sysfs",
            "source": "/sys",
            "options": [
                "nosuid",
                "noexec",
                "nodev",
                "ro"
            ]
        }
    ],
    "linux": {
        "namespaces": [
            {
                "type": "pid"
            },
            {
                "type": "network"
            },
            {
                "type": "ipc"
            },
            {
                "type": "uts"
            },
            {
                "type": "mount"
            }
        ]
    }
}

and the run:

peyton@t1v-n-901fc2b8-w-0:~/tputesting$ ./start.sh 
Child PID: 235453
Press Enter to continue...

running container: starting container: starting root container: starting sandbox: failed to setupFS: mounting submounts: mount submount "/sys": failed to mount "/sys" (type: sysfs): no such file or directory, opts: &{{true false false false} true {true  0xc00059e8a0} false}

For what it's worth, this error message is not coming from the inner container - it's coming from the runsc-sandbox process, which is in its own mount namespace that appears to contain only /etc and /proc. This leads me to believe that the sandbox process needs to have /sys mirrored in, but I'm not sure how to do that.

From my testing, I'm actually not sure how the GKE sandbox example works. It looks to me like /sys isn't mounted in the sandbox's namespace, so I'm surprised that the hostDirEntries call doesn't fail. Are you avoiding putting the sandbox process in its own namespace or something?

pawalt commented 2 months ago

I see the issue - the code is looking for TPU devices at specific paths to decide whether it should bind them into the container. The issue is that in GCE VMs, those paths don't exist! I'm not sure how they work in the first place then, though :) (A rough sketch of this kind of path probing is below, after the find output.)

peyton@t1v-n-901fc2b8-w-0:~/tputesting$ sudo find /dev | grep vfio
/dev/vfio
/dev/vfio/vfio
peyton@t1v-n-901fc2b8-w-0:~/tputesting$ sudo find /dev | grep accel
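
A rough Go sketch of the probing described above; the patterns /dev/vfio/* and /dev/accel* are taken from the find commands in this thread, not from the gVisor source, so treat them as assumptions:

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	// Hypothetical candidate device paths, based on the output above;
	// the real detection logic lives in runsc itself.
	patterns := []string{"/dev/vfio/*", "/dev/accel*"}
	for _, pat := range patterns {
		matches, err := filepath.Glob(pat)
		if err != nil {
			fmt.Fprintln(os.Stderr, pat, err)
			continue
		}
		if len(matches) == 0 {
			fmt.Println(pat, "-> no matches (nothing to bind into the container)")
			continue
		}
		for _, m := range matches {
			fi, err := os.Stat(m)
			if err != nil {
				fmt.Println(m, "->", err)
				continue
			}
			fmt.Printf("%s (mode %v)\n", m, fi.Mode())
		}
	}
}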
manninglucas commented 2 months ago

That behavior is very strange to me. FWIW, here's what I see in my VM when I run find /dev/ | grep vfio:

/dev/vfio
/dev/vfio/0
/dev/vfio/1
/dev/vfio/2
/dev/vfio/3
/dev/vfio/vfio

What do you get when you run an unsandboxed TPU workload? How did you create your TPU VM?

pawalt commented 2 months ago

@manninglucas It turns out that at least part of this issue was the TPU VM image I was using. I was using tpu-vm-base, which apparently is quite out of date: https://github.com/google/jax/issues/13260

I've now switched to tpu-ubuntu2204-base:

➜  ~ gcloud compute tpus tpu-vm create another-peyton-tpu \                                                  
--zone=us-central1-a \
--accelerator-type=v5litepod-1 \
--version=tpu-ubuntu2204-base \
--project=<redacted>

While this gets further, it's now failing at a later step, while setting up the chroot for TPU devices:

W0822 17:59:26.377299   62469 util.go:64] FATAL ERROR: error setting up chroot: error configuring chroot for TPU devices: extracting TPU device minor: open /sys/class/vfio-dev/vfio0/device/vendor: no such file or directory
error setting up chroot: error configuring chroot for TPU devices: extracting TPU device minor: open /sys/class/vfio-dev/vfio0/device/vendor: no such file or directory

Do you know what image the GKE VMs are using?

pawalt commented 2 months ago

I may need to use v2-alpha-tpuv5-lite. I'll try that and get back to you. The fact that device mounting is different depending on the image used is really surprising to me. And it's even more surprising that you're allowed to mount incompatible images. https://cloud.google.com/tpu/docs/runtimes#pytorch_and_jax

pawalt commented 2 months ago

@manninglucas I've tried the new image with no luck. It looks like the device layout still does not match what gVisor expects. If you know which VM image GKE uses, that would be helpful. Here is some output:

peyton@t1v-n-9becfdd7-w-0:~/tputesting$ python3 -c "import jax; print(jax.device_count()); print(repr(jax.numpy.add(1, 1)))"
1
Array(2, dtype=int32, weak_type=True)
peyton@t1v-n-9becfdd7-w-0:~/tputesting$ sudo find /sys/class/vfio/
/sys/class/vfio/
/sys/class/vfio/0
peyton@t1v-n-9becfdd7-w-0:~/tputesting$ sudo find /dev/vfio/
/dev/vfio/
/dev/vfio/0
/dev/vfio/vfio
peyton@t1v-n-9becfdd7-w-0:~/tputesting$ ./start.sh 
running container: creating container: cannot create sandbox: cannot read client sync file: waiting for sandbox to start: EOF

And here are the relevant logs again:

I0822 20:44:05.045982   81014 main.go:201] **************** gVisor ****************
I0822 20:44:05.046877   81014 boot.go:264] Setting product_name: "Google Compute Engine"
I0822 20:44:05.046939   81014 boot.go:274] Setting host-shmem-huge: "never"
W0822 20:44:05.047571   81014 specutils.go:129] noNewPrivileges ignored. PR_SET_NO_NEW_PRIVS is assumed to always be set.
I0822 20:44:05.047595   81014 chroot.go:91] Setting up sandbox chroot in "/tmp"
I0822 20:44:05.047707   81014 chroot.go:36] Mounting "/proc" at "/tmp/proc"
W0822 20:44:05.047808   81014 util.go:64] FATAL ERROR: error setting up chroot: error configuring chroot for TPU devices: extracting TPU device minor: open /sys/class/vfio-dev/vfio0/device/device: no such file or directory
error setting up chroot: error configuring chroot for TPU devices: extracting TPU device minor: open /sys/class/vfio-dev/vfio0/device/device: no such file or directory
manninglucas commented 2 months ago

I believe the image is based on COS; it should be something like "tpu-vm-cos-109".

pawalt commented 2 months ago

@manninglucas Nice, those paths exist on that image:

peyton@t1v-n-00ca9571-w-0 ~ $ sudo find /dev/vfio/
/dev/vfio/
/dev/vfio/0
/dev/vfio/vfio
peyton@t1v-n-00ca9571-w-0 ~ $ sudo find /sys/class/vfio-dev/
/sys/class/vfio-dev/
/sys/class/vfio-dev/vfio0

This image is painful to work with because of the read-only filesystem, though. I may have to bite the bullet and figure out how to do the device mapping on v2-alpha-tpuv5-lite.

manninglucas commented 2 months ago

I will have a patch up soon that will hopefully fix the issue for the Ubuntu image you're using. It seems like /sys/class/vfio-dev/vfio0 just corresponds to /sys/class/vfio/0.
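
Not the actual patch, just a sketch of the fallback that correspondence suggests: try the newer /sys/class/vfio-dev/vfioN layout first and fall back to the older /sys/class/vfio/N path if it isn't present.

package main

import (
	"fmt"
	"os"
)

// vfioSysfsDir returns the sysfs directory for a vfio device, preferring the
// newer /sys/class/vfio-dev/vfioN layout and falling back to /sys/class/vfio/N.
// This is only an illustration of the correspondence noted above.
func vfioSysfsDir(n int) (string, error) {
	candidates := []string{
		fmt.Sprintf("/sys/class/vfio-dev/vfio%d", n),
		fmt.Sprintf("/sys/class/vfio/%d", n),
	}
	for _, c := range candidates {
		if _, err := os.Stat(c); err == nil {
			return c, nil
		}
	}
	return "", fmt.Errorf("no sysfs entry found for vfio device %d", n)
}

func main() {
	dir, err := vfioSysfsDir(0)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("using", dir)
}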

EtiennePerot commented 2 months ago

This image is painful to work with because of the read-only filesystem, though. I may have to bite the bullet and figure out how to do the device mapping on v2-alpha-tpuv5-lite.

You can remount the filesystem with mount -o remount,rw as root.

btw, COS has a tool called cos-toolbox which works around this issue and makes it easier to work with in general. It should be available by default.

manninglucas commented 1 month ago

Hey @pawalt were you able to get this working for your needs?

thundergolfer commented 1 month ago

hey @manninglucas we deprioritized getting this working. I think Peyton was maybe going to restart the effort when the patch landed: https://github.com/google/gvisor/issues/10795#issuecomment-2305739322.

manninglucas commented 1 month ago

Gotcha. The patch has finally landed (290789b), let me know when you're able to test this out again!

pawalt commented 1 month ago

@manninglucas thanks! The container is now starting up. I'm seeing a different issue when trying to use JAX in Python. Not sure if you want to make that part of this issue or another one:

peyton@t1v-n-1f714773-w-0:~/tputesting$ ./start.sh 
# python3
Python 3.11.9 (main, Aug 13 2024, 02:18:20) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import jax
>>> jax.device_count()

Failed to get TPU metadata (tpu-env) from instance metadata for variable CHIPS_PER_HOST_BOUNDS: INTERNAL: Couldn't connect to server
=== Source Location Trace: ===
learning/45eac/tfrc/runtime/gcp_metadata_utils.cc:99
learning/45eac/tfrc/runtime/env_var_utils.cc:50

Failed to get TPU metadata (tpu-env) from instance metadata for variable HOST_BOUNDS: INTERNAL: Couldn't connect to server
=== Source Location Trace: ===
learning/45eac/tfrc/runtime/gcp_metadata_utils.cc:99
learning/45eac/tfrc/runtime/env_var_utils.cc:50

^C^C^C^CFailed to get TPU metadata (tpu-env) from instance metadata for variable ALT: INTERNAL: Couldn't connect to server
=== Source Location Trace: ===
learning/45eac/tfrc/runtime/gcp_metadata_utils.cc:99
learning/45eac/tfrc/runtime/env_var_utils.cc:50

The command hangs on the jax.device_count() call, and it loops, spitting out a lot of these logs:

I0917 13:42:09.296208       1 strace.go:576] [   2:  26] python3 E futex(0x7ef41c4bce54, FUTEX_WAIT_BITSET|FUTEX_PRIVATE_FLAG, 0x0, 0x7ef3cf9feb60 {sec=182 nsec=996258753}, 0x0, 0xffffffff)
I0917 13:42:09.301492       1 strace.go:614] [   2:  26] python3 X futex(0x7ef41c4bce54, FUTEX_WAIT_BITSET|FUTEX_PRIVATE_FLAG, 0x0, 0x7ef3cf9feb60 {sec=182 nsec=996258753}, 0x0, 0xffffffff) = 0 (0x0) errno=110 (connection timed out) (5.274369ms)
I0917 13:42:09.301517       1 strace.go:576] [   2:  26] python3 E futex(0x7ef41c4bce58, FUTEX_WAKE|FUTEX_PRIVATE_FLAG, 0x1, null, 0x3b66c325, 0x16e)
I0917 13:42:09.301526       1 strace.go:614] [   2:  26] python3 X futex(0x7ef41c4bce58, FUTEX_WAKE|FUTEX_PRIVATE_FLAG, 0x1, null, 0x3b66c325, 0x16e) = 0 (0x0) (840ns)
I0917 13:42:09.301539       1 strace.go:576] [   2:  26] python3 E futex(0x7ef41c4bce54, FUTEX_WAIT_BITSET|FUTEX_PRIVATE_FLAG, 0x0, 0x7ef3cf9feb60 {sec=183 nsec=1590373}, 0x0, 0xffffffff)
I0917 13:42:09.306891       1 strace.go:614] [   2:  26] python3 X futex(0x7ef41c4bce54, FUTEX_WAIT_BITSET|FUTEX_PRIVATE_FLAG, 0x0, 0x7ef3cf9feb60 {sec=183 nsec=1590373}, 0x0, 0xffffffff) = 0 (0x0) errno=110 (connection timed out) (5.341239ms)
I0917 13:42:09.306924       1 strace.go:576] [   2:  26] python3 E futex(0x7ef41c4bce58, FUTEX_WAKE|FUTEX_PRIVATE_FLAG, 0x1, null, 0x1e73de, 0x170)
I0917 13:42:09.306938       1 strace.go:614] [   2:  26] python3 X futex(0x7ef41c4bce58, FUTEX_WAKE|FUTEX_PRIVATE_FLAG, 0x1, null, 0x1e73de, 0x170) = 0 (0x0) (1.32µs)
I0917 13:42:09.306953       1 strace.go:576] [   2:  26] python3 E futex(0x7ef41c4bce54, FUTEX_WAIT_BITSET|FUTEX_PRIVATE_FLAG, 0x0, 0x7ef3cf9feb60 {sec=183 nsec=6995742}, 0x0, 0xffffffff)

My startup script:

sudo runsc --debug \
    --debug-log=/home/peyton/tputesting/logs/ \
    --strace \
    --root=/home/peyton/tputesting/runroot \
    --tpuproxy \
    run \
    --bundle=/home/peyton/tputesting \
    my-jax-container

I'm using a jax image exported from the build below:

FROM python:3.11

RUN pip install jax[tpu] -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
manninglucas commented 1 month ago

@pawalt let's follow up with a new issue. Looks like libtpu is looking for some metadata that might be stored in an environment variable. Can you run env on the host?
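
The "Couldn't connect to server" lines suggest libtpu can't reach the GCE instance metadata server from inside the sandbox. A sketch of a quick check, assuming the attribute being queried is the tpu-env name shown in the error (that path is an assumption, not confirmed from libtpu sources):

package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	// Assumed attribute name, taken from the error text above.
	url := "http://metadata.google.internal/computeMetadata/v1/instance/attributes/tpu-env"
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	req.Header.Set("Metadata-Flavor", "Google") // required by the GCE metadata server
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Do(req)
	if err != nil {
		fmt.Fprintln(os.Stderr, "metadata server unreachable:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("status %s\n%s\n", resp.Status, body)
}

Running this both on the host and inside the container would show whether the metadata endpoint is reachable from the sandbox's network namespace.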

pawalt commented 1 month ago

@manninglucas I've opened #10923