google / gvisor

Application Kernel for Containers
https://gvisor.dev
Apache License 2.0

'nvproxy: rejecting frontendFDMemmapFile.MapInternal' when running llm.c on 1x H100 #10879

Closed thundergolfer closed 2 months ago

thundergolfer commented 2 months ago

Description

I was following https://github.com/karpathy/llm.c/discussions/677 on a gpu_1x_h100_pcie Lambda Labs instance, and it failed under runsc with an error I hadn't encountered before:

...
D0908 23:43:32.284178       1 uvm.go:136] [   1:   1] nvproxy: uvm ioctl 33 = 0x21
I0908 23:43:32.285828       1 strace.go:605] [   1:   1] train_gpt2cu X ioctl(0x19 /dev/nvidia-uvm, 0x21, 0x7fa19036c450) = 0 (0x0) (1.642965ms)
I0908 23:43:32.300429       1 strace.go:567] [   1:   1] train_gpt2cu E read(0x40 /llm.c/gpt2_124M_bf16.bin, 0x7f2c88000c00, 0x1fff000)
W0908 23:43:32.300773       1 log.go:351] nvproxy: rejecting frontendFDMemmapFile.MapInternal:
goroutine 212 [running]:
...

I don't understand this part of nvproxy well, so I thought it'd be most productive to report it here :)

Steps to reproduce

Run the following command on an image created from the Dockerfile below:

sudo docker run --gpus=all --runtime=runsc -it --workdir=/llm.c sha256:71fa2c57c10d070e7c81e481d5a41b476fde94f5b5b1bea77f5a2438c826cb1f ./train_gpt2cu

After around 10 seconds the following error shows:

Error: File read error at llmc/cuda_common.h:182
Error details:
  File: llmc/cuda_common.h
  Line: 182
  Expected elements: 33554432
  Read elements: 3072

Dockerfile

FROM nvidia/cuda:12.4.0-devel-ubuntu22.04
RUN apt-get update && apt-get -y install git wget curl
RUN wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb && \
        dpkg -i cuda-keyring_1.1-1_all.deb && \
        apt-get update && \
        apt-get -y install libcudnn9-dev-cuda-12 && \
        git clone https://github.com/NVIDIA/cudnn-frontend.git ~/cudnn-frontend && \
        apt -y install openmpi-bin openmpi-doc libopenmpi-dev

RUN git clone https://github.com/karpathy/llm.c.git && \
        cd llm.c && git checkout bd457aa19bdb7c0776725f05fe9ecb692558aed8 && \
       ./dev/download_starter_pack.sh && \
        make train_gpt2cu USE_CUDNN=1 && \
        test -f train_gpt2cu

RUN cd llm.c/dev/data/ && ./edu_fineweb.sh 10

NVIDIA Driver

NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2

runsc version

./bin/runsc --version
runsc version 0.0.0
spec: 1.1.0-rc.1

I built from source at commit: fa49677e141db94798f226dfb453f8771c14ae6f.

I used this commit because it's the last one before support for driver 535.129.03 was dropped.

docker version (if using docker)

sudo docker version
Client: Docker Engine - Community
 Version:           24.0.7
 API version:       1.43
 Go version:        go1.20.10
 Git commit:        afdd53b
 Built:             Thu Oct 26 09:07:41 2023
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          24.0.7
  API version:      1.43 (minimum version 1.12)
  Go version:       go1.20.10
  Git commit:       311b9ff
  Built:            Thu Oct 26 09:07:41 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.25
  GitCommit:        d8f198a4ed8892c764191ef7b3b06d8a2eeb5c7f
 runc:
  Version:          1.1.10
  GitCommit:        v1.1.10-0-g18a0cb0
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

uname

Linux 209-20-158-39 6.2.0-37-generic #38~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 2 18:01:13 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

kubectl (if using Kubernetes)

No response

repo state (if built from source)

git show
commit fa49677e141db94798f226dfb453f8771c14ae6f (HEAD)
Author: Andrei Vagin <avagin@google.com>
Date:   Fri Aug 30 13:33:47 2024 -0700

    Internal change

    PiperOrigin-RevId: 669428490

diff --git a/pkg/sentry/platform/kvm/machine.go b/pkg/sentry/platform/kvm/machine.go
index 278694d79..2d656588b 100644
--- a/pkg/sentry/platform/kvm/machine.go
+++ b/pkg/sentry/platform/kvm/machine.go
...

runsc debug logs (if available)

Here's a snippet around the error. I can include the full file if desired.

D0908 23:43:32.276734       1 frontend.go:518] [   1:   1] nvproxy: control command 0xd01, object 0xc1d00075
I0908 23:43:32.276815       1 strace.go:605] [   1:   1] train_gpt2cu X ioctl(0x18 /dev/nvidiactl, 0xc020462a, 0x7fa19036e620) = 0 (0x0) (107.409µs)
I0908 23:43:32.277316       1 strace.go:570] [   1:   1] train_gpt2cu E openat(AT_FDCWD /llm.c, 0x7fa19036efd0 /dev/nvidiactl, O_RDWR|O_CLOEXEC, 0o0)
I0908 23:43:32.277495       1 strace.go:608] [   1:   1] train_gpt2cu X openat(AT_FDCWD /llm.c, 0x7fa19036efd0 /dev/nvidiactl, O_RDWR|O_CLOEXEC, 0o0) = 65 (0x41) (114.868µs)
I0908 23:43:32.278068       1 strace.go:567] [   1:   1] train_gpt2cu E fcntl(0x41 /dev/nvidiactl, 0x1, 0x0)
I0908 23:43:32.278138       1 strace.go:605] [   1:   1] train_gpt2cu X fcntl(0x41 /dev/nvidiactl, 0x1, 0x0) = 1 (0x1) (3.849µs)
I0908 23:43:32.278636       1 strace.go:567] [   1:   1] train_gpt2cu E ioctl(0x18 /dev/nvidiactl, 0xc038464e, 0x7fa19036f120)
D0908 23:43:32.278695       1 frontend.go:227] [   1:   1] nvproxy: frontend ioctl: nr = 78 = 0x4e, argSize = 56
I0908 23:43:32.278805       1 strace.go:605] [   1:   1] train_gpt2cu X ioctl(0x18 /dev/nvidiactl, 0xc038464e, 0x7fa19036f120) = 0 (0x0) (106.412µs)
I0908 23:43:32.279276       1 strace.go:576] [   1:   1] train_gpt2cu E mmap(0x7f2c88000000, 0x4000000, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, 0x41 /dev/nvidiactl, 0x0)
I0908 23:43:32.279361       1 strace.go:614] [   1:   1] train_gpt2cu X mmap(0x7f2c88000000, 0x4000000, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, 0x41 /dev/nvidiactl, 0x0) = 139829236989952 (0x7f2c88000000) (20.174µs)
I0908 23:43:32.279860       1 strace.go:561] [   1:   1] train_gpt2cu E close(0x41 /dev/nvidiactl)
I0908 23:43:32.282636       1 strace.go:599] [   1:   1] train_gpt2cu X close(0x41 /dev/nvidiactl) = 0 (0x0) (2.64628ms)
I0908 23:43:32.283814       1 strace.go:567] [   1:   1] train_gpt2cu E ioctl(0x19 /dev/nvidia-uvm, 0x49, 0x7fa19036ed80)
D0908 23:43:32.283879       1 uvm.go:136] [   1:   1] nvproxy: uvm ioctl 73 = 0x49
I0908 23:43:32.283957       1 strace.go:605] [   1:   1] train_gpt2cu X ioctl(0x19 /dev/nvidia-uvm, 0x49, 0x7fa19036ed80) = 0 (0x0) (75.398µs)
I0908 23:43:32.284160       1 strace.go:567] [   1:   1] train_gpt2cu E ioctl(0x19 /dev/nvidia-uvm, 0x21, 0x7fa19036c450)
D0908 23:43:32.284178       1 uvm.go:136] [   1:   1] nvproxy: uvm ioctl 33 = 0x21
I0908 23:43:32.285828       1 strace.go:605] [   1:   1] train_gpt2cu X ioctl(0x19 /dev/nvidia-uvm, 0x21, 0x7fa19036c450) = 0 (0x0) (1.642965ms)
I0908 23:43:32.300429       1 strace.go:567] [   1:   1] train_gpt2cu E read(0x40 /llm.c/gpt2_124M_bf16.bin, 0x7f2c88000c00, 0x1fff000)
W0908 23:43:32.300773       1 log.go:351] nvproxy: rejecting frontendFDMemmapFile.MapInternal:
goroutine 212 [running]:
gvisor.dev/gvisor/pkg/log.Stacks(0x0)
        pkg/log/log.go:319 +0x67
gvisor.dev/gvisor/pkg/log.Traceback({0x13bc4e7, 0x33}, {0x0, 0x0, 0x0})
        pkg/log/log.go:350 +0x3b
gvisor.dev/gvisor/pkg/sentry/devices/nvproxy.(*frontendFDMemmapFile).MapInternal(0x0?, {0x0?, 0x0?}, {0x0?, 0x0?, 0x0?})
        pkg/sentry/devices/nvproxy/frontend_mmap.go:81 +0x26
gvisor.dev/gvisor/pkg/sentry/mm.pmaIterator.getInternalMappingsLocked({0xc0010e5808?, 0x8ab280?})
        pkg/sentry/mm/pma.go:987 +0x9d
gvisor.dev/gvisor/pkg/sentry/mm.(*MemoryManager).getIOMappingsLocked(0xc000bd1008, {0xc0010e5808?, 0x15e4b98?}, {0xc0005e9508?, 0xc000f67008?}, {0x0?, 0x0?, 0x0?})
        pkg/sentry/mm/io.go:693 +0x8d
gvisor.dev/gvisor/pkg/sentry/mm.(*MemoryManager).withInternalMappings(0xc000bd1008, {0x15e4b98, 0xc0005e9508}, {0x15dab78?, 0xc000db9ca8?}, {0x0?, 0x0?, 0x0?}, 0x0, 0xc00138f250)
        pkg/sentry/mm/io.go:565 +0x419
gvisor.dev/gvisor/pkg/sentry/mm.(*MemoryManager).withVecInternalMappings(0x0?, {0x15e4b98?, 0xc0005e9508?}, {0x0?, 0x2?, 0xc000c37708?, 0xc000c37608?}, {0x0, 0x1, 0x0}, ...)
        pkg/sentry/mm/io.go:606 +0x5b8
gvisor.dev/gvisor/pkg/sentry/mm.(*MemoryManager).CopyOutFrom(0xc000bd1008, {0x15e4b98, 0xc0005e9508}, {0x0?, 0xc0006f8000?, 0x0?, 0x88?}, {0x15c4e40, 0xc000cd4ff0}, {0x0, ...})
        pkg/sentry/mm/io.go:287 +0x245
gvisor.dev/gvisor/pkg/usermem.IOSequence.CopyOutFrom(...)
        pkg/usermem/usermem.go:508
gvisor.dev/gvisor/pkg/sentry/fsimpl/gofer.(*regularFileFD).PRead(0xc00043f9e0, {0x15e4b98, 0xc0005e9508}, {{0x15dcd58, 0xc000bd1008}, {0x0, 0x1, 0x7f2c88000c00, 0x1fff000}, {0x0, ...}}, ...)

runsc.log.20240908-234318.130092.boot.txt

ayushr2 commented 2 months ago

IIUC, train_gpt2cu is trying to write to its /dev/nvidiactl mapping at offset 0xc00. It first creates a writable MAP_SHARED mapping of /dev/nvidiactl at 0x7f2c88000000; a subsequent read(2) syscall then attempts to read from FD 0x40 (/llm.c/gpt2_124M_bf16.bin) directly into address 0x7f2c88000c00 (the mapping's start address + 0xc00).

I think we currently explicitly disallow this: https://github.com/google/gvisor/blob/3c4b246cf2947d6174af02fe895ff1c213ed7491/pkg/sentry/devices/nvproxy/frontend_mmap.go#L78-L83