Open mattnappo opened 4 months ago
This is a cuda-checkpoint issue: https://github.com/NVIDIA/cuda-checkpoint/issues/4.
You are correct. I just re-wrote my PyTorch program in CUDA, and `runsc checkpoint` worked. Until NVIDIA fixes this, could you advise how I could temporarily patch gVisor to close the FDs, similar to what was done here? Is this a simple task?
Update: I managed to get the torch example working by changing the `panic`s to warnings in this file. This is obviously very hacky, and I wonder if there is a way to manually release the `*nvproxy`?
FYI I ran the reproducer and got a different error. This is the same error from https://modal-public-assets.s3.amazonaws.com/gpu_ckpt_logs.zip:

```
W0522 21:57:27.787759 25674 util.go:64] FATAL ERROR: checkpoint failed: checkpointing container "d83b8fa4-08bd-4d69-9a6f-5c3c28e98856": encoding error: can't save with live nvproxy clients:
```
It seems that the `Can't save pma with non-MemoryFile of type *nvproxy.frontendFDMemmapFile` error pasted above is from a different run, where `cuda-checkpoint` was not run on PID 1.
Side note: need to fix some things in the Dockerfile:
> could you advise how I could temporarily patch gVisor to close the FDs?
To properly close FDs during checkpointing, you would need to iterate all FDTables during checkpointing to find nvproxy FDs (via type-assertion) and release/remove them. Given that we can't reasonably expect applications to continue working correctly after silently closing some of their FDs, we probably wouldn't want this in mainline runsc.
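A minimal, self-contained Go sketch of that iteration pattern. Note that the types below (`FileImpl`, `nvproxyFD`, `regularFile`) are illustrative stand-ins, not the actual gVisor `vfs`/`nvproxy` API; in real runsc code you would walk each task's FDTable and type-assert against the nvproxy frontend FD implementation.

```go
package main

import "fmt"

// FileImpl is a stand-in for gVisor's file-description implementation
// interface; the real type is different.
type FileImpl interface{ Name() string }

type regularFile struct{}

func (regularFile) Name() string { return "regular" }

// nvproxyFD is a stand-in for the nvproxy frontend FD type.
type nvproxyFD struct{ released bool }

func (*nvproxyFD) Name() string { return "nvproxy" }
func (f *nvproxyFD) Release()   { f.released = true }

// dropNvproxyFDs walks an FD table, releases every entry whose
// implementation type-asserts to the nvproxy FD type, and removes it
// from the table. It returns the number of FDs dropped. (Deleting map
// entries during range is well-defined in Go.)
func dropNvproxyFDs(table map[int]FileImpl) int {
	dropped := 0
	for fd, impl := range table {
		if nv, ok := impl.(*nvproxyFD); ok {
			nv.Release()      // release the device reference
			delete(table, fd) // remove the FD from the table
			dropped++
		}
	}
	return dropped
}

func main() {
	table := map[int]FileImpl{
		0: regularFile{},
		3: &nvproxyFD{},
	}
	fmt.Println(dropNvproxyFDs(table), len(table)) // drops the one nvproxy FD
}
```

As noted above, an application whose FDs are silently closed this way may misbehave afterwards, which is why this belongs in a local patch rather than mainline runsc.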
It seems like NVIDIA is aware of this issue, and is working on a fix. Until then, I'll use this temporary patch in my prototyping. Thank you for your help! I'm glad that this isn't a gVisor issue after all.
Description
Overview

Hi, I'm with modal.com. We are interested in using a combination of `cuda-checkpoint` and `runsc checkpoint` to snapshot GPUs within gVisor. The `cuda-checkpoint` utility freezes a CUDA process and copies the GPU state into CPU memory. We have managed to successfully run `cuda-checkpoint` from within a gVisor container. Ideally, we would then run `runsc checkpoint` (this is where the error lies). In principle, running the gVisor checkpointer after the CUDA checkpointer will checkpoint the GPU memory, since `cuda-checkpoint` moves the GPU memory into the CPU, which is then saved by `runsc checkpoint`.

Current Thinking

We currently believe the reason for this error is that gVisor acquires GPU devices before checkpointing, which prevents the checkpoint from succeeding because device files are left open. However, since gVisor doesn't need access to the GPU during a checkpoint, we believe it should not hold the GPU device.

Potential Solution

If this is indeed the source of the issue, we would be content with a fix/patch that doesn't acquire the GPU devices and makes it the user's job (ours) to keep track of mounting GPU devices on restore. If there is a way to make gVisor relinquish control of the GPU before checkpointing, that would also be desirable.
cc: @luiscape @thundergolfer
Steps to reproduce
Dockerfile:
Then follow the steps here to create an OCI bundle.
Run with
Then run `cuda-checkpoint` in the container (assuming the pid of `python3 /app/main.py` is 1):
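The toggle step would look something like the following, per the cuda-checkpoint README (the PID value is an assumption based on the note above that the Python process is PID 1 inside the container):

```shell
# Suspend CUDA state for PID 1 (python3 /app/main.py): this copies
# GPU memory into host (CPU) memory and frees the GPU resources.
cuda-checkpoint --toggle --pid 1
```

Running the same command again would toggle the process back to a running state on restore.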
Up to this point, everything works (running `nvidia-smi` shows no GPU processes). Now, the gVisor checkpoint:
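This step is a `runsc checkpoint` invocation along these lines (the image path and container ID below are placeholders, not from the original report):

```shell
# Save the sandbox state to an image directory. Because cuda-checkpoint
# already moved GPU memory into CPU memory, this image should capture it.
# "mycontainer" is a placeholder container ID.
runsc checkpoint --image-path=/tmp/ckpt mycontainer
```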
This should trigger the error.
runsc version
docker version (if using docker)
uname
Linux 5.15.0-1058-aws #64~20.04.1-Ubuntu x86_64 GNU/Linux
kubectl (if using Kubernetes)
repo state (if built from source)
N/A
runsc debug logs (if available)