Open mattnappo opened 4 months ago
This is a cuda-checkpoint issue: https://github.com/NVIDIA/cuda-checkpoint/issues/4.
You are correct. I just re-wrote my PyTorch program in CUDA, and `runsc checkpoint` worked. Until NVIDIA fixes this, could you advise how I could temporarily patch gVisor to close the FDs, similar to what was done here? Is this a simple task?
Update: I managed to get the torch example working by changing the `panic`s to warnings in this file. This is obviously very hacky, and I wonder if there is a way to manually release the `*nvproxy`?
FYI I ran the reproducer and got a different error. This is the same error from https://modal-public-assets.s3.amazonaws.com/gpu_ckpt_logs.zip:

```
W0522 21:57:27.787759 25674 util.go:64] FATAL ERROR: checkpoint failed: checkpointing container "d83b8fa4-08bd-4d69-9a6f-5c3c28e98856": encoding error: can't save with live nvproxy clients:
```
It seems that the `Can't save pma with non-MemoryFile of type *nvproxy.frontendFDMemmapFile` error pasted above is from a different run, where `cuda-checkpoint` was not run on PID 1.
Side note: need to fix some things in the Dockerfile:
> could you advise how I could temporarily patch gVisor to close the FDs?
To properly close FDs during checkpointing, you would need to iterate all FDTables during checkpointing to find nvproxy FDs (via type-assertion) and release/remove them. Given that we can't reasonably expect applications to continue working correctly after silently closing some of their FDs, we probably wouldn't want this in mainline runsc.
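A minimal, self-contained Go sketch of that iteration pattern. Note that the types below (`FileImpl`, `nvproxyFD`, `regularFile`) are illustrative stand-ins, not the actual gVisor `vfs`/`nvproxy` API; in real runsc code you would walk each task's FDTable and type-assert against the nvproxy frontend FD implementation.

```go
package main

import "fmt"

// FileImpl is a stand-in for gVisor's file-description implementation
// interface; the real type is different.
type FileImpl interface{ Name() string }

type regularFile struct{}

func (regularFile) Name() string { return "regular" }

// nvproxyFD is a stand-in for the nvproxy frontend FD type.
type nvproxyFD struct{ released bool }

func (*nvproxyFD) Name() string { return "nvproxy" }
func (f *nvproxyFD) Release()   { f.released = true }

// dropNvproxyFDs walks an FD table, releases every entry whose
// implementation type-asserts to the nvproxy FD type, and removes it
// from the table. It returns the number of FDs dropped. (Deleting map
// entries during range is well-defined in Go.)
func dropNvproxyFDs(table map[int]FileImpl) int {
	dropped := 0
	for fd, impl := range table {
		if nv, ok := impl.(*nvproxyFD); ok {
			nv.Release()      // release the device reference
			delete(table, fd) // remove the FD from the table
			dropped++
		}
	}
	return dropped
}

func main() {
	table := map[int]FileImpl{
		0: regularFile{},
		3: &nvproxyFD{},
	}
	fmt.Println(dropNvproxyFDs(table), len(table)) // drops the one nvproxy FD
}
```

As noted above, an application whose FDs are silently closed this way may misbehave afterwards, which is why this belongs in a local patch rather than mainline runsc.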
It seems like NVIDIA is aware of this issue, and is working on a fix. Until then, I'll use this temporary patch in my prototyping. Thank you for your help! I'm glad that this isn't a gVisor issue after all.
Description
Overview

Hi, I'm with modal.com. We are interested in using a combination of `cuda-checkpoint` and `runsc checkpoint` to snapshot GPUs within gVisor. The `cuda-checkpoint` utility freezes a CUDA process and copies the GPU state into CPU memory. We have managed to successfully run `cuda-checkpoint` from within a gVisor container. Ideally, we would then run `runsc checkpoint` (this is where the error lies). In principle, running the gVisor checkpointer after the CUDA checkpointer will checkpoint the GPU memory, since `cuda-checkpoint` moves the GPU memory into the CPU, which is then saved by `runsc checkpoint`.

Current Thinking

We currently believe the reason for this error is that gVisor acquires GPU devices before checkpointing, which prevents the checkpoint from succeeding because device files are left open. However, since gVisor doesn't need access to the GPU during a checkpoint, we believe it should not hold the GPU device.

Potential Solution

If this is indeed the source of the issue, we would be content with a fix/patch that doesn't acquire the GPU devices and makes it the user's job (ours) to keep track of mounting GPU devices on restore. If there is a way to make gVisor relinquish control of the GPU before checkpointing, that would also be desirable.
cc: @luiscape @thundergolfer
Steps to reproduce
Dockerfile:
Then follow the steps here to create an OCI bundle.
Run with
Then run `cuda-checkpoint` in the container (assuming the pid of `python3 /app/main.py` is 1):
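The toggle step would look something like the following, per the cuda-checkpoint README (the PID value is an assumption based on the note above that the Python process is PID 1 inside the container):

```shell
# Suspend CUDA state for PID 1 (python3 /app/main.py): this copies
# GPU memory into host (CPU) memory and frees the GPU resources.
cuda-checkpoint --toggle --pid 1
```

Running the same command again would toggle the process back to a running state on restore.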
Up to this point, everything works (running `nvidia-smi` shows no GPU processes). Now, the gVisor checkpoint:
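This step is a `runsc checkpoint` invocation along these lines (the image path and container ID below are placeholders, not from the original report):

```shell
# Save the sandbox state to an image directory. Because cuda-checkpoint
# already moved GPU memory into CPU memory, this image should capture it.
# "mycontainer" is a placeholder container ID.
runsc checkpoint --image-path=/tmp/ckpt mycontainer
```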
This should trigger the error.
runsc version
docker version (if using docker)
uname
Linux 5.15.0-1058-aws #64~20.04.1-Ubuntu x86_64 GNU/Linux
kubectl (if using Kubernetes)
repo state (if built from source)
N/A
runsc debug logs (if available)