Closed by thundergolfer 2 months ago
IIUC, train_gpt2cu is trying to write to /dev/nvidiactl at offset 0xc00. It first creates a writable mapping for /dev/nvidiactl at 0x7f2c88000000. Then a subsequent read syscall attempts to read from FD 0x40 (/llm.c/gpt2_124M_bf16.bin) into address 0x7f2c88000c00 (which is the mapped region start address + 0xc00).
I think we currently explicitly disallow this: https://github.com/google/gvisor/blob/3c4b246cf2947d6174af02fe895ff1c213ed7491/pkg/sentry/devices/nvproxy/frontend_mmap.go#L78-L83
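To make the access pattern above concrete, here is a minimal sketch of "map a file writable, then read() another file straight into the mapping at offset 0xc00". It assumes Linux and CPython, and it uses an ordinary temporary file as a stand-in for /dev/nvidiactl, so it does not touch the NVIDIA driver or nvproxy at all. The helper name read_into_mapping, the two-page mapping size, and the 16-byte payload are made up for illustration.

```python
import mmap
import os
import tempfile

MAP_SIZE = 0x2000  # two 4 KiB pages; arbitrary size chosen for this sketch
OFFSET = 0xC00     # offset into the mapping, as described in the comment above


def read_into_mapping(src_path, backing_path, offset=OFFSET, length=16):
    """Read up to `length` bytes from src_path directly into a shared,
    writable mapping of backing_path, starting `offset` bytes in.

    backing_path is a plain file standing in for the device node; it must
    already be at least MAP_SIZE bytes long.
    """
    with open(backing_path, "r+b") as backing, \
            open(src_path, "rb", buffering=0) as src:
        mm = mmap.mmap(backing.fileno(), MAP_SIZE,
                       flags=mmap.MAP_SHARED,
                       prot=mmap.PROT_READ | mmap.PROT_WRITE)
        try:
            # Unbuffered readinto() performs the read with the destination
            # buffer inside the mapped region, i.e. at mapping start + offset.
            with memoryview(mm)[offset:offset + length] as dst:
                n = src.readinto(dst)
            mm.flush()
        finally:
            mm.close()
    return n


if __name__ == "__main__":
    # The addresses from the comment: mapping start + 0xc00.
    assert 0x7F2C88000000 + OFFSET == 0x7F2C88000C00

    workdir = tempfile.mkdtemp()
    src = os.path.join(workdir, "weights.bin")      # stand-in for the .bin file
    backing = os.path.join(workdir, "backing.bin")  # stand-in for the device
    with open(src, "wb") as f:
        f.write(b"0123456789abcdef")
    with open(backing, "wb") as f:
        f.truncate(MAP_SIZE)
    print(read_into_mapping(src, backing))
```

With a regular file as the backing object the read simply lands in the file at offset 0xc00. The point of the sketch is only that the destination of the read is an address inside a file-backed mapping rather than ordinary heap memory.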
Description
I was following https://github.com/karpathy/llm.c/discussions/677 using a gpu_1x_h100_pcie Lambda Labs instance and it fails on runsc with an issue I hadn't encountered before:
I don't really understand this part of nvproxy well so thought it'd be most productive to send it here :)
Steps to reproduce
Run the following command on an image created from the Dockerfile below:
After around 10 seconds the following error shows:
Dockerfile
NVIDIA Driver
runsc version
I built from source at commit: fa49677e141db94798f226dfb453f8771c14ae6f. I used this version as it's the last before driver support for 535.129.03 was dropped.
docker version (if using docker)
uname
Linux 209-20-158-39 6.2.0-37-generic #38~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 2 18:01:13 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
kubectl (if using Kubernetes)
No response
repo state (if built from source)
runsc debug logs (if available)
runsc.log.20240908-234318.130092.boot.txt