Closed: Eecornwell closed this issue 8 months ago
Actually, I'm not sure how this was working with the new base image, since nerfacc requires CUDA <= 11.8. Also, looking at the above script, it looks like I left out the Python dev library. I'm going to roll the base image back to the original, add the Python change, and retry.
Ah, looks like an NVIDIA driver issue when using your published Dockerfile (on V100s):
RuntimeError: The NVIDIA driver on your system is too old (found version 11080). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver.
OK, I built everything from scratch on a new machine and the problem seems to have resolved itself. The only noticeable difference is that I am now building the Docker image on an AL2 OS with preinstalled CUDA, whereas the build that showed the memory leak was on Ubuntu 22.04.
Turns out I misinterpreted how `checkpoint.every_n_train_steps` works. I assumed it would update the `ckpts/last.ckpt` file in place, but it looks like it adds a new checkpoint file to the `ckpts` folder each time. @DSaurus is this normal behavior? If so, is there a better way to ensure a ckpt is written locally that overwrites the previous `last.ckpt`?
I ended up setting `checkpoint.every_n_train_steps` and then running another thread that cleans up the checkpoint folder on an interval, removing the old checkpoint files. I keep only `last.ckpt` and the max-iteration ckpt.
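For anyone who wants to do the same, here is a minimal sketch of that kind of clean-up thread. It is not the exact script I ran. It assumes the per-interval checkpoint files carry the step count in their name as `step=<N>` (the default PyTorch Lightning naming style), and the names `prune_checkpoints` and `start_pruner` are made up for illustration. Files whose names contain no step number are left alone.

```python
import re
import threading
from pathlib import Path

# Assumption: checkpoint names look like "epoch=0-step=1000.ckpt".
STEP_RE = re.compile(r"step=(\d+)")


def prune_checkpoints(ckpt_dir, keep=("last.ckpt",)):
    """Delete stepped *.ckpt files except the highest-step one.

    Files listed in `keep` and files without a step number are never
    touched. Returns the sorted names of the files that were removed.
    """
    stepped = []
    for path in Path(ckpt_dir).glob("*.ckpt"):
        if path.name in keep:
            continue
        match = STEP_RE.search(path.name)
        if match:
            stepped.append((int(match.group(1)), path))

    # Sort by step number so the last entry is the max-iteration ckpt.
    stepped.sort(key=lambda item: item[0])
    removed = []
    for _, path in stepped[:-1]:
        path.unlink(missing_ok=True)
        removed.append(path.name)
    return sorted(removed)


def start_pruner(ckpt_dir, interval_s=60.0):
    """Run prune_checkpoints every `interval_s` seconds in a daemon thread.

    Returns a threading.Event; call .set() on it to stop the loop.
    """
    stop = threading.Event()

    def loop():
        # Event.wait returns False on timeout, True once stop is set.
        while not stop.wait(interval_s):
            prune_checkpoints(ckpt_dir)

    threading.Thread(target=loop, daemon=True).start()
    return stop
```

Calling `start_pruner("path/to/ckpts")` before training starts is enough; the directory path is a placeholder. A config-only alternative may exist: Lightning's `ModelCheckpoint` takes a `save_top_k` argument, and a value of `-1` keeps every checkpoint, so if the project config sets it that way, changing it might avoid the extra thread. I have not verified that here.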
While using the Dockerfile provided in the repo, I run into the error below, which leads me to believe the Dockerfile is out of date. Does anyone here have a working Docker configuration? I was able to use the modified Dockerfile below, but I am seeing what looks like a memory leak when I deploy and run it: I can run one trial before the 30GB disk fills up. Any ideas why the disk is filling up? I am running the launch.py script on launch and feeding it the standard parameters.
Command:
Error:
New Dockerfile (no error, but memory leak):