threestudio-project / threestudio

A unified framework for 3D content generation.
Apache License 2.0
6.17k stars 475 forks

Docker error, CUDA not found #389

Closed Eecornwell closed 8 months ago

Eecornwell commented 9 months ago

While using the Dockerfile provided in the repo, I run into the error below, which leads me to believe the Dockerfile is out of date. Does anyone here have a working Docker configuration? I was able to build with the modified Dockerfile below, but I am seeing what looks like a memory leak when I deploy and run it: I can run one trial before the 30GB disk fills up. Any ideas why the disk is filling up? I run the launch.py script at startup and feed it the standard parameters.

Command:

python launch.py --config configs/dreamfusion-if.yaml --train --gpu 0 system.prompt_processor.prompt="A cactus"

Error:

7.417 Building wheels for collected packages: nerfacc
7.421 Building wheel for nerfacc (setup.py): started
10.40 Building wheel for nerfacc (setup.py): finished with status 'error'
10.41 error: subprocess-exited-with-error
10.41
10.41 × python setup.py bdist_wheel did not run successfully.
10.41 │ exit code: 1
10.41 ╰─> [155 lines of output]
10.41 No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
10.41 running bdist_wheel
10.41 running build
10.41 running build_py
10.41 creating build
10.41 creating build/lib.linux-x86_64-cpython-310
10.41 creating build/lib.linux-x86_64-cpython-310/nerfacc
10.41 copying nerfacc/__init__.py -> build/lib.linux-x86_64-cpython-310/nerfacc
10.41 copying nerfacc/grid.py -> build/lib.linux-x86_64-cpython-310/nerfacc
10.41 copying nerfacc/pack.py -> build/lib.linux-x86_64-cpython-310/nerfacc
10.41 copying nerfacc/cameras.py -> build/lib.linux-x86_64-cpython-310/nerfacc
10.41 copying nerfacc/data_specs.py -> build/lib.linux-x86_64-cpython-310/nerfacc
10.41 copying nerfacc/pdf.py -> build/lib.linux-x86_64-cpython-310/nerfacc
10.41 copying nerfacc/version.py -> build/lib.linux-x86_64-cpython-310/nerfacc
10.41 copying nerfacc/scan.py -> build/lib.linux-x86_64-cpython-310/nerfacc
10.41 copying nerfacc/cameras2.py -> build/lib.linux-x86_64-cpython-310/nerfacc
10.41 copying nerfacc/volrend.py -> build/lib.linux-x86_64-cpython-310/nerfacc
10.41 creating build/lib.linux-x86_64-cpython-310/nerfacc/cuda
10.41 copying nerfacc/cuda/__init__.py -> build/lib.linux-x86_64-cpython-310/nerfacc/cuda
10.41 copying nerfacc/cuda/_backend.py -> build/lib.linux-x86_64-cpython-310/nerfacc/cuda
10.41 creating build/lib.linux-x86_64-cpython-310/nerfacc/estimators
10.41 copying nerfacc/estimators/occ_grid.py -> build/lib.linux-x86_64-cpython-310/nerfacc/estimators
10.41 copying nerfacc/estimators/__init__.py -> build/lib.linux-x86_64-cpython-310/nerfacc/estimators
10.41 copying nerfacc/estimators/base.py -> build/lib.linux-x86_64-cpython-310/nerfacc/estimators
10.41 copying nerfacc/estimators/prop_net.py -> build/lib.linux-x86_64-cpython-310/nerfacc/estimators
10.41 running egg_info
10.41 creating nerfacc.egg-info
10.41 writing nerfacc.egg-info/PKG-INFO
10.41 writing dependency_links to nerfacc.egg-info/dependency_links.txt
10.41 writing requirements to nerfacc.egg-info/requires.txt
10.41 writing top-level names to nerfacc.egg-info/top_level.txt
10.41 writing manifest file 'nerfacc.egg-info/SOURCES.txt'
10.41 reading manifest file 'nerfacc.egg-info/SOURCES.txt'
10.41 reading manifest template 'MANIFEST.in'
10.41 warning: no files found matching 'nerfacc/_cuda/csrc/include/'
10.41 warning: no files found matching 'nerfacc/_cuda/csrc/'
10.41 adding license file 'LICENSE'
10.41 writing manifest file 'nerfacc.egg-info/SOURCES.txt'
10.41 /usr/local/lib/python3.10/dist-packages/setuptools/command/build_py.py:207: _Warning: Package 'nerfacc.cuda.csrc' is absent from the packages configuration.
10.41 !!

New Dockerfile (no error, but memory leak):

FROM pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel

ARG USER_NAME=dreamer
ARG GROUP_NAME=dreamers
ARG UID=1000
ARG GID=1000

# Set compute capability for nerfacc and tiny-cuda-nn
# See https://developer.nvidia.com/cuda-gpus and limit number to speed-up build
ENV TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0 7.5 8.0 8.6 8.9 9.0+PTX"
ENV TCNN_CUDA_ARCHITECTURES=90;89;86;80;75;70;61;60
# Speed-up build for RTX 30xx
# ENV TORCH_CUDA_ARCH_LIST="8.6"
# Speed-up build for RTX 40xx
# ENV TORCH_CUDA_ARCH_LIST="8.9"
# ENV TCNN_CUDA_ARCHITECTURES=89

# Set CUDA Environment Vars
ENV CUDA_HOME=/usr/local/cuda
ENV PATH=${CUDA_HOME}/bin:/home/${USER_NAME}/.local/bin:${PATH}
ENV LD_LIBRARY_PATH=${CUDA_HOME}/lib64:${LD_LIBRARY_PATH}
ENV LIBRARY_PATH=${CUDA_HOME}/lib64/stubs:${LIBRARY_PATH}

# Install pre-dependencies
RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
    build-essential \
    curl \
    git \
    libegl1-mesa-dev \
    libgl1-mesa-dev \
    libgles2-mesa-dev \
    libglib2.0-0 \
    libsm6 \
    libxext6 \
    libxrender1 \
    python-is-python3 \
    python3.10 \
    python3-pip \
    wget \
    && rm -rf /var/lib/apt/lists/*

# Install and upgrade pip and dependencies
RUN pip install --upgrade pip setuptools ninja diffusers==0.20.2 mediapipe
RUN pip install torch torchvision torchaudio

# Install nerfacc and tiny-cuda-nn before installing requirements.txt
# because these two installations are time consuming and error prone
RUN pip install git+https://github.com/KAIR-BAIR/nerfacc.git@v0.5.2
RUN pip install git+https://github.com/NVlabs/tiny-cuda-nn.git#subdirectory=bindings/torch

# Clone ThreeStudio
RUN git clone https://github.com/threestudio-project/threestudio.git /home/${USER_NAME}/workspace/threestudio
WORKDIR /home/${USER_NAME}/workspace/threestudio
RUN git checkout 8a51c37317b6f7cd74bb3cb24c975b56d0a96703
RUN pip install -r requirements.txt

# Change user to non-root user
RUN groupadd -g ${GID} ${GROUP_NAME} \
    && useradd -ms /bin/sh -u ${UID} -g ${GID} ${USER_NAME}
USER ${USER_NAME}

WORKDIR /home/${USER_NAME}/workspace/threestudio

Eecornwell commented 9 months ago

Actually, I'm not sure how this was working with the new base image, since nerfacc requires CUDA <= 11.8. Also, looking at the Dockerfile above, it looks like I left out the Python dev library. I'm going to add the Python dev library, roll the base image back to the original, and retry.

Eecornwell commented 9 months ago

Ah, looks like an NVIDIA driver issue when using your published Dockerfile (on V100s):

RuntimeError: The NVIDIA driver on your system is too old (found version 11080). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver.
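
For readers decoding the error above: the `11080` is the CUDA version the installed driver reports, encoded as 1000 * major + 10 * minor, so it corresponds to CUDA 11.8. A minimal sketch of that decoding follows; the helper name `decode_cuda_version` is hypothetical and not part of PyTorch or threestudio.

```python
def decode_cuda_version(code: int) -> tuple[int, int]:
    """Decode a CUDA version integer encoded as 1000 * major + 10 * minor."""
    major, remainder = divmod(code, 1000)
    return major, remainder // 10


# Value copied from the RuntimeError message in this thread.
major, minor = decode_cuda_version(11080)
print(f"Driver reports support for CUDA {major}.{minor}")
```

If the PyTorch build inside the image was compiled against a newer CUDA release than the driver reports, this error is the expected outcome. Whether that was the cause here is an assumption; the thread does not say which PyTorch build ended up installed.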

Eecornwell commented 9 months ago

OK, I built everything from scratch on a new machine and the problem seems to have resolved itself. The only noticeable difference is that I am now building the Docker image on an AL2 OS with preinstalled CUDA, whereas the build with the memory leak was on Ubuntu 22.04.

Eecornwell commented 8 months ago

Turns out I misinterpreted how checkpoint.every_n_train_steps works. I assumed it would update the ckpts/last.ckpt file, but it looks like it adds new checkpoint files to the ckpts folder each time. @DSaurus is this normal behavior? If so, is there a better way to ensure a ckpt is written locally that overwrites the previous last.ckpt?
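
One way to confirm that checkpoints are accumulating is to list the files in the ckpts folder with their sizes. The sketch below uses only the standard library; the function name `summarize_checkpoints` and the `*.ckpt` pattern are assumptions for illustration, not threestudio APIs.

```python
from pathlib import Path


def summarize_checkpoints(ckpt_dir):
    """Return ([(file name, size in bytes), ...] sorted by name, total bytes)."""
    files = sorted(Path(ckpt_dir).glob("*.ckpt"), key=lambda p: p.name)
    sizes = [(p.name, p.stat().st_size) for p in files]
    total = sum(size for _, size in sizes)
    return sizes, total
```

Calling `summarize_checkpoints` on the trial's ckpts directory at two different times shows whether the file count and total size keep growing. If the checkpoint options are forwarded to PyTorch Lightning's `ModelCheckpoint` (an assumption worth checking in launch.py), its `save_top_k` parameter is the documented setting for how many checkpoint files are kept.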

Eecornwell commented 8 months ago

I ended up setting checkpoint.every_n_train_steps and running another thread that cleans up the checkpoint folder by removing old checkpoint files. I keep only last.ckpt and the max-iteration ckpt, pruning on an interval.
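
The clean-up described above could look roughly like the following sketch. It is not the code actually used in this thread. It assumes checkpoint file names contain `step=<N>` (for example `epoch=0-step=200.ckpt`), and the names `prune_checkpoints` and `start_pruner` are hypothetical.

```python
import re
import threading
from pathlib import Path

# Assumption: checkpoint file names embed the training step as "step=<N>".
STEP_PATTERN = re.compile(r"step=(\d+)")


def prune_checkpoints(ckpt_dir):
    """Delete every stepped *.ckpt file except the one with the highest step.

    last.ckpt is never touched. Returns the removed file names, lowest step first.
    """
    stepped = []
    for path in Path(ckpt_dir).glob("*.ckpt"):
        match = STEP_PATTERN.search(path.name)
        if path.name != "last.ckpt" and match:
            stepped.append((int(match.group(1)), path))
    stepped.sort()
    removed = []
    for _, path in stepped[:-1]:  # keep the highest-step checkpoint
        path.unlink(missing_ok=True)
        removed.append(path.name)
    return removed


def start_pruner(ckpt_dir, interval_s=60.0):
    """Run prune_checkpoints every interval_s seconds on a daemon thread.

    Returns a threading.Event; call .set() on it to stop the loop.
    """
    stop = threading.Event()

    def loop():
        # Event.wait returns False on timeout and True once stop is set.
        while not stop.wait(interval_s):
            prune_checkpoints(ckpt_dir)

    threading.Thread(target=loop, daemon=True).start()
    return stop
```

Because the pruner only ever deletes files older than the newest stepped checkpoint, it should not race with a checkpoint that is still being written, though that depends on the trainer creating files in step order.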