lfranke / TRIPS

https://lfranke.github.io/trips/
MIT License
495 stars · 28 forks

Dockerfile and CUDA and CUDNN issues with GPU detected #47

Open samhodge-aiml opened 3 months ago

samhodge-aiml commented 3 months ago

Trying to build from a CI/CD pipeline, I got the build error attached as builderror.txt.zip.

The important parts:

-- Found CUDNN: /usr/lib/x86_64-linux-gnu/libcudnn.so  
-- Found cuDNN: v8.9.6  (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libcudnn.so)
CMake Warning at External/libtorch/share/cmake/Caffe2/public/cuda.cmake:214 (message):
  Failed to compute shorthash for libnvrtc.so
Call Stack (most recent call first):
  External/libtorch/share/cmake/Caffe2/Caffe2Config.cmake:92 (include)
  External/libtorch/share/cmake/Torch/TorchConfig.cmake:68 (find_package)
  CMakeLists.txt:78 (find_package)

-- Automatic GPU detection failed. Building for common architectures.
-- Autodetected CUDA architecture(s): 3.5;5.0;5.2;6.0;6.1;7.0;7.5;8.0;8.6;8.6+PTX
-- Added CUDA NVCC flags for: -gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_86,code=compute_86
-- Found Torch: /app/External/libtorch/lib/libtorch.so  
-- Package torch                      Yes, at /app/External/libtorch/include;/app/External/libtorch/include/torch/csrc/api/include
CMAKE_EXE_LINKER_FLAGS before: -Wl,--no-as-needed
TORCH_LIBRARIES: torch;torch_library;/app/External/libtorch/lib/libc10.so;/app/External/libtorch/lib/libkineto.a;/usr/local/cuda/lib64/stubs/libcuda.so;/usr/local/cuda/lib64/libnvrtc.so;/usr/local/cuda/lib64/libnvToolsExt.so;/usr/local/cuda/lib64/libcudart.so;/app/External/libtorch/lib/libc10_cuda.so
TORCH_CXX_FLAGS: -D_GLIBCXX_USE_CXX11_ABI=1
CMAKE_EXE_LINKER_FLAGS after: -Wl,--as-needed
CUDNN_LIBRARY_PATH: /usr/lib/x86_64-linux-gnu/libcudnn.so; CUDNN_INCLUDE_PATH: /usr/include
-- Obtained CUDA architectures automatically from installed GPUs
-- Automatic GPU detection failed. Building for Turing and Ampere as a best guess.
-- Targeting CUDA architectures: 75;86
-- SAIGA_CUDA_VERSION 
-- Found CUDAToolkit: /usr/local/cuda/targets/x86_64-linux/include (found suitable version "11.8.89", minimum required is "10.2") 
-- Enabled CUDA. Version: 11.8.89
-- Package CUDA::cudart               Yes, at /usr/local/cuda/targets/x86_64-linux/include
-- Package CUDA::nppif                Yes, at /usr/local/cuda/targets/x86_64-linux/include
-- Package CUDA::nppig                Yes, at /usr/local/cuda/targets/x86_64-linux/include
-- SAIGA_CUDA_FLAGS: -Xcompiler=-fopenmp;-Xcompiler=-march=native;-use_fast_math;--expt-relaxed-constexpr;-Xcudafe=--diag_suppress=esa_on_defaulted_function_ignored;-Xcudafe=--diag_suppress=field_without_dll_interface;-Xcudafe=--diag_suppress=base_class_has_different_dll_interface;-Xcudafe=--diag_suppress=dll_interface_conflict_none_assumed;-Xcudafe=--diag_suppress=dll_interface_conflict_dllexport_assumed
-- Using automatic CUDA Arch detection...
-- Automatic GPU detection failed. Building for common architectures.
-- Autodetected CUDA architecture(s): 3.5;5.0;5.2;6.0;6.1;7.0;7.5;8.0;8.6;8.6+PTX
-- SAIGA_CUDA_ARCH: 
-- SAIGA_CUDA_ARCH_FLAGS: -gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_86,code=compute_86
-- 
Compiler Flags:
-- SAIGA_CXX_FLAGS: -Wall;-Werror=return-type;-Wno-strict-aliasing;-Wno-sign-compare;-march=native;-fopenmp
-- SAIGA_PRIVATE_CXX_FLAGS: -fvisibility=hidden
-- SAIGA_LD_FLAGS: -fopenmp
-- CMAKE_CXX_FLAGS: 
-- CMAKE_CXX_FLAGS_DEBUG: -g
-- CMAKE_CXX_FLAGS_RELWITHDEBINFO: -O2 -g -DNDEBUG
-- CMAKE_CXX_FLAGS_RELEASE: -O3 -DNDEBUG
-- 
CUDA Compiler Flags:
-- CMAKE_CUDA_FLAGS: 
-- CMAKE_CUDA_FLAGS_DEBUG: -g
-- CMAKE_CUDA_FLAGS_RELWITHDEBINFO: -O2 -g -DNDEBUG
-- CMAKE_CUDA_FLAGS_RELEASE: -O3 -DNDEBUG
[ 17%] Built target signalhandler_unittest
[ 17%] Building CUDA object External/tiny-cuda-nn/CMakeFiles/tiny-cuda-nn.dir/src/common.cu.o
nvcc warning : The 'compute_35', 'compute_37', 'sm_35', and 'sm_37' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
nvcc fatal   : A single input file is required for a non-link phase when an outputfile is specified
make[2]: *** [External/tiny-cuda-nn/CMakeFiles/tiny-cuda-nn.dir/build.make:77: External/tiny-cuda-nn/CMakeFiles/tiny-cuda-nn.dir/src/common.cu.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:2579: External/tiny-cuda-nn/CMakeFiles/tiny-cuda-nn.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
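The `nvcc fatal` above follows two "Automatic GPU detection failed" warnings, which suggests the architecture list passed to nvcc ended up malformed when no GPU was visible during the build. One common workaround is to pin the CUDA architectures explicitly so nothing depends on runtime GPU detection. A sketch of a Dockerfile fragment, assuming architectures 7.5/8.6 and that TRIPS's CMake honors the standard `TORCH_CUDA_ARCH_LIST` (read by libtorch/Caffe2's `cuda.cmake`) and `TCNN_CUDA_ARCHITECTURES` (read by tiny-cuda-nn) variables — not tested against this repo:

```dockerfile
# Pin CUDA architectures so the build does not need a GPU at image-build time.
ENV TORCH_CUDA_ARCH_LIST="7.5;8.6"
ENV TCNN_CUDA_ARCHITECTURES=86
RUN cmake -B build -DCMAKE_CUDA_ARCHITECTURES="75;86" . \
 && cmake --build build -j
```

The exact architecture values should match the GPUs the image will actually run on.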
samhodge-aiml commented 3 months ago

@michaldigimansai

Do you have anything you can suggest?

abecadel commented 3 months ago

Looks like there is no GPU available at build time.

samhodge-aiml commented 3 months ago

I was able to build the nerfstudio Docker image in the same CI/CD environment and that used the GPU for nvcc calls.

samhodge-aiml commented 3 months ago

This seems related to:

https://stackoverflow.com/questions/59691207/docker-build-with-nvidia-runtime/61737404#61737404

The same thing happened on my machine, so I will look into a Kaniko-based solution in Kubernetes, since that is the runtime environment.
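For reference, the workaround in that Stack Overflow answer is to make the NVIDIA runtime the Docker daemon's default, so the GPU is visible during `docker build` as well as `docker run`. A sketch of `/etc/docker/daemon.json`, assuming a standard nvidia-container-runtime install (restart the daemon after editing):

```json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```

This only helps on hosts you control; it does not apply to builders like Kaniko that run without the Docker daemon.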

samhodge-aiml commented 3 months ago

Looks like a Docker-in-Docker or Kaniko setup is required:

https://gitlab.cvh-server.de/ckaufmann/gpu-cluster-images/-/blob/master/README.md#build-with-kaniko
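A minimal sketch of a GitLab CI job using the Kaniko executor, along the lines of that README (image tag and CI variables are the standard GitLab-provided ones; untested here). Note that Kaniko itself has no GPU access, so this would still need the CUDA architectures pinned in the Dockerfile rather than autodetected:

```yaml
build-image:
  image:
    name: gcr.io/kaniko-project/executor:debug
    entrypoint: [""]
  script:
    - /kaniko/executor
      --context "$CI_PROJECT_DIR"
      --dockerfile "$CI_PROJECT_DIR/Dockerfile"
      --destination "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"
```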