mlcommons / training

Reference implementations of MLPerf™ training benchmarks
https://mlcommons.org/en/groups/training
Apache License 2.0

RNNT apt-get update Fails in Docker Build #585

Closed coppock closed 2 months ago

coppock commented 1 year ago

I'm trying to run the RNNT training benchmark, but the Docker build fails at apt-get update with a GPG error: apt cannot verify the signature of the NVIDIA CUDA repository because the public key is not available. Is NVIDIA still supporting Ubuntu 18.04 with this repository?

training/rnn_speech_recognition/pytorch$ bash scripts/docker/build.sh
.
.
.
Reading package lists...
W: GPG error: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY A4B469963BF863CC
E: The repository 'https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease' is not signed.
The command '/bin/sh -c apt-get update &&     apt-get install -y libsndfile1 sox git cmake jq &&     apt-get install -y --no-install-recommends numactl &&     rm -rf /var/lib/apt/lists/*' returned a non-zero code: 100
training/rnn_speech_recognition/pytorch$ 
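For readers who hit the same failure, here is a minimal sketch of one possible workaround. It assumes the cause is NVIDIA's 2022 rotation of the CUDA repository signing key, which is consistent with the NO_PUBKEY A4B469963BF863CC message above. The function names are hypothetical, and this is not necessarily what the eventual fix in this thread does.

```shell
#!/bin/sh
# Print the key IDs that apt reports as missing.
# Reads the output of `apt-get update` on stdin.
missing_apt_keys() {
    grep -o 'NO_PUBKEY [0-9A-F]*' | awk '{print $2}' | sort -u
}

# Hypothetical helper: replace the outdated NVIDIA key with the rotated one.
# Assumption: the base image trusts the old key 7fa2af80 and can reach
# developer.download.nvidia.com during the build.
fix_nvidia_key() {
    apt-key del 7fa2af80
    apt-key adv --fetch-keys \
        https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/3bf863cc.pub
}
```

Running `apt-get update 2>&1 | missing_apt_keys` on the failing image should print A4B469963BF863CC. In a Dockerfile, the commands in `fix_nvidia_key` would need to run in a step before the first `apt-get update`.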

coppock commented 1 year ago

Updating the base image from pytorch/pytorch:1.7.0-cuda11.0-cudnn8-devel to pytorch/pytorch:1.12.0-cuda11.3-cudnn8-devel gets past that step, but the build then fails later, while compiling the warp-transducer PyTorch binding:

training/rnn_speech_recognition/pytorch$ bash scripts/docker/build.sh
.
.
.
creating build/temp.linux-x86_64-3.7/src                                        
gcc -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/usr/local/cuda/include -fPIC -I/workspace/deps/warp-transducer/include -I/opt/conda/lib/python3.7/site-packages/torch/include -I/opt/conda/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.7/site-packages/torch/include/TH -I/opt/conda/lib/python3.7/site-packages/torch/include/THC -I/opt/conda/include/python3.7m -c src/binding.cpp -o build/temp.linux-x86_64-3.7/src/binding.o -fPIC -std=c++14 -DWARPRNNT_ENABLE_GPU -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -DTORCH_EXTENSION_NAME=warp_rnnt -D_GLIBCXX_USE_CXX11_ABI=0                      
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
src/binding.cpp:8:14: fatal error: THC.h: No such file or directory
     #include "THC.h"
              ^~~~~~~
compilation terminated.
setup.py:11: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  if LooseVersion(torch.__version__) >= LooseVersion("1.5.0"):
/opt/conda/lib/python3.7/site-packages/setuptools/command/install.py:37: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  setuptools.SetuptoolsDeprecationWarning,
/opt/conda/lib/python3.7/site-packages/setuptools/command/easy_install.py:147: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
  EasyInstallDeprecationWarning,
/opt/conda/lib/python3.7/site-packages/torch/utils/cpp_extension.py:411: UserWarning: Attempted to use ninja as the BuildExtension backend but we could not find ninja.. Falling back to using the slow distutils backend.
  warnings.warn(msg.format('we could not find ninja.'))
error: command '/usr/bin/gcc' failed with exit code 1
The command '/bin/sh -c COMMIT_SHA=f546575109111c455354861a0567c8aa794208a2 &&     git clone https://github.com/HawkAaron/warp-transducer deps/warp-transducer &&     cd deps/warp-transducer &&     git checkout $COMMIT_SHA &&     sed -i 's/set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} -gencode arch=compute_30,code=sm_30 -O2")/#set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} -gencode arch=compute_30,code=sm_30 -O2")/g' CMakeLists.txt &&     sed -i 's/set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} -gencode arch=compute_75,code=sm_75")/set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} -gencode arch=compute_80,code=sm_80")/g' CMakeLists.txt &&     mkdir build &&     cd build &&     cmake .. &&     make VERBOSE=1 &&     export CUDA_HOME="/usr/local/cuda" &&     export WARP_RNNT_PATH=`pwd` &&     export CUDA_TOOLKIT_ROOT_DIR=$CUDA_HOME &&     export D_LIBRARY_PATH="$CUDA_HOME/extras/CUPTI/lib64:$LD_LIBRARY_PATH" &&     export LIBRARY_PATH=$CUDA_HOME/lib64:$LIBRARY_PATH &&     export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH &&     export CFLAGS="-I$CUDA_HOME/include $CFLAGS" &&     cd ../pytorch_binding &&     python3 setup.py install &&     rm -rf ../tests test ../tensorflow_binding &&     cd ../../..' returned a non-zero code: 1
training/rnn_speech_recognition/pytorch$ 
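The fatal error above says that THC.h cannot be found on any of the include paths passed to gcc, even though torch/include/THC is one of them. This suggests the PyTorch build in the newer image does not ship that header. A small sketch of a pre-build check follows; the function name is hypothetical, and the header location is inferred from the -I flags in the log.

```shell
#!/bin/sh
# Return success if the legacy THC header exists under a torch include dir.
# $1: path to torch's include directory (assumed layout: <dir>/THC/THC.h)
has_thc_header() {
    [ -f "$1/THC/THC.h" ]
}
```

For example, `has_thc_header /opt/conda/lib/python3.7/site-packages/torch/include || echo "THC.h missing"` run inside the image would show whether the pinned warp-transducer commit can compile against that PyTorch version before the full build is attempted.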

mikolajblaz commented 1 year ago

Thanks for bringing this up. Here is a PR with a fix: https://github.com/mlcommons/training/pull/586

johntran-nv commented 1 year ago

@coppock while we're figuring out CLA issues in PR 586, could you try those changes out locally to see if you can get past this error?

coppock commented 1 year ago

I just now tested #586, and it allows the build to complete.

ShriyaPalsamudram commented 2 months ago

Closing based on the comment above.