microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

ORT+TensorRT build, "--config Debug" works but "--config Release" failed #9934

yetingqiaqia opened this issue 2 years ago (status: Open)

yetingqiaqia commented 2 years ago

Describe the bug

Hi, I am trying to build ORT+TensorRT from scratch by following this wiki: https://onnxruntime.ai/docs/build/eps.html#tensorrt

Below command finished successfully:

./build.sh --cudnn_home /usr/lib/x86_64-linux-gnu/ --cuda_home /usr/local/cuda/ --use_cuda \
--use_tensorrt --tensorrt_home /usr/src/tensorrt/ \
--build_wheel

But it generated the .whl file in the build/Linux/Debug/dist folder. However, I want a Release build, so I searched the ORT wiki and appended "--config Release" to the command, like below:

./build.sh --cudnn_home /usr/lib/x86_64-linux-gnu/ --cuda_home /usr/local/cuda/ --use_cuda \
--use_tensorrt --tensorrt_home /usr/src/tensorrt/ \
--build_wheel --config Release

However, it failed with the error below (from the attached screenshot):

The CMAKE_CUDA_COMPILER: /usr/bin/nvcc is not a full path to an existing compiler tool.

I carefully compared the two build summaries; the only difference is the CUDA compiler: the Debug build finds /usr/local/cuda/bin/nvcc, while the Release build finds /usr/bin/nvcc.

Here, /usr/local/cuda/bin/nvcc is the correct one (to be precise, it is a symlink that points to /usr/local/cuda-11.5/bin/nvcc), while /usr/bin/nvcc is just a quite old one.

Fixes that didn't work: I checked, and /usr/local/cuda/bin is already in $PATH:

/usr/local/nvm/versions/node/v14.5.0/bin:/usr/local/mpi/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/ucx/bin:/opt/tensorrt/bin

I also tried export CUDACXX=/usr/local/cuda/bin/nvcc. It doesn't work either.
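For reference, the two attempted fixes above can be sketched as shell commands (paths are the ones from this report; adjust for your system):

```shell
# Make sure the desired CUDA toolchain wins over /usr/bin/nvcc.
case ":$PATH:" in
  *:/usr/local/cuda/bin:*) ;;                    # already on PATH, nothing to do
  *) export PATH="/usr/local/cuda/bin:$PATH" ;;  # prepend so it is found first
esac

# Explicit hint for CMake's CUDA compiler detection
# (this did not help in the reporter's case).
export CUDACXX=/usr/local/cuda/bin/nvcc
```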

I fixed it by setting CMAKE_CUDA_COMPILER in ./cmake/CMakeLists.txt (screenshot omitted).
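Based on the `set(CMAKE_CUDA_COMPILER ...)` line quoted later in this thread, the workaround looks roughly like this (a sketch; note that CMake only honors CMAKE_CUDA_COMPILER if it is set before CUDA is enabled as a language):

```cmake
# Workaround sketch: force the CUDA compiler explicitly.
# Path is the one from this report; adjust to your installation.
set(CMAKE_CUDA_COMPILER /usr/local/cuda/bin/nvcc)
```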

However, one thing feels quite weird: why can --config Debug find the correct nvcc, while --config Release can't? I believe this is a bug. Please help investigate. Thanks.

System information: I don't think the system information matters for this bug, but I provided it below anyway:

Expected behavior: I expect "--config Release" to find the same nvcc as "--config Debug".

jywu-msft commented 2 years ago

Please use TensorRT 8.0, as the build instructions indicate. If you want to build with TensorRT 8.2, you will need to update the onnx-tensorrt submodule to the 8.2-GA branch.

Not sure why it's not finding the correct nvcc compiler. You seem to have multiple CUDA versions installed; perhaps adding the build option --cuda_version=11.5 would help. We actually haven't tested with CUDA 11.5 yet; our build pipelines still use CUDA 11.4.x.

yetingqiaqia commented 2 years ago

@jywu-msft, but it is so weird that "--config Debug" actually works. As for the build option --cuda_version=11.5, I tried it before; it didn't work. For me, I have already fixed the build issue by adding set(CMAKE_CUDA_COMPILER /usr/local/cuda/bin/nvcc) in ./cmake/CMakeLists.txt. This ticket is just a bug report to the ORT team.

jywu-msft commented 2 years ago

> @jywu-msft, but it is so weird that "--config Debug" actually works. As for the build option --cuda_version=11.5, I tried it before; it didn't work. For me, I have already fixed the build issue by adding set(CMAKE_CUDA_COMPILER /usr/local/cuda/bin/nvcc) in ./cmake/CMakeLists.txt. This ticket is just a bug report to the ORT team.

yes, thanks for reporting the issue. we'll need to reproduce it on our end and investigate further.

jywu-msft commented 2 years ago

Looking again at the image you posted above, it shows "ptxas fatal : Value 'sm_30' is not defined for option 'gpu-name'". sm_30 corresponds to the Kepler K series; I wonder why it is building against that architecture, since you have a V100, which should be sm_70. Also, your Release vs. Debug snapshots show different C++ compiler versions (7.4.0 vs. 7.5.0); is there anything else different in your environment between those two builds?
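If the build keeps targeting the wrong GPU architecture, one possible workaround (a sketch, not a command from this thread: --cmake_extra_defines passes definitions through to CMake, and CMAKE_CUDA_ARCHITECTURES=70 selects sm_70 for a V100) would be:

```shell
# Sketch: pin the CUDA architecture explicitly for a V100 (sm_70).
./build.sh --cudnn_home /usr/lib/x86_64-linux-gnu/ --cuda_home /usr/local/cuda/ \
  --use_cuda --use_tensorrt --tensorrt_home /usr/src/tensorrt/ \
  --build_wheel --config Release \
  --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES=70
```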

FWIW, I tested the combination of TRT 8.2 / CUDA 11.5 / cuDNN 8.3 on Windows and didn't encounter this issue. I will need to try Linux next.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.