microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Build] Error building with tensorrt on Linux #17991

Closed BengtGustafsson closed 2 weeks ago

BengtGustafsson commented 12 months ago

Describe the issue

When trying to build onnxruntime with --use_tensorrt I basically get the error that cmake --build can't find any of the files that should have been created by the cmake configuration run.

Unfortunately build.py does not provide a way to see the output of cmake if there is no error return. Equally unfortunately, cmake does not seem to return an error code even though it errored out and didn't generate any files.

I hacked build.py to unconditionally print stdout and stderr from the cmake subprocess and got this error message:

NVCC_ERROR =
NVCC_OUT = No such file or directory
CMake Error at /usr/local/share/cmake-3.27/Modules/CMakeDetermineCUDACompiler.cmake:603 (message):
  Failed to detect a default CUDA architecture.
  Compiler output:
Call Stack (most recent call first):
  CMakeLists.txt:674 (enable_language)

Earlier in stderr there was also a complaint about policy CMP0104 being set to "OLD":

CMake Deprecation Warning at CMakeLists.txt:14 (cmake_policy):
  The OLD behavior for policy CMP0104 will be removed from a future version of CMake.

This seems to be related to not setting up any default CUDA architecture.

I don't know if it is normal that building the TensorRT provider requires this.

My initial thought was that building the TensorRT provider without also building the CUDA provider was not supported, as there seemed to be some code that sets the CMAKE_CUDA_ARCHITECTURES variable mentioned in https://cmake.org/cmake/help/latest/policy/CMP0104.html

This didn't help at all.

Now I'm totally at a loss.
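For what it's worth, a minimal diagnostic sketch for this kind of failure ("Failed to detect a default CUDA architecture" together with the empty compiler output often means CMake could not run nvcc at all); the CUDACXX path below is an assumption, adjust it to your install:

    # Check that nvcc is actually reachable inside the container:
    which nvcc && nvcc --version
    # If it is not on PATH, point CMake at it explicitly before configuring:
    export CUDACXX=/usr/local/cuda/bin/nvcc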

Urgency

Medium urgency. There is still time before our deadline; we just want to be able to build onnxruntime on all platforms with the best provider for each graphics card make. We don't know what make our customers have, so we need one onnxruntime library that supports them all, which does not seem to exist as a precompiled binary.

Target platform

Linux 64-bit, Ubuntu 20.04, using an NVIDIA Docker image

Build script

build.py --build_dir /src/onnxruntime/build/Linux --config Release --parallel 4 --build_shared_lib --build_dir build_gpu/Linux --skip_tests --use_tensorrt --cuda_version=11.8 --cuda_home /usr/local/cuda --cudnn_home /usr/local/cuda --tensorrt_home /usr/include/x86_64-linux-gnu

Note that the --tensorrt_home directory is what I could guess after running apt install tensorrt-dev. This is not documented, and the frustrated questions found on the web have not been answered. It would be appreciated if the installation instructions were a bit more elaborate than "install tensorrt". It took me days to figure out that what you meant was tensorrt-dev and then to find the directory with the header files.

I don't think that this directory selection is the root cause in this case though.
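One hedged way to check what the Debian packages actually installed (this assumes the headers come from libnvinfer-dev, which tensorrt-dev pulls in):

    dpkg -L libnvinfer-dev | grep -E 'NvInfer|libnvinfer'
    # On Ubuntu the headers typically land in /usr/include/x86_64-linux-gnu
    # and the libraries in /usr/lib/x86_64-linux-gnu.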

Error / output

NVCC_ERROR =
NVCC_OUT = No such file or directory
CMake Error at /usr/local/share/cmake-3.27/Modules/CMakeDetermineCUDACompiler.cmake:603 (message):
  Failed to detect a default CUDA architecture.
  Compiler output:
Call Stack (most recent call first):
  CMakeLists.txt:674 (enable_language)

This is printed to stderr by the cmake configure step, but it does not stop build.py from running cmake --build, which of course fails as there are no files for it to act on.

I also printed all the arguments you send to cmake, but I have no idea what to look for, so here you go:

Namespace(acl_home=None, acl_libs=None, allow_running_as_root=False, android=False, android_abi='arm64-v8a', android_api=27, android_cpp_shared=False, android_ndk_path='', android_run_emulator=False, android_sdk_path='', apple_deploy_target=None, arm=False, arm64=False, arm64ec=False, armnn_bn=False, armnn_home=None, armnn_libs=None, armnn_relu=False, build=False, build_apple_framework=False, build_csharp=False, build_dir='build_gpu/Linux', build_java=False, build_micro_benchmarks=False, build_nodejs=False, build_nuget=False, build_objc=False, build_shared_lib=True, build_wasm=False, build_wasm_static_lib=False, build_wheel=False, cann_home=None, clean=False, cmake_extra_defines=None, cmake_generator=None, cmake_path='cmake', code_coverage=False, compile_no_warning_as_error=False, config=['Release'], ctest_path='ctest', cuda_home='/usr/local/cuda', cuda_version='11.8', cudnn_home='/usr/local/cuda', disable_contrib_ops=False, disable_exceptions=False, disable_memleak_checker=False, disable_ml_ops=False, disable_rtti=False, disable_types=[], disable_wasm_exception_catching=False, dml_external_project=False, dml_path='', dnnl_gpu_runtime='', dnnl_opencl_root='', eigen_path=None, emscripten_settings=None, emsdk_version='3.1.44', enable_cuda_line_info=False, enable_cuda_profiling=False, enable_external_custom_op_schemas=False, enable_language_interop_ops=False, enable_lazy_tensor=False, enable_lto=False, enable_memory_profile=False, enable_msinternal=False, enable_msvc_static_runtime=False, enable_nccl=False, enable_nvtx_profile=False, enable_onnx_tests=False, enable_pybind=False, enable_reduced_operator_type_support=False, enable_rocm_profiling=False, enable_symbolic_shape_infer_tests=False, enable_training=False, enable_training_apis=False, enable_training_ops=False, enable_transformers_tool_test=False, enable_wasm_api_exception_catching=False, enable_wasm_debug_info=False, enable_wasm_exception_throwing_override=True, enable_wasm_profiling=False, enable_wasm_simd=False, enable_wasm_threads=False, enable_wcos=False, extensions_overridden_path=None, external_graph_transformer_path=None, fuzz_testing=False, gdk_edition='.', gdk_platform='Scarlett', gen_api_doc=False, gen_doc=None, include_ops_by_config=None, ios=False, ios_sysroot='', ios_toolchain_file='', llvm_config='', llvm_path=None, migraphx_home=None, minimal_build=None, mpi_home=None, ms_experimental=False, msbuild_extra_options=None, msvc_toolset=None, nccl_home=None, nnapi_min_api=None, numpy_version=None, nvcc_threads=-1, osx_arch='x86_64', parallel=4, path_to_protoc_exe=None, qnn_home=None, rocm_home=None, rocm_version=None, skip_keras_test=False, skip_nodejs_tests=False, skip_onnx_tests=False, skip_submodule_sync=False, skip_tests=True, skip_winml_tests=False, snpe_root=None, target=None, tensorrt_home='/usr/include/x86_64-linux-gnu', test=False, test_all_timeout='10800', tvm_cuda_runtime=False, update=False, use_acl=None, use_armnn=False, use_azure=False, use_cache=False, use_cann=False, use_coreml=False, use_cuda=False, use_dml=False, use_dnnl=False, use_extensions=False, use_full_protobuf=False, use_gdk=False, use_jsep=False, use_lock_free_queue=False, use_migraphx=False, use_mimalloc=False, use_mpi=False, use_nnapi=False, use_openvino=None, use_preinstalled_eigen=False, use_qnn=False, use_rknpu=False, use_rocm=False, use_snpe=False, use_telemetry=False, use_tensorrt=True, use_tensorrt_builtin_parser=True, use_tensorrt_oss_parser=False, use_triton_kernel=False, use_tvm=False, use_tvm_hash=False, use_vitisai=False, use_webnn=False, use_winml=False, use_xnnpack=False, wasm_malloc=None, wasm_run_tests_in_browser=False, wheel_name_suffix=None, windows_sdk_version=None, winml_root_namespace_override=None, x86=False, xcode_code_signing_identity='', xcode_code_signing_team_id='')

Failed to import psutil. Please pip install psutil for better estimation of nvcc threads. Use nvcc_threads=1
Making dir: build_gpu/Linux/Release
Calling cmake with 91 arguments.

Visual Studio Version

No response

GCC / Compiler Version

gcc 10.4, cmake 3.27, python 3.10, running in a docker image FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04

BengtGustafsson commented 12 months ago

Mobile was wrongly guessed by the bot. This is on Linux/desktop.

jywu-msft commented 12 months ago

FYI, to run cmake without building you can omit --build and only pass --update to the build script. --update --build will generate makefiles and build (but not test), and the default for the build script is --update --build --test. The reason we link to the TensorRT instructions in our build instructions is that there are different ways to install TensorRT (on different platforms) and the other NVIDIA dependencies (CUDA, cuDNN, etc.). Can you use this dockerfile as a reference? You can see how it installs the TensorRT packages at https://github.com/microsoft/onnxruntime/blob/35ecce45496ee752bc3f85618eba713bb50e6069/tools/ci_build/github/linux/docker/Dockerfile.ubuntu_cuda11_8_tensorrt8_6#L34
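A hedged sketch of the difference on the command line (the flags other than --update/--build are carried over from the build script in the issue and are otherwise assumptions):

    # Configure only: generate the cmake build tree, no compile, no tests.
    python3 tools/ci_build/build.py --build_dir build_gpu/Linux --config Release \
        --build_shared_lib --use_tensorrt --update

    # Configure and compile, but do not run tests.
    python3 tools/ci_build/build.py --build_dir build_gpu/Linux --config Release \
        --build_shared_lib --use_tensorrt --update --build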

jywu-msft commented 12 months ago

BTW, I suspect the source of your issue is that you're installing tensorrt-dev without specifying a version, so I think it's picking up the latest version, which is built against CUDA 12.x. That may conflict with or update the base image's 11.8 installation, which is why cmake isn't able to figure out which CUDA compiler to use. You can confirm in your logs that it is doing this (you will probably see it installing TensorRT along with dependencies on CUDA 12.x packages).

jywu-msft commented 12 months ago

The TensorRT install instructions are at https://docs.nvidia.com/deeplearning/tensorrt/install-guide/index.html#installing

The relevant section states:

"When using the CUDA network repository, Ubuntu will by default install TensorRT for the latest CUDA version. The following commands will install libnvinfer8 and related TensorRT packages for an older CUDA version and hold the libnvinfer8 package at this version. Replace 8.x.x.x with your version of TensorRT and cudax.x with your CUDA version for your install.

version="8.x.x.x-1+cudax.x"
sudo apt-get install tensorrt-dev=${version}
sudo apt-mark hold tensorrt-dev

If you want to upgrade to the latest version of TensorRT or the latest version of CUDA, then you can unhold the libnvinfer-dev package using the following command.

sudo apt-mark unhold tensorrt-dev

You may need to repeat these steps for libcudnn8 to prevent cuDNN from being updated to the latest CUDA version. Refer to the NVIDIA TensorRT Release Notes for the specific version of cuDNN that was tested with your version of TensorRT. Example commands for downgrading and holding the cuDNN version can be found in Upgrading TensorRT. Refer to the NVIDIA cuDNN Installation Guide for additional information.

If the CUDA network repository and a TensorRT local repository are enabled at the same time you may observe package conflicts with either TensorRT or cuDNN. You will need to configure APT so that it prefers local packages over network packages. You can do this by creating a new file at /etc/apt/preferences.d/local-repo with the following lines:

Package: *
Pin: origin ""
Pin-Priority: 1001

Note: This preference change will affect more than just TensorRT in the unlikely event that you have other repositories which are also not downloaded over HTTP(S). To revert APT to its original behavior simply remove the newly created file."
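For the CUDA 11.8 case discussed in this thread, that works out to something like the following. The exact TensorRT version string is an assumption; check what your repository actually offers, for example with apt-cache madison tensorrt-dev:

    version="8.6.1.6-1+cuda11.8"
    sudo apt-get install tensorrt-dev=${version}
    sudo apt-mark hold tensorrt-dev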

BengtGustafsson commented 11 months ago

After I adjusted the TensorRT install to be for CUDA 11.8, I get a compilation that seems to start but never finishes (with "never" defined as 8 hours). After that GitLab kills the job and I don't get the printouts I added to build.py, so I added an internal timeout of 2 hours to build.py, and now I see that processing is killed while working on this file:

[ 81%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/src/onnxruntime/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_hdim224_fp16_sm80.cu.o

Unfortunately there are no timestamps, so I don't know if it actually took 2 hours to get there. I will retry with a longer timeout.
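A hedged way to get per-line timestamps on the build output without touching build.py (the "..." stands for the existing build.py arguments; forking date once per line is slow but fine for diagnosis):

    python3 tools/ci_build/build.py ... 2>&1 | while IFS= read -r line; do
        printf '[%s] %s\n' "$(date +%H:%M:%S)" "$line"
    done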

But I also see this in the log: NVCC_ERROR = nvcc fatal : Unknown option '-Wstrict-aliasing'

Could this be related, or should it be disregarded even though it says fatal? I found one other bug report (#12922) that had a log pasted with this content, but the discussion there does not mention it.

jywu-msft commented 11 months ago


Does it consistently hang on compiling that object file? If you adjust your timeout to 3 hours, is it the same? What are the specs of your build node? What is the GPU type? @tianleiwu have you seen a build hang on building flash_fwd_hdim224_fp16_sm80.cu.o before?

tianleiwu commented 11 months ago

@BengtGustafsson,

Example scripts to build (I build only one SM and disable tests, and it usually takes only a few minutes to build):

Linux: https://github.com/microsoft/onnxruntime/blob/26a7b63716e3125bfe35fe3663ba10d2d7322628/build_release.sh
Windows: https://github.com/microsoft/onnxruntime/blob/8df5f4e0df1f3b9ceeb0f1f2561b09727ace9b37/build_trt.cmd

If your build fails in flash attention, usually the cause is running out of memory. Since you are using nvcc_threads=1 (I saw the message "Failed to import psutil. Please pip install psutil for better estimation of nvcc threads. Use nvcc_threads=1"), you might need to limit the parallel jobs (https://github.com/microsoft/onnxruntime/blob/2c6b31c5aa05bdce26ccd1af58bb194f880166ed/tools/ci_build/build.py#L161), or let docker use more memory (at least 32 GB), or try a machine with more memory.
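A hedged sketch of what that looks like on the command line; the specific values are illustrative, not a recommendation from this thread:

    # Let build.py estimate nvcc threads itself:
    pip install psutil
    # Fewer parallel jobs and a single nvcc thread lower peak memory during the
    # flash-attention CUDA compiles, at the cost of a slower build:
    python3 tools/ci_build/build.py ... --parallel 2 --nvcc_threads 1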

BengtGustafsson commented 11 months ago

I pip installed psutil in the docker container and allowed it 46 GB. Compilation still hangs forever. So I changed back to the unpatched onnxruntime repo to see if I had messed something up, but it still hangs, and now I don't see anything as the output is not captured.
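One hedged way to check the memory limit that actually applies inside the container (the cgroup path depends on whether the host uses cgroup v1 or v2); the image name is a placeholder:

    docker run --rm -m 46g <image> sh -c \
        'cat /sys/fs/cgroup/memory.max 2>/dev/null || cat /sys/fs/cgroup/memory/memory.limit_in_bytes'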

Overall, building onnxruntime is extremely brittle, and when it fails it is impossible to figure out why.

I would love to use precompiled binaries, but I can't, as you don't provide any binaries that can use GPUs of any make, which we need because we don't know what hardware our customers have. I don't actually understand why onnxruntime.dll/.so can't always support all providers, so that we could then download provider packages that work with any onnxruntime library. I have now spent man-months trying to compile this (after a consultant tried for 6 months with limited success). And then I haven't even started on the worst platforms we need it to work on, Android and iOS. There must be a better way!

BengtGustafsson commented 11 months ago

I get this:

tools_python_utils [INFO] - flatbuffers module is not installed. parse_config will not be available

Which I treated as a warning. I have no idea what parse_config is and I don't know if it wants me to pip install flatbuffers or apt install flatbuffers. My hope was that maybe this has to do with not being able to show any error messages when cmake fails, but that's clutching at straws really.

jywu-msft commented 11 months ago


That message is coming from https://github.com/microsoft/onnxruntime/blob/dabd395fdfdb3c5edf91d3b515bab00744b63c60/tools/python/util/__init__.py#L14. You can pip install flatbuffers, but this seems unrelated to your build issues since it's just an INFO message.

Is it still hanging while building the same object file, flash_fwd_hdim224_fp16_sm80? Do you have any logs from when you had psutil installed? How many nvcc_threads did it configure? The relevant section @tianleiwu was referring to is at https://github.com/microsoft/onnxruntime/blob/dabd395fdfdb3c5edf91d3b515bab00744b63c60/tools/ci_build/build.py#L881. It tries to estimate the number of nvcc_threads to use, but needs psutil to do so; otherwise it defaults to nvcc_threads=1.

There are 2 other experiments you can try:

1) Build with --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES=80 (see the sketch below). This builds a single CUDA architecture, which should speed up build times.
2) If that still doesn't work, build with --disable_contrib_ops. That disables building the contrib ops (you mentioned the hang was while building flash_fwd_hdim224_fp16_sm80.cu.o). Then we can try to narrow it down some more.
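A hedged sketch of experiment 1; all flags except the last one are carried over from the build script in the issue:

    python3 tools/ci_build/build.py --build_dir build_gpu/Linux --config Release \
        --parallel 4 --build_shared_lib --skip_tests --use_tensorrt \
        --cuda_version=11.8 --cuda_home /usr/local/cuda --cudnn_home /usr/local/cuda \
        --tensorrt_home /usr/include/x86_64-linux-gnu \
        --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES=80
    # For experiment 2, append --disable_contrib_ops.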

BengtGustafsson commented 11 months ago

I've had so many different issues with this. I'm still not even sure that I was able to give docker more memory. I put a free -m in the docker run step, but apparently it reports the memory of the host, not of the container. So now I'm running with a hard limit of nvcc_threads = 1 that I hacked into your build.py in my fork for testing. Before that, the latest run ended with this:

Finished fetching external dependencies
NVCC_ERROR = nvcc fatal : Unknown option '-Wstrict-aliasing'
NVCC_OUT = 1

This is the last output on the stderr stream of the cmake config step. cmake then returns 0, so build.py continues with the cmake --build step, which hangs without any output for 8 hours despite the 2-hour timeout I give to Python's subprocess.run; cmake seems not to respond to signals while building. It's fairly frustrating.

I have asked about that NVCC_ERROR before. It would seem that our nvcc is too old, but that doesn't explain why it hangs. Our docker image starts "FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04". We then explicitly install gcc 11.2. I hope they haven't removed this option. We've had strange errors before where it turned out that the problem was not actually in the CUDA compiler itself but in the host compiler it launched.

From this: https://itecnote.com/tecnote/c-nvcc-strange-interaction-with-xcompiler/ it seems that the option may be directed to the wrong compiler. Well, I grepped for strict-alias in your .py and .cmake files and came up with a few no-strict-aliasing and a few strict-aliasing hits in nlohmann_json's cmake files, but I hope you are not compiling JSON parsers with nvcc...
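A hedged illustration of the forwarding problem described at that link (foo.cu is a placeholder source file):

    # Passed directly, nvcc rejects host-compiler-only warning flags:
    nvcc -Wstrict-aliasing -c foo.cu
    # Forwarded with -Xcompiler, the flag goes to the host gcc instead and is accepted:
    nvcc -Xcompiler -Wstrict-aliasing -c foo.cu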

Now I rebuilt again and got another error, possibly related to the new Eigen version that was introduced magically because your deps.txt didn't point to a fixed file. I blindly updated the hash to the "actual" one yesterday in my fork, and this version causes an error:

In file included from /src/onnxruntime/build_gpu/Linux/Release/_deps/eigen-src/unsupported/Eigen/CXX11/Tensor:59,
                 from /src/onnxruntime/include/onnxruntime/core/common/eigen_common_wrapper.h:64,
                 from /src/onnxruntime/onnxruntime/core/common/threadpool.cc:22:
/src/onnxruntime/build_gpu/Linux/Release/_deps/eigen-src/unsupported/Eigen/CXX11/src/Tensor/TensorMeta.h:266:91: error: ‘First’ was not declared in this scope; did you mean ‘first’?
  266 | array<Index, 1 + sizeof...(Is)> customIndices2Array(IndexType& idx, numeric_list<Index, First, Is...>) {
      |                                                                                           ^~~~~
      |                                                                                           first

My guess is that main branch already has a fix for this, so I will merge it into my fork again.

jywu-msft commented 11 months ago


I don't think your nvcc is too old; CUDA 11.8 is fully supported. As I mentioned previously, can you try using our dockerfile as a reference? https://github.com/microsoft/onnxruntime/blob/main/tools/ci_build/github/linux/docker/Dockerfile.ubuntu_cuda11_8_tensorrt8_6 It is what we use for our CI pipelines. First you can confirm that it builds OK in your environment; afterwards you can make modifications and see where things are breaking.
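A hedged sketch of that workflow; the image tag and mount paths are placeholders, and the Dockerfile may require additional --build-arg values:

    docker build -t ort-cuda118-trt86 \
        -f tools/ci_build/github/linux/docker/Dockerfile.ubuntu_cuda11_8_tensorrt8_6 .
    docker run --rm -v "$PWD":/onnxruntime -w /onnxruntime ort-cuda118-trt86 \
        ./build.sh --config Release --build_shared_lib --use_tensorrt --skip_tests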

BengtGustafsson commented 10 months ago

I finally got this working. I can't really tell at this point what change made it work; too much went on in the process.

github-actions[bot] commented 9 months ago

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

BengtGustafsson commented 2 weeks ago

Works now.