microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Build] Build onnxruntime for tensorrt failed on rtx 4060 gpu #22382

Open irrikrlla opened 1 week ago

irrikrlla commented 1 week ago

Describe the issue

I got large-scale test failures during the test step of the build. The full failure log is attached below as LastTest.log.

I've tried TensorRT versions from 10.2 through 10.5 to build onnxruntime-gpu with TensorRT enabled, but none of them succeeded. LastTest.log

Urgency

No response

Target platform

Windows 11 x64, RTX 4060, CUDA 12.4

Build script

.\build.bat --parallel --cmake_generator "Visual Studio 17 2022" --use_tensorrt --cudnn_home "C:\Program Files\NVIDIA GPU Computing Toolkit\cudnn-cuda12-archive" --cuda_home "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4" --tensorrt_home "C:\Program Files\NVIDIA GPU Computing Toolkit\TensorRT-10.2.0.19" --build_wheel

Error / output

71% tests passed, 2 tests failed out of 7

Total Test time (real) = 1291.41 sec

The following tests FAILED:
  1 - onnxruntime_test_all (Failed)
  4 - onnxruntime_shared_lib_test (Failed)
Errors while running CTest
Output from these tests are in: D:/onnxruntime/build/Windows/Debug/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.
Traceback (most recent call last):
  File "D:\onnxruntime\tools\ci_build\build.py", line 2977, in <module>
    sys.exit(main())
  File "D:\onnxruntime\tools\ci_build\build.py", line 2874, in main
    run_onnxruntime_tests(args, source_dir, ctest_path, build_dir, configs)
  File "D:\onnxruntime\tools\ci_build\build.py", line 2062, in run_onnxruntime_tests
    run_subprocess(ctest_cmd, cwd=cwd, dll_path=dll_path)
  File "D:\onnxruntime\tools\ci_build\build.py", line 866, in run_subprocess
    return run(*args, cwd=cwd, capture_stdout=capture_stdout, shell=shell, env=my_env)
  File "D:\onnxruntime\tools\python\util\run.py", line 49, in run
    completed_process = subprocess.run(
  File "D:\Python311\Lib\subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['D:\Python311\Scripts\ctest.EXE', '--build-config', 'Debug', '--verbose', '--timeout', '10800']' returned non-zero exit status 8.

Visual Studio Version

17.11.4

GCC / Compiler Version

CMake 3.30.4; MSVC 14.40.33807

jywu-msft commented 1 week ago

Can you build with --skip_tests to work around the test failures and allow your build to succeed?
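
For reference, a sketch of the same build command with the flag added (paths exactly as in the original report; --skip_tests tells build.py not to run the unit tests after building):

.\build.bat --parallel --cmake_generator "Visual Studio 17 2022" --use_tensorrt --cudnn_home "C:\Program Files\NVIDIA GPU Computing Toolkit\cudnn-cuda12-archive" --cuda_home "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4" --tensorrt_home "C:\Program Files\NVIDIA GPU Computing Toolkit\TensorRT-10.2.0.19" --build_wheel --skip_tests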

I see at least one test failure related to EPContext. It looks like an engine version mismatch: the test seems to be trying to deserialize an engine with a higher version than the TRT version in use. Which version of TensorRT did you build with, and what version is in your PATH?
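
(One quick way to answer the PATH question, a sketch assuming the TensorRT libraries follow the usual nvinfer* DLL naming on Windows:

where nvinfer*.dll

This lists every nvinfer DLL that Windows will resolve from PATH, so you can spot a second TRT installation shadowing the one you built against.)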

@chilo-ms, that's what the message below means, right?

[ RUN      ] TensorrtExecutionProviderTest.EPContextNode
2024-10-10 09:10:00.8354002 [E:onnxruntime:Default, tensorrt_execution_provider.h:88 onnxruntime::TensorrtLogger::log] [2024-10-10 01:10:00 ERROR] IRuntime::deserializeCudaEngine: Error Code 1: Serialization (Serialization assertion stdVersionRead == kSERIALIZATION_VERSION failed. Version tag does not match. Note: Current Version: 238, Serialized Engine Version: 239)
D:\onnxruntime\onnxruntime\test\providers\tensorrt\tensorrt_basic_test.cc(456): error: Value of: status.IsOK()
  Actual: false
Expected: true
Stack trace:
  00007FF69BEB32F2: onnxruntime::test::TensorrtExecutionProviderTest_EPContextNode_Test::TestBody
  00007FF69ECB641D: testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test,void>
  00007FF69ECB6053: testing::internal::HandleExceptionsInMethodIfSupported<testing::Test,void>
  00007FF69EC83CDC: testing::Test::Run
  00007FF69EC84994: testing::TestInfo::Run
  ... Google Test internal frames ...

[ FAILED ] TensorrtExecutionProviderTest.EPContextNode (563 ms)

chilo-ms commented 1 week ago

@chilo-ms, that's what the message below means, right?

That's right. It's likely that the test has been run many times with different versions of TensorRT, and an engine cache created by one version of TRT was then used by another version of TRT in a later test run.

As @jywu-msft suggested, you can simply add --skip_tests to skip the tests, or make sure every build is a clean build so that the tests won't pick up the "old" engine cache.
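
One way to guarantee a clean build on Windows, assuming the default build-directory layout shown in the log above, is to delete the build output (which should also remove any engine caches left over from earlier test runs) and then rebuild:

rmdir /S /Q D:\onnxruntime\build\Windows
.\build.bat ... (same arguments as before)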

chilo-ms commented 1 week ago

Please note that the engine cache created by the TRT EP has the TRT version hashed into the hash value in its file name, e.g. TensorrtExecutionProvider_TRTKernel_graph_mxnet_converted_model_12951946405796672126_0_0_sm80.engine

The TRT EP selects the specific engine cache based on information such as the model, the CUDA version, the TRT version, and so on; this way it can avoid using the "wrong" engine cache created by another version of TRT.

The TRT EP assumes the TRT version that it was built against at compile time. However, users can switch to a different TRT version at run time, and that can cause issues because the TRT EP always checks against the "fixed" compile-time TRT version, not the TRT version it is actually running with. @jywu-msft
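
To illustrate the failure mode outside of ORT, a minimal sketch with the TensorRT Python API (the cache file name here is hypothetical, shortened from the example above):

import tensorrt as trt  # must be the same TRT version that serialized the engine

logger = trt.Logger(trt.Logger.ERROR)
runtime = trt.Runtime(logger)

# Hypothetical engine-cache file produced by an earlier test run.
with open("TensorrtExecutionProvider_TRTKernel_graph_example_0_0_sm80.engine", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

# deserialize_cuda_engine returns None (and logs the kSERIALIZATION_VERSION
# mismatch seen above) when the engine was serialized by a different TRT version.
if engine is None:
    raise RuntimeError("stale engine cache: built with a different TensorRT version; delete it and rebuild")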