tensorflow / tensorrt

TensorFlow/TensorRT integration
Apache License 2.0

Crash in TensorRT when converting ONNX models to PLAN in parallel. #277

Closed duchy closed 3 years ago

duchy commented 3 years ago

The crash below happened while converting ONNX models to PLAN format from multiple threads. Environment: TensorRT 7.2.2.3, CUDA 11.1, Windows 10 Pro.

TensorRT interface used:
Create the parser with: TENSORRTAPI IParser* createParser(nvinfer1::INetworkDefinition& network, nvinfer1::ILogger& logger)
Parse the ONNX buffer with: bool nvonnxparser::IParser::parse(const void *serialized_onnx_model, size_t serialized_onnx_model_size)

The crash log follows:

CONTEXT:  (.ecxr)
rax=0000000000000000 rbx=00000000c0000374 rcx=0000000000000000
rdx=000000ad7edf8360 rsi=0000000000000001 rdi=00007ffb940d77f0
rip=00007ffb9406f0b9 rsp=000000ad7edf8960 rbp=0000000000000000
 r8=00007ffb11f9ff98  r9=000001dcf7b306c0 r10=00007ffb11f9d3d7
r11=000000ad7edf7ca0 r12=0000000000000000 r13=000001dc0abf72d0
r14=000001dc0abf72c0 r15=0000000000000001
iopl=0         nv up ei pl nz na po nc
cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000206
ntdll!RtlReportFatalFailure+0x9:
00007ffb`9406f0b9 eb00            jmp     ntdll!RtlReportFatalFailure+0xb (00007ffb`9406f0bb)
Resetting default scope

EXCEPTION_RECORD:  (.exr -1)
ExceptionAddress: 00007ffb9406f0b9 (ntdll!RtlReportFatalFailure+0x0000000000000009)
   ExceptionCode: c0000374
  ExceptionFlags: 00000001
NumberParameters: 1
   Parameter[0]: 00007ffb940d77f0

PROCESS_NAME:  Application.exe

ERROR_CODE: (NTSTATUS) 0xc0000374 - <Unable to get error code text>

EXCEPTION_CODE_STR:  c0000374

EXCEPTION_PARAMETER1:  00007ffb940d77f0

STACK_TEXT:  
000000ad`7edf8960 00007ffb`9406f083     : 00007ffb`46145c7c 000000ad`7edff2c0 000001dd`4c1b4c90 00007ffb`00000006 : ntdll!RtlReportFatalFailure+0x9
000000ad`7edf89b0 00007ffb`94077e02     : 00000000`00000005 00007ffb`940d77f0 00000000`00000003 000001dc`0e450000 : ntdll!RtlReportCriticalFailure+0x97
000000ad`7edf8aa0 00007ffb`940780ea     : 00000000`00000003 00000000`00000000 000001dc`0e450000 00007ffb`11e1ccfc : ntdll!RtlpHeapHandleError+0x12
000000ad`7edf8ad0 00007ffb`9407dd71     : 000001dc`0e450000 000001dc`0e450000 000000ad`7edfbb70 00000000`00000007 : ntdll!RtlpHpHeapHandleError+0x7a
000000ad`7edf8b00 00007ffb`94017102     : 000001dc`0abf74b0 000000ad`7edf8bb9 000000ad`7edf8d80 00000000`0000000f : ntdll!RtlpLogHeapFailure+0x45
000000ad`7edf8b30 00007ffb`93f947b1     : 0000232f`c687da2e 000001dc`0e450000 000000ad`7edf8d90 00000000`00000000 : ntdll!RtlpFreeHeapInternal+0x819f2
000000ad`7edf8bf0 00007ffb`13039a94     : 000000ad`7edf8db0 000001dd`14dd2c40 000001dc`8d1dc010 00007ffb`122159d6 : ntdll!RtlFreeHeap+0x51
000000ad`7edf8c30 00007ffb`11e18ca1     : 000001dd`14dd2c40 000001dc`00000000 00000000`0000003f 000001dc`8d1e0c00 : nvinfer!cask_trt::WeightGradientShader::isNhwcOutput+0x38c004
000000ad`7edf8c60 00007ffb`11e7c876     : 000001dc`8d1dc010 00000000`0000001f 000000ad`7edf8d90 000001dc`0b27ec88 : nvinfer+0xb8ca1
000000ad`7edf8c90 00007ffb`11e7dd3b     : 00000000`00000000 000001dc`0b27b328 000000ad`7edf9100 000001dc`0b27b328 : nvinfer!cask_trt::ShaderList<cask_trt::LinkableConvShader,cask_trt::Convolution>::end+0x1a56
000000ad`7edf9000 00007ffb`11f2b512     : 000001dd`14dd2ac0 00007ffb`93f947b1 000001dc`f79d5c80 000001dd`14dd2ac0 : nvinfer!cask_trt::TensorDesc::getDim+0x97b
000000ad`7edf9910 00007ffb`11eb1062     : 000000ad`7edf99f0 000001dc`f7b20b30 000001dd`14dd2ac0 000000ad`7edf9c70 : nvinfer!cask_trt::PoolingShader::outputScalarsPerElement+0x3ca92
000000ad`7edf9990 00007ffb`11eb96bd     : 000000ad`7edfa918 00000000`00000010 000001dc`f7b2fa50 000001dc`f7b2fa50 : nvinfer!cask_trt::TensorDesc::getDim+0x33ca2
000000ad`7edfa860 00007ffb`11fb12d6     : 00007ffb`15880168 000000ad`7edfac60 000001dd`14dd2fa0 000000ad`7edfab90 : nvinfer!cask_trt::TensorDesc::getDim+0x3c2fd
000000ad`7edfaa70 00007ffb`11fafd4e     : 000000ad`7edfadc0 000001dc`f793f7a8 000001dc`f793f7a8 000001dc`f7d34700 : nvinfer!cask_trt::Shader::getKernelInfo+0x22946
000000ad`7edfad90 00007ffb`11ebba5b     : 000000ad`7edfb7c0 000000ad`7edfd3e0 000000ad`7edfd3e0 000000ad`7edfd3e0 : nvinfer!cask_trt::Shader::getKernelInfo+0x213be
000000ad`7edfaea0 00007ffb`11eac95b     : 00007ffb`15880168 00000404`95bb5b00 00000404`95bb5b00 000000ad`7edfd3e0 : nvinfer!cask_trt::TensorDesc::getDim+0x3e69b
000000ad`7edfd180 00007ffb`11e6e97f     : 000000ad`00000000 000001dc`4f322640 000001dc`7e328910 00007ffb`93f95ba1 : nvinfer!cask_trt::TensorDesc::getDim+0x2f59b
000000ad`7edfd370 00007ffb`11e6e8c4     : 000001dc`07ede430 000001dc`4f322510 000001dc`7e328910 000001dc`7e328910 : nvinfer!nvinfer1EnableInternalBuildFlags+0x316f
000000ad`7edfd3b0 00007ffb`45bfab08     : 000001dc`4f322510 00000000`00000000 000001dc`07f403b0 000000ad`7edfdb69 : nvinfer!nvinfer1EnableInternalBuildFlags+0x30b4

tensorrt_crash.txt
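
The debugger could not print text for the exception code in the log. As far as I know, 0xc0000374 is the value defined as STATUS_HEAP_CORRUPTION in ntstatus.h, which would match the RtlFreeHeap / RtlpLogHeapFailure frames at the top of the stack. The sketch below splits an NTSTATUS value into its documented bit fields. The names NtStatusFields, decodeNtStatus, kStatusHeapCorruption and isHeapCorruption are made up for this illustration and are not part of any Windows or TensorRT header.

```cpp
#include <cstdint>

// Illustrative helper type (hypothetical name): the fields of an NTSTATUS value.
struct NtStatusFields {
    std::uint32_t severity;  // bits 31..30: 0 success, 1 informational, 2 warning, 3 error
    bool customer;           // bit 29: set for customer-defined codes
    std::uint32_t facility;  // bits 27..16
    std::uint32_t code;      // bits 15..0
};

// Split a raw 32-bit NTSTATUS into its fields (bit 28 is reserved and ignored here).
constexpr NtStatusFields decodeNtStatus(std::uint32_t status) {
    return NtStatusFields{
        (status >> 30) & 0x3u,
        ((status >> 29) & 0x1u) != 0,
        (status >> 16) & 0xFFFu,
        status & 0xFFFFu,
    };
}

// Assumption: this mirrors the STATUS_HEAP_CORRUPTION value from ntstatus.h.
constexpr std::uint32_t kStatusHeapCorruption = 0xC0000374u;

// True when the exception code is the heap-corruption status.
constexpr bool isHeapCorruption(std::uint32_t status) {
    return status == kStatusHeapCorruption;
}
```

For the code in this dump, the severity field decodes as an error with the default facility, so the failure is reported by the heap manager itself and not by a TensorRT-specific error path.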

Is this a race condition? The models parse successfully when they are processed one at a time from ONNX file paths in a single thread.
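
One way to test the race hypothesis is to keep the worker threads but put the conversion step behind a single process-wide mutex, then check whether the crash disappears. The sketch below shows only the locking pattern. convertOnnxToPlan is a hypothetical stand-in for the real createParser()/parse()/build sequence and does not call TensorRT; the counters exist only to show that at most one thread is inside the guarded section at a time.

```cpp
#include <atomic>
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Bookkeeping for the illustration: how many threads are inside the
// conversion right now, and the highest value ever observed.
static std::atomic<int> g_active{0};
static std::atomic<int> g_maxActive{0};

// Hypothetical stand-in for the real parser/builder calls.
std::string convertOnnxToPlan(const std::string& onnxBuffer) {
    int now = ++g_active;
    int seen = g_maxActive.load();
    // Raise the recorded maximum if this entry exceeds it.
    while (now > seen && !g_maxActive.compare_exchange_weak(seen, now)) {}
    std::string plan = "PLAN:" + onnxBuffer;  // placeholder result, not a real engine
    --g_active;
    return plan;
}

// One lock shared by every thread that converts a model.
static std::mutex g_convertMutex;

std::string convertSerialized(const std::string& onnxBuffer) {
    std::lock_guard<std::mutex> lock(g_convertMutex);
    return convertOnnxToPlan(onnxBuffer);
}

// Launch one thread per model; only the guarded call is serialized.
std::vector<std::string> convertAll(const std::vector<std::string>& models) {
    std::vector<std::string> plans(models.size());
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < models.size(); ++i) {
        // Each thread writes to its own element, so plans needs no extra lock.
        workers.emplace_back([&, i] { plans[i] = convertSerialized(models[i]); });
    }
    for (auto& t : workers) {
        t.join();
    }
    return plans;
}
```

If the crash goes away with the lock in place, that points toward shared state inside the parse/build path; if it still happens, the heap corruption probably has another cause. Whether the library supports concurrent conversion without such a lock is something only its documentation or maintainers can confirm.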

duchy commented 3 years ago

Reposted this issue to NVIDIA/TensorRT.