triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

unable to load model.plan in nvidia triton #6064

Closed imenselmi closed 11 months ago

imenselmi commented 11 months ago

I'm using:

$ nvidia-smi
Sun Jul 16 10:12:52 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   38C    P0     5W /  N/A |     20MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1281      G   /usr/lib/xorg/Xorg                 14MiB |
|    0   N/A  N/A      2545      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+

$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

$ docker --version
Docker version 24.0.4, build 3713ee1

I am attempting to execute the detectron2 model.plan using NVIDIA Triton, but I encountered the following error:

I0716 09:18:42.937501 1 server.cc:588]
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend     | Path                                                            | Config                                                                                                                                                        |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tensorflow  | /opt/tritonserver/backends/tensorflow2/libtriton_tensorflow2.so | {"cmdline":{"auto-complete-config":"true","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}} |
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so | {"cmdline":{"auto-complete-config":"true","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0716 09:18:42.937541 1 server.cc:631]
+----------------------+---------+--------+
| Model                | Version | Status |
+----------------------+---------+--------+
| densenet_onnx        | 1       | READY  |
| inception_graphdef   | 1       | READY  |
| simple               | 1       | READY  |
| simple_dyna_sequence | 1       | READY  |
| simple_identity      | 1       | READY  |
| simple_int8          | 1       | READY  |
| simple_sequence      | 1       | READY  |
| simple_string        | 1       | READY  |
+----------------------+---------+--------+

I0716 09:18:42.967532 1 metrics.cc:650] Collecting metrics for GPU 0: NVIDIA GeForce GTX 1650 Ti
I0716 09:18:42.967797 1 tritonserver.cc:2214]
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton |
| server_version                   | 2.25.0 |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace |
| model_repository_path[0]         | /models |
| model_control_mode               | MODE_NONE |
| strict_model_config              | 0 |
| rate_limit                       | OFF |
| pinned_memory_pool_byte_size     | 268435456 |
| cuda_memory_pool_byte_size{0}    | 67108864 |
| response_cache_byte_size         | 0 |
| min_supported_compute_capability | 6.0 |
| strict_readiness                 | 1 |
| exit_timeout                     | 30 |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0716 09:18:42.967806 1 server.cc:262] Waiting for in-flight requests to complete.
I0716 09:18:42.967816 1 server.cc:278] Timeout 30: Found 0 model versions that have in-flight inferences
I0716 09:18:42.968091 1 tensorflow.cc:2729] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I0716 09:18:42.968142 1 tensorflow.cc:2729] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I0716 09:18:42.968153 1 tensorflow.cc:2729] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I0716 09:18:42.968157 1 tensorflow.cc:2668] TRITONBACKEND_ModelFinalize: delete model state
I0716 09:18:42.968181 1 tensorflow.cc:2668] TRITONBACKEND_ModelFinalize: delete model state
I0716 09:18:42.968210 1 server.cc:293] All models are stopped, unloading models
I0716 09:18:42.968190 1 tensorflow.cc:2729] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I0716 09:18:42.968200 1 tensorflow.cc:2668] TRITONBACKEND_ModelFinalize: delete model state
I0716 09:18:42.968241 1 tensorflow.cc:2729] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I0716 09:18:42.968184 1 tensorflow.cc:2729] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I0716 09:18:42.968253 1 model_lifecycle.cc:578] successfully unloaded 'simple_int8' version 1
I0716 09:18:42.968226 1 server.cc:300] Timeout 30: Found 8 live models and 0 in-flight non-inference requests
I0716 09:18:42.968282 1 tensorflow.cc:2668] TRITONBACKEND_ModelFinalize: delete model state
I0716 09:18:42.968254 1 onnxruntime.cc:2640] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I0716 09:18:42.968331 1 tensorflow.cc:2729] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I0716 09:18:42.968351 1 model_lifecycle.cc:578] successfully unloaded 'simple_string' version 1
I0716 09:18:42.968392 1 tensorflow.cc:2729] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I0716 09:18:42.968405 1 tensorflow.cc:2668] TRITONBACKEND_ModelFinalize: delete model state
I0716 09:18:42.968411 1 model_lifecycle.cc:578] successfully unloaded 'simple' version 1
I0716 09:18:42.968412 1 model_lifecycle.cc:578] successfully unloaded 'simple_identity' version 1
I0716 09:18:42.968424 1 tensorflow.cc:2729] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I0716 09:18:42.968420 1 tensorflow.cc:2668] TRITONBACKEND_ModelFinalize: delete model state
I0716 09:18:42.968449 1 tensorflow.cc:2668] TRITONBACKEND_ModelFinalize: delete model state
I0716 09:18:42.968605 1 model_lifecycle.cc:578] successfully unloaded 'simple_sequence' version 1
I0716 09:18:42.968903 1 model_lifecycle.cc:578] successfully unloaded 'simple_dyna_sequence' version 1
I0716 09:18:42.971017 1 model_lifecycle.cc:578] successfully unloaded 'inception_graphdef' version 1
I0716 09:18:42.972894 1 onnxruntime.cc:2586] TRITONBACKEND_ModelFinalize: delete model state
I0716 09:18:42.972920 1 model_lifecycle.cc:578] successfully unloaded 'densenet_onnx' version 1
I0716 09:18:43.968419 1 server.cc:300] Timeout 29: Found 0 live models and 0 in-flight non-inference requests
W0716 09:18:43.981524 1 metrics.cc:426] Unable to get power limit for GPU 0. Status:Success, value:0.000000
error: creating server: Internal - failed to load all models
W0716 09:18:44.982038 1 metrics.cc:426] Unable to get power limit for GPU 0. Status:Success, value:0.000000


I followed all the necessary steps and ensured the correct hierarchy:

config.pbtxt:

name: "detectron2"
platform: "tensorrt_plan"
max_batch_size: 1
input [
  {
    name: "input_tensor"
    data_type: TYPE_FP32
    dims: [ 3, 1344, 1344 ]
  }
]
output [
  {
    name: "detection_boxes_box_outputs"
    data_type: TYPE_FP32
    dims: [ 100, 4 ]
  },
  {
    name: "detection_classes_box_outputs"
    data_type: TYPE_INT32
    dims: [ -1 ]
  },
  {
    name: "detection_masks"
    data_type: TYPE_FP32
    dims: [ 100, 28, 28 ]
  },
  {
    name: "detection_scores_box_outputs"
    data_type: TYPE_FP32
    dims: [ -1 ]
  },
  {
    name: "num_detections_box_outputs"
    data_type: TYPE_INT32
    dims: [ -1 ]
  }
]
instance_group [
  {
    kind: KIND_GPU
    count: 1
  }
]
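For reference, the repository hierarchy Triton expects for a tensorrt_plan model is roughly the following (a sketch; directory names are illustrative, but the engine must be named model.plan and sit under a numeric version directory):

model_repository/
└── detectron2/
    ├── config.pbtxt
    └── 1/
        └── model.plan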

When I run the server with models that don't use TensorRT (TRT), it works without any issues. However, as soon as I add the TensorRT-optimized model.plan, the server fails to load it and shuts down.

Tabrizian commented 11 months ago

TensorRT has very tight compatibility requirements between the environment where an engine is generated and the one where it is run. You need to use the exact same TRT version and GPU both to build and to run the TRT model. The best approach might be to generate the engine in the nvcr.io/nvidia/tensorrt:<xx.yy>-py3 container and run it with the same-versioned Triton container; otherwise you'll run into compatibility issues.
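For example, a matched-version workflow might look like the sketch below (the 22.08 tag, paths, and model directory are illustrative, mirroring the commands used later in this thread):

# Build the engine with the TensorRT shipped in the 22.08 release
docker run --gpus all -it --rm -v ${PWD}/model_repository:/models nvcr.io/nvidia/tensorrt:22.08-py3
# ...then, inside that container:
tensorrt/bin/trtexec --onnx=/models/converted.onnx --saveEngine=/models/detectron2/1/model.plan

# Serve it with the Triton container from the same 22.08 release
docker run --gpus all --rm --net=host -v ${PWD}/model_repository:/models \
    nvcr.io/nvidia/tritonserver:22.08-py3 tritonserver --model-repository=/models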

dyastremsky commented 11 months ago

Closing due to inactivity. If you would like to reopen this issue for follow-up, please let us know.

imenselmi commented 11 months ago

Sorry for not replying earlier. I tried to fix it with the correct version using Docker, but I faced some problems with the Docker container. I really want to fix it because it's a crucial part of my end-of-study project, and it's essential to me that this part works correctly:

Environment

TensorRT Docker image version: 22.08

TensorRT version: 8.4.2

NVIDIA GPU: NVIDIA GeForce GTX 1650 Ti

Memory: 15.4 GiB

NVIDIA Driver Version: 520.61.05

CUDA Version: 11.8

Operating System: Ubuntu 20.04 LTS

Python Version: 3.8.10

PyTorch Version in docker container: 1.9.0

ONNX Version in docker container: 1.9.0

ONNX Runtime Version in docker container: 1.8.1

The error : &&&& RUNNING TensorRT.trtexec [TensorRT v8402] # tensorrt/bin/trtexec --onnx=/models/converted.onnx --saveEngine=engine.trt --useCudaGraph [08/02/2023-19:24:24] [I] === Model Options === [08/02/2023-19:24:24] [I] Format: ONNX [08/02/2023-19:24:24] [I] Model: /models/converted.onnx [08/02/2023-19:24:24] [I] Output: [08/02/2023-19:24:24] [I] === Build Options === [08/02/2023-19:24:24] [I] Max batch: explicit batch [08/02/2023-19:24:24] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default [08/02/2023-19:24:24] [I] minTiming: 1 [08/02/2023-19:24:24] [I] avgTiming: 8 [08/02/2023-19:24:24] [I] Precision: FP32 [08/02/2023-19:24:24] [I] LayerPrecisions: [08/02/2023-19:24:24] [I] Calibration: [08/02/2023-19:24:24] [I] Refit: Disabled [08/02/2023-19:24:24] [I] Sparsity: Disabled [08/02/2023-19:24:24] [I] Safe mode: Disabled [08/02/2023-19:24:24] [I] DirectIO mode: Disabled [08/02/2023-19:24:24] [I] Restricted mode: Disabled [08/02/2023-19:24:24] [I] Build only: Disabled [08/02/2023-19:24:24] [I] Save engine: engine.trt [08/02/2023-19:24:24] [I] Load engine: [08/02/2023-19:24:24] [I] Profiling verbosity: 0 [08/02/2023-19:24:24] [I] Tactic sources: Using default tactic sources [08/02/2023-19:24:24] [I] timingCacheMode: local [08/02/2023-19:24:24] [I] timingCacheFile: [08/02/2023-19:24:24] [I] Input(s)s format: fp32:CHW [08/02/2023-19:24:24] [I] Output(s)s format: fp32:CHW [08/02/2023-19:24:24] [I] Input build shapes: model [08/02/2023-19:24:24] [I] Input calibration shapes: model [08/02/2023-19:24:24] [I] === System Options === [08/02/2023-19:24:24] [I] Device: 0 [08/02/2023-19:24:24] [I] DLACore: [08/02/2023-19:24:24] [I] Plugins: [08/02/2023-19:24:24] [I] === Inference Options === [08/02/2023-19:24:24] [I] Batch: Explicit [08/02/2023-19:24:24] [I] Input inference shapes: model [08/02/2023-19:24:24] [I] Iterations: 10 [08/02/2023-19:24:24] [I] Duration: 3s (+ 200ms warm up) [08/02/2023-19:24:24] [I] Sleep time: 0ms [08/02/2023-19:24:24] [I] Idle time: 0ms [08/02/2023-19:24:24] [I] Streams: 1 [08/02/2023-19:24:24] [I] ExposeDMA: Disabled [08/02/2023-19:24:24] [I] Data transfers: Enabled [08/02/2023-19:24:24] [I] Spin-wait: Disabled [08/02/2023-19:24:24] [I] Multithreading: Disabled [08/02/2023-19:24:24] [I] CUDA Graph: Enabled [08/02/2023-19:24:24] [I] Separate profiling: Disabled [08/02/2023-19:24:24] [I] Time Deserialize: Disabled [08/02/2023-19:24:24] [I] Time Refit: Disabled [08/02/2023-19:24:24] [I] Inputs: [08/02/2023-19:24:24] [I] === Reporting Options === [08/02/2023-19:24:24] [I] Verbose: Disabled [08/02/2023-19:24:24] [I] Averages: 10 inferences [08/02/2023-19:24:24] [I] Percentile: 99 [08/02/2023-19:24:24] [I] Dump refittable layers:Disabled [08/02/2023-19:24:24] [I] Dump output: Disabled [08/02/2023-19:24:24] [I] Profile: Disabled [08/02/2023-19:24:24] [I] Export timing to JSON file: [08/02/2023-19:24:24] [I] Export output to JSON file: [08/02/2023-19:24:24] [I] Export profile to JSON file: [08/02/2023-19:24:24] [I] [08/02/2023-19:24:24] [I] === Device Information === [08/02/2023-19:24:24] [I] Selected Device: NVIDIA GeForce GTX 1650 Ti [08/02/2023-19:24:24] [I] Compute Capability: 7.5 [08/02/2023-19:24:24] [I] SMs: 16 [08/02/2023-19:24:24] [I] Compute Clock Rate: 1.485 GHz [08/02/2023-19:24:24] [I] Device Global Memory: 3912 MiB [08/02/2023-19:24:24] [I] Shared Memory per SM: 64 KiB [08/02/2023-19:24:24] [I] Memory Bus Width: 128 bits (ECC disabled) [08/02/2023-19:24:24] [I] Memory Clock Rate: 6.001 GHz [08/02/2023-19:24:24] [I] 
[08/02/2023-19:24:24] [I] TensorRT version: 8.4.2 [08/02/2023-19:24:24] [I] [TRT] [MemUsageChange] Init CUDA: CPU +311, GPU +0, now: CPU 319, GPU 230 (MiB) [08/02/2023-19:24:25] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +207, GPU +68, now: CPU 545, GPU 298 (MiB) [08/02/2023-19:24:25] [I] Start parsing network model [08/02/2023-19:24:26] [I] [TRT] ---------------------------------------------------------------- [08/02/2023-19:24:26] [I] [TRT] Input filename: /models/converted.onnx [08/02/2023-19:24:26] [I] [TRT] ONNX IR version: 0.0.9 [08/02/2023-19:24:26] [I] [TRT] Opset version: 11 [08/02/2023-19:24:26] [I] [TRT] Producer name: pytorch [08/02/2023-19:24:26] [I] [TRT] Producer version: 2.0.1 [08/02/2023-19:24:26] [I] [TRT] Domain:
[08/02/2023-19:24:26] [I] [TRT] Model version: 0 [08/02/2023-19:24:26] [I] [TRT] Doc string:
[08/02/2023-19:24:26] [I] [TRT] ---------------------------------------------------------------- [08/02/2023-19:24:26] [W] [TRT] parsers/onnx/onnx2trt_utils.cpp:367: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32. [08/02/2023-19:24:26] [I] [TRT] No importer registered for op: EfficientNMS_TRT. Attempting to import as plugin. [08/02/2023-19:24:26] [I] [TRT] Searching for plugin: EfficientNMS_TRT, plugin_version: 1, plugin_namespace: [08/02/2023-19:24:26] [I] [TRT] Successfully created plugin: EfficientNMS_TRT [08/02/2023-19:24:26] [I] [TRT] No importer registered for op: PyramidROIAlign_TRT. Attempting to import as plugin. [08/02/2023-19:24:26] [I] [TRT] Searching for plugin: PyramidROIAlign_TRT, plugin_version: 1, plugin_namespace: [08/02/2023-19:24:26] [W] [TRT] parsers/onnx/builtin_op_importers.cpp:4714: Attribute roi_coords_plusone not found in plugin node! Ensure that the plugin creator has a default value defined or the engine may fail to build. [08/02/2023-19:24:26] [W] [TRT] parsers/onnx/builtin_op_importers.cpp:4714: Attribute legacy not found in plugin node! Ensure that the plugin creator has a default value defined or the engine may fail to build. [08/02/2023-19:24:26] [I] [TRT] Successfully created plugin: PyramidROIAlign_TRT [08/02/2023-19:24:26] [I] [TRT] No importer registered for op: EfficientNMS_TRT. Attempting to import as plugin. [08/02/2023-19:24:26] [I] [TRT] Searching for plugin: EfficientNMS_TRT, plugin_version: 1, plugin_namespace: [08/02/2023-19:24:26] [I] [TRT] Successfully created plugin: EfficientNMS_TRT [08/02/2023-19:24:26] [I] [TRT] No importer registered for op: PyramidROIAlign_TRT. Attempting to import as plugin. [08/02/2023-19:24:26] [I] [TRT] Searching for plugin: PyramidROIAlign_TRT, plugin_version: 1, plugin_namespace: [08/02/2023-19:24:26] [W] [TRT] parsers/onnx/builtin_op_importers.cpp:4714: Attribute roi_coords_plusone not found in plugin node! Ensure that the plugin creator has a default value defined or the engine may fail to build. [08/02/2023-19:24:26] [W] [TRT] parsers/onnx/builtin_op_importers.cpp:4714: Attribute legacy not found in plugin node! Ensure that the plugin creator has a default value defined or the engine may fail to build. [08/02/2023-19:24:26] [I] [TRT] Successfully created plugin: PyramidROIAlign_TRT [08/02/2023-19:24:26] [I] Finish parsing network model [08/02/2023-19:24:27] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +499, GPU +216, now: CPU 1240, GPU 522 (MiB) [08/02/2023-19:24:27] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +117, GPU +54, now: CPU 1357, GPU 576 (MiB) [08/02/2023-19:24:27] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored. 
[08/02/2023-19:26:38] [E] Error[2]: [virtualMemoryBuffer.cpp::resizePhysical::144] Error Code 2: OutOfMemory (no further information)
[08/02/2023-19:26:38] [E] Error[2]: [virtualMemoryBuffer.cpp::resizePhysical::144] Error Code 2: OutOfMemory (no further information)
[08/02/2023-19:26:38] [W] [TRT] -------------- The current system memory allocations dump as below --------------
[08/02/2023-19:26:38] [W] [TRT] [0x670639e0]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 170 idx: 19830 time: 6.64e-07
[08/02/2023-19:26:38] [W] [TRT] [0x67067980]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 170 idx: 19827 time: 7.65e-07
[08/02/2023-19:26:38] [W] [TRT] [0x67063560]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 170 idx: 19821 time: 6.3e-07
[08/02/2023-19:26:38] [W] [TRT] [0x67061120]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 170 idx: 19815 time: 1.158e-06

..... .... [08/02/2023-19:26:39] [W] [TRT] [0x30e94800]:4 :: weight zero-point in internalAllocate: at runtime/common/weightsPtr.cpp: 102 idx: 17 time: 4.3e-08 [08/02/2023-19:26:39] [W] [TRT] [0x6704ae30]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 170 idx: 14362 time: 2.53e-07 [08/02/2023-19:26:39] [W] [TRT] [0x6704b5e0]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 170 idx: 14377 time: 1.89e-07 [08/02/2023-19:26:39] [W] [TRT] [0x67041840]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 170 idx: 10836 time: 6.33e-07 [08/02/2023-19:26:39] [W] [TRT] [0x6704a2e0]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 170 idx: 14693 time: 4.7e-08 [08/02/2023-19:26:39] [W] [TRT] [0x67060b80]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 170 idx: 19416 time: 3.9e-08 [08/02/2023-19:26:39] [W] [TRT] [0x67042650]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 170 idx: 14708 time: 4.7e-08 [08/02/2023-19:26:39] [W] [TRT] [0x2157ac40]:1642496 :Cudnn Builder weights ptr in internalAllocate: at runtime/common/weightsPtr.cpp: 102 idx: 175 time: 3.31e-07 [08/02/2023-19:26:39] [W] [TRT] [0x66ee9c20]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 170 idx: 6348 time: 2.58e-07 [08/02/2023-19:26:39] [W] [TRT] [0x67051510]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 170 idx: 14714 time: 1.69e-07 [08/02/2023-19:26:39] [W] [TRT] [0x67036d80]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 170 idx: 8885 time: 3.33e-07 [08/02/2023-19:26:39] [W] [TRT] [0x6704c810]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 170 idx: 15567 time: 2.73e-07 [08/02/2023-19:26:39] [W] [TRT] [0x66ee6540]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 170 idx: 6336 time: 1.92e-07 [08/02/2023-19:26:39] [W] [TRT] [0x6704de30]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 170 idx: 15573 time: 2.33e-07 [08/02/2023-19:26:39] [W] [TRT] [0x6704f560]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 170 idx: 15576 time: 2.88e-07 [08/02/2023-19:26:39] [W] [TRT] -------------- The current device memory allocations dump as below -------------- [08/02/2023-19:26:39] [W] [TRT] [0]:4294967296 :HybridGlobWriter in reserveMemory: at optimizer/common/globWriter.cpp: 438 idx: 4708 time: 0.0150887 [08/02/2023-19:26:39] [W] [TRT] [0x302000000]:2232418304 :HybridGlobWriter in reserveMemory: at optimizer/common/globWriter.cpp: 416 idx: 3046 time: 0.00204409 [08/02/2023-19:26:39] [W] [TRT] Requested amount of GPU memory (4294967296 bytes) could not be allocated. There may not be enough free memory for allocation to succeed. [08/02/2023-19:26:39] [W] [TRT] Skipping tactic 7 due to insufficient memory on requested size of 4294967296 detected for tactic 0x000000000000003d. Try decreasing the workspace size with IBuilderConfig::setMemoryPoolLimit(). 
[08/02/2023-19:26:58] [W] [TRT] Skipping tactic 0x0000000000000000 due to Myelin error: autotuning: CUDA error 2 allocating 0-byte buffer: out of memory
[08/02/2023-19:26:58] [E] Error[10]: [optimizer.cpp::computeCosts::3626] Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[box_outputs/reshape_classes...mask_head/final_reshape]}.)
[08/02/2023-19:26:58] [E] Error[2]: [builder.cpp::buildSerializedNetwork::636] Error Code 2: Internal Error (Assertion engine != nullptr failed. )
[08/02/2023-19:26:58] [E] Engine could not be created from network
[08/02/2023-19:26:58] [E] Building engine failed
[08/02/2023-19:26:58] [E] Failed to create engine from model or file.
[08/02/2023-19:26:58] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8402] # tensorrt/bin/trtexec --onnx=/models/converted.onnx --saveEngine=engine.trt --useCudaGraph
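As the warnings above indicate, the engine build runs out of GPU memory while autotuning tactics on the 4 GB GTX 1650 Ti. One workaround (a sketch; this is also the approach that resolves the issue later in this thread) is to cap the builder workspace when invoking trtexec:

# Limit the builder workspace to roughly 2000 MiB so tactic autotuning
# stays within the available GPU memory
tensorrt/bin/trtexec --onnx=/models/converted.onnx \
    --saveEngine=engine.trt \
    --workspace=2000 \
    --useCudaGraph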


imenselmi commented 11 months ago

Using the standalone TensorRT 8.4.2.4 install (outside Docker), trtexec works and converts the ONNX model to a TRT engine:

[08/02/2023-22:16:01] [I] Timing trace has 29 queries over 3.34663 s
[08/02/2023-22:16:01] [I]
[08/02/2023-22:16:01] [I] === Trace details ===
[08/02/2023-22:16:01] [I] Trace averages of 10 runs:
[08/02/2023-22:16:01] [I] Average on 10 runs - GPU latency: 111.568 ms - Host latency: 112.975 ms (enqueue 0.173483 ms)
[08/02/2023-22:16:01] [I] Average on 10 runs - GPU latency: 111.558 ms - Host latency: 112.866 ms (enqueue 0.201147 ms)
[08/02/2023-22:16:01] [I]
[08/02/2023-22:16:01] [I] === Performance summary ===
[08/02/2023-22:16:01] [I] Throughput: 8.66543 qps
[08/02/2023-22:16:01] [I] Latency: min = 111.471 ms, max = 115.505 ms, mean = 112.918 ms, median = 112.915 ms, percentile(99%) = 115.505 ms
[08/02/2023-22:16:01] [I] Enqueue Time: min = 0.032692 ms, max = 0.217773 ms, mean = 0.191826 ms, median = 0.197388 ms, percentile(99%) = 0.217773 ms
[08/02/2023-22:16:01] [I] H2D Latency: min = 1.21436 ms, max = 1.65875 ms, mean = 1.26646 ms, median = 1.2356 ms, percentile(99%) = 1.65875 ms
[08/02/2023-22:16:01] [I] GPU Compute Time: min = 110.145 ms, max = 114.097 ms, mean = 111.579 ms, median = 111.534 ms, percentile(99%) = 114.097 ms
[08/02/2023-22:16:01] [I] D2H Latency: min = 0.0625 ms, max = 0.0930176 ms, mean = 0.0723519 ms, median = 0.0713501 ms, percentile(99%) = 0.0930176 ms
[08/02/2023-22:16:01] [I] Total Host Walltime: 3.34663 s
[08/02/2023-22:16:01] [I] Total GPU Compute Time: 3.2358 s
[08/02/2023-22:16:01] [I] Explanations of the performance metrics are printed in the verbose logs.
[08/02/2023-22:16:01] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8402] # /home/mj/Downloads/TensorRT-8.4.2.4/bin/trtexec --onnx=/home/mj/Documents/pfe/server/docs/examples/model_repository/converted1_batch.onnx --saveEngine=engine.trt --useCudaGraph


But when I run NVIDIA Triton, it doesn't work:

(base) mj@mj-G5-5500:~/Documents/pfe/server/docs/examples$ sudo docker run --gpus all --rm --net=host -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:22.08-py3 tritonserver --model-repository=/models [sudo] password for mj:

=============================
== Triton Inference Server ==
=============================

NVIDIA Release 22.08 (build 42766143)
Triton Server Version 2.25.0

Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License. By pulling and using the container, you accept the terms and conditions of this license: https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

I0802 21:19:29.077600 1 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f34b4000000' with size 268435456
I0802 21:19:29.078684 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0802 21:19:29.082981 1 model_lifecycle.cc:459] loading: detectron_trt:1
I0802 21:19:29.195216 1 tensorrt.cc:5441] TRITONBACKEND_Initialize: tensorrt
I0802 21:19:29.195438 1 tensorrt.cc:5451] Triton TRITONBACKEND API version: 1.10
I0802 21:19:29.195655 1 tensorrt.cc:5457] 'tensorrt' TRITONBACKEND API version: 1.10
I0802 21:19:29.196283 1 tensorrt.cc:5500] backend configuration: {"cmdline":{"auto-complete-config":"true","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}}
I0802 21:19:29.196320 1 tensorrt.cc:5552] TRITONBACKEND_ModelInitialize: detectron_trt (version 1)
I0802 21:19:29.733873 1 logging.cc:49] [MemUsageChange] Init CUDA: CPU +303, GPU +0, now: CPU 320, GPU 304 (MiB)
I0802 21:19:29.991568 1 logging.cc:49] Loaded engine size: 218 MiB
E0802 21:19:29.996602 1 logging.cc:43] 6: The engine plan file is not compatible with this version of TensorRT, expecting library version 8.4.2.4 got 8.4.3.1, please rebuild.
E0802 21:19:30.017573 1 logging.cc:43] 4: [runtime.cpp::deserializeCudaEngine::50] Error Code 4: Internal Error (Engine deserialization failed.)


dyastremsky commented 11 months ago

If you look at the error message, your version of TensorRT does not match the version in the 22.08 container. You can see which TensorRT version ships with the 22.08 release of Triton on the release page.
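One quick way to compare the two versions (a sketch; it assumes the TensorRT libraries in the NGC images are installed as Debian packages) is to list them in each container:

# TensorRT shipped in the 22.08 Triton container
docker run --rm nvcr.io/nvidia/tritonserver:22.08-py3 bash -c "dpkg -l | grep -iE 'tensorrt|libnvinfer'"

# TensorRT shipped in the 22.08 TensorRT container
docker run --rm nvcr.io/nvidia/tensorrt:22.08-py3 bash -c "dpkg -l | grep -iE 'tensorrt|libnvinfer'"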

As Iman mentioned earlier, the easiest way to get the right version of TRT would be to use trtexec in the TensorRT containers (e.g. the 22.08 TRT container to match the 22.08 Triton container): https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorrt. Looking at the release page, there is no version of Triton which has TRT built for 8.4.3.1, so it wouldn't be as simple as changing the Triton container version.

imenselmi commented 11 months ago

@dyastremsky I fixed the error. Thank you!

dyastremsky commented 11 months ago

Wonderful, happy you found a solution. Thanks for updating us!

imenselmi commented 11 months ago

I solved the issue by adding '--workspace=2000'. It works for me now:

$ sudo docker run --gpus all -it --rm -v ${PWD}/model_repository:/models nvcr.io/nvidia/tensorrt:22.08-py3

Then run:

$ tensorrt/bin/trtexec --onnx=/models/converted.onnx --saveEngine=engine.trt --workspace=2000 --useCudaGraph

$ cp engine.trt /models

$ sudo docker run --gpus all --rm --net=host -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:22.08-py3 tritonserver --model-repository=/models
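Note: on newer TensorRT releases the --workspace flag is deprecated in favor of --memPoolSize, so an equivalent invocation (a sketch, assuming trtexec from TensorRT 8.4 or later; the value is in MiB) would be:

$ tensorrt/bin/trtexec --onnx=/models/converted.onnx --saveEngine=engine.trt --memPoolSize=workspace:2000 --useCudaGraph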