triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

What's the behavior of python_backend.InferenceRequest.exec()? #5174

Closed · zxOnVacation closed this issue 1 year ago

zxOnVacation commented 1 year ago

I want to use Triton + TensorRT to deploy Whisper, a transformer-like ASR model.
I want to use a kv-cache to speed up inference, so I use the Python backend and DLPack for this. When I build the decoder with TensorRT and use trtexec to measure the decoder's performance, I get the output below:

trtexec --loadEngine=decoder_128.plan --warmUp=0 --duration=0 --iterations=50
[12/15/2022-12:04:25] [I] === Model Options ===
[12/15/2022-12:04:25] [I] Format: *
[12/15/2022-12:04:25] [I] Model:
[12/15/2022-12:04:25] [I] Output:
[12/15/2022-12:04:25] [I] === Build Options ===
[12/15/2022-12:04:25] [I] Max batch: 1
[12/15/2022-12:04:25] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[12/15/2022-12:04:25] [I] minTiming: 1
[12/15/2022-12:04:25] [I] avgTiming: 8
[12/15/2022-12:04:25] [I] Precision: FP32
[12/15/2022-12:04:25] [I] LayerPrecisions:
[12/15/2022-12:04:25] [I] Calibration:
[12/15/2022-12:04:25] [I] Refit: Disabled
[12/15/2022-12:04:25] [I] Sparsity: Disabled
[12/15/2022-12:04:25] [I] Safe mode: Disabled
[12/15/2022-12:04:25] [I] DirectIO mode: Disabled
[12/15/2022-12:04:25] [I] Restricted mode: Disabled
[12/15/2022-12:04:25] [I] Build only: Disabled
[12/15/2022-12:04:25] [I] Save engine:
[12/15/2022-12:04:25] [I] Load engine: decoder_128.plan
[12/15/2022-12:04:25] [I] Profiling verbosity: 0
[12/15/2022-12:04:25] [I] Tactic sources: Using default tactic sources
[12/15/2022-12:04:25] [I] timingCacheMode: local
[12/15/2022-12:04:25] [I] timingCacheFile:
[12/15/2022-12:04:25] [I] Heuristic: Disabled
[12/15/2022-12:04:25] [I] Preview Features: Use default preview flags.
[12/15/2022-12:04:25] [I] Input(s)s format: fp32:CHW
[12/15/2022-12:04:25] [I] Output(s)s format: fp32:CHW
[12/15/2022-12:04:25] [I] Input build shapes: model
[12/15/2022-12:04:25] [I] Input calibration shapes: model
[12/15/2022-12:04:25] [I] === System Options ===
[12/15/2022-12:04:25] [I] Device: 0
[12/15/2022-12:04:25] [I] DLACore:
[12/15/2022-12:04:25] [I] Plugins:
[12/15/2022-12:04:25] [I] === Inference Options ===
[12/15/2022-12:04:25] [I] Batch: 1
[12/15/2022-12:04:25] [I] Input inference shapes: model
[12/15/2022-12:04:25] [I] Iterations: 50
[12/15/2022-12:04:25] [I] Duration: 0s (+ 0ms warm up)
[12/15/2022-12:04:25] [I] Sleep time: 0ms
[12/15/2022-12:04:25] [I] Idle time: 0ms
[12/15/2022-12:04:25] [I] Streams: 1
[12/15/2022-12:04:25] [I] ExposeDMA: Disabled
[12/15/2022-12:04:25] [I] Data transfers: Enabled
[12/15/2022-12:04:25] [I] Spin-wait: Disabled
[12/15/2022-12:04:25] [I] Multithreading: Disabled
[12/15/2022-12:04:25] [I] CUDA Graph: Disabled
[12/15/2022-12:04:25] [I] Separate profiling: Disabled
[12/15/2022-12:04:25] [I] Time Deserialize: Disabled
[12/15/2022-12:04:25] [I] Time Refit: Disabled
[12/15/2022-12:04:25] [I] NVTX verbosity: 0
[12/15/2022-12:04:25] [I] Persistent Cache Ratio: 0
[12/15/2022-12:04:25] [I] Inputs:
[12/15/2022-12:04:25] [I] === Reporting Options ===
[12/15/2022-12:04:25] [I] Verbose: Disabled
[12/15/2022-12:04:25] [I] Averages: 10 inferences
[12/15/2022-12:04:25] [I] Percentiles: 90,95,99
[12/15/2022-12:04:25] [I] Dump refittable layers:Disabled
[12/15/2022-12:04:25] [I] Dump output: Disabled
[12/15/2022-12:04:25] [I] Profile: Disabled
[12/15/2022-12:04:25] [I] Export timing to JSON file:
[12/15/2022-12:04:25] [I] Export output to JSON file:
[12/15/2022-12:04:25] [I] Export profile to JSON file:
[12/15/2022-12:04:25] [I]
[12/15/2022-12:04:25] [I] === Device Information ===
[12/15/2022-12:04:25] [I] Selected Device: NVIDIA A10
[12/15/2022-12:04:25] [I] Compute Capability: 8.6
[12/15/2022-12:04:25] [I] SMs: 72
[12/15/2022-12:04:25] [I] Compute Clock Rate: 1.695 GHz
[12/15/2022-12:04:25] [I] Device Global Memory: 22731 MiB
[12/15/2022-12:04:25] [I] Shared Memory per SM: 100 KiB
[12/15/2022-12:04:25] [I] Memory Bus Width: 384 bits (ECC enabled)
[12/15/2022-12:04:25] [I] Memory Clock Rate: 6.251 GHz
[12/15/2022-12:04:25] [I]
[12/15/2022-12:04:25] [I] TensorRT version: 8.5.1
[12/15/2022-12:04:25] [I] Engine loaded in 0.272455 sec.
[12/15/2022-12:04:25] [I] [TRT] Loaded engine size: 344 MiB
[12/15/2022-12:04:27] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +853, GPU +360, now: CPU 1687, GPU 1114 (MiB)
[12/15/2022-12:04:27] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +125, GPU +58, now: CPU 1812, GPU 1172 (MiB)
[12/15/2022-12:04:27] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +341, now: CPU 0, GPU 341 (MiB)
[12/15/2022-12:04:27] [I] Engine deserialized in 1.71534 sec.
[12/15/2022-12:04:27] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1813, GPU 1164 (MiB)
[12/15/2022-12:04:27] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1813, GPU 1172 (MiB)
[12/15/2022-12:04:27] [W] [TRT]  (foreignNode) cuBLASLt subversions: compiled against 11.5.1.0 but running against 11.11.3.0.
[12/15/2022-12:04:27] [W] [TRT]  (foreignNode) cuBLASLt subversions: compiled against 11.5.1.0 but running against 11.11.3.0.
[12/15/2022-12:04:27] [W] [TRT]  (foreignNode) cuBLASLt subversions: compiled against 11.5.1.0 but running against 11.11.3.0.
[12/15/2022-12:04:27] [W] [TRT]  (foreignNode) cuBLASLt subversions: compiled against 11.5.1.0 but running against 11.11.3.0.
[12/15/2022-12:04:27] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +5, now: CPU 0, GPU 346 (MiB)
[12/15/2022-12:04:27] [W] [TRT] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See `CUDA_MODULE_LOADING` in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
[12/15/2022-12:04:27] [I] Setting persistentCacheLimit to 0 bytes.
[12/15/2022-12:04:27] [W] Shape missing for input with dynamic shape: wd_tokensAutomatically setting shape to: 1x1
[12/15/2022-12:04:27] [W] Shape missing for input with dynamic shape: wd_offsetAutomatically setting shape to: 1x1
[12/15/2022-12:04:27] [W] Shape missing for input with dynamic shape: layer_0_self_keyAutomatically setting shape to: 1x12x1x64
[12/15/2022-12:04:27] [W] Shape missing for input with dynamic shape: layer_0_self_valueAutomatically setting shape to: 1x12x1x64
[12/15/2022-12:04:27] [W] Shape missing for input with dynamic shape: layer_0_cross_keyAutomatically setting shape to: 1x12x1500x64
[12/15/2022-12:04:27] [W] Shape missing for input with dynamic shape: layer_0_cross_valueAutomatically setting shape to: 1x12x1500x64
[12/15/2022-12:04:27] [W] (the same four shape warnings repeat for layer_1 through layer_11, with self_* shapes set to 1x12x1x64 and cross_* shapes to 1x12x1500x64)
[12/15/2022-12:04:27] [I] Using random values for input wd_tokens
[12/15/2022-12:04:27] [I] Created input binding for wd_tokens with dimensions 1x1
[12/15/2022-12:04:27] [I] Using random values for input wd_offset
[12/15/2022-12:04:27] [I] Created input binding for wd_offset with dimensions 1x1
[12/15/2022-12:04:27] [I] Using random values for input layer_0_self_key
[12/15/2022-12:04:27] [I] Created input binding for layer_0_self_key with dimensions 1x12x1x64
[12/15/2022-12:04:27] [I] Using random values for input layer_0_self_value
[12/15/2022-12:04:27] [I] Created input binding for layer_0_self_value with dimensions 1x12x1x64
[12/15/2022-12:04:27] [I] Using random values for input layer_0_cross_key
[12/15/2022-12:04:27] [I] Created input binding for layer_0_cross_key with dimensions 1x12x1500x64
[12/15/2022-12:04:27] [I] Using random values for input layer_0_cross_value
[12/15/2022-12:04:27] [I] Created input binding for layer_0_cross_value with dimensions 1x12x1500x64
[12/15/2022-12:04:27] [I] (the same random-value / input-binding messages repeat for layer_1 through layer_11)
[12/15/2022-12:04:27] [I] Using random values for output layer_0_self_key_out
[12/15/2022-12:04:27] [I] Created output binding for layer_0_self_key_out with dimensions 1x12x2x64
[12/15/2022-12:04:27] [I] Using random values for output layer_0_self_value_out
[12/15/2022-12:04:27] [I] Created output binding for layer_0_self_value_out with dimensions 1x12x2x64
[12/15/2022-12:04:27] [I] (the same messages repeat for the layer_1 through layer_11 self_key_out / self_value_out output bindings, each with dimensions 1x12x2x64)
[12/15/2022-12:04:27] [I] Using random values for output decoder_out
[12/15/2022-12:04:27] [I] Created output binding for decoder_out with dimensions 1x1x51865
[12/15/2022-12:04:27] [I] Starting inference
[12/15/2022-12:04:27] [I] Warmup completed 0 queries over 0 ms
[12/15/2022-12:04:27] [I] Timing trace has 50 queries over 0.245042 s
[12/15/2022-12:04:27] [I]
[12/15/2022-12:04:27] [I] === Trace details ===
[12/15/2022-12:04:27] [I] Trace averages of 10 runs:
[12/15/2022-12:04:27] [I] Average on 10 runs - GPU latency: 1.98892 ms - Host latency: 6.94839 ms (enqueue 1.55723 ms)
[12/15/2022-12:04:27] [I] Average on 10 runs - GPU latency: 1.98441 ms - Host latency: 6.91743 ms (enqueue 1.38867 ms)
[12/15/2022-12:04:27] [I] Average on 10 runs - GPU latency: 1.98666 ms - Host latency: 6.93235 ms (enqueue 1.36728 ms)
[12/15/2022-12:04:27] [I] Average on 10 runs - GPU latency: 1.98881 ms - Host latency: 6.92795 ms (enqueue 1.39127 ms)
[12/15/2022-12:04:27] [I] Average on 10 runs - GPU latency: 1.97304 ms - Host latency: 6.89634 ms (enqueue 1.32582 ms)
[12/15/2022-12:04:27] [I]
[12/15/2022-12:04:27] [I] === Performance summary ===
[12/15/2022-12:04:27] [I] Throughput: 204.047 qps
[12/15/2022-12:04:27] [I] Latency: min = 6.80411 ms, max = 7.08138 ms, mean = 6.92449 ms, median = 6.91629 ms, percentile(90%) = 6.95837 ms, percentile(95%) = 7.01323 ms, percentile(99%) = 7.08138 ms
[12/15/2022-12:04:27] [I] Enqueue Time: min = 1.2683 ms, max = 2.39101 ms, mean = 1.40606 ms, median = 1.377 ms, percentile(90%) = 1.48555 ms, percentile(95%) = 1.55728 ms, percentile(99%) = 2.39101 ms
[12/15/2022-12:04:27] [I] H2D Latency: min = 4.79916 ms, max = 5.00237 ms, mean = 4.85155 ms, median = 4.84588 ms, percentile(90%) = 4.87872 ms, percentile(95%) = 4.94078 ms, percentile(99%) = 5.00237 ms
[12/15/2022-12:04:27] [I] GPU Compute Time: min = 1.87802 ms, max = 2.02445 ms, mean = 1.98437 ms, median = 1.98758 ms, percentile(90%) = 1.99373 ms, percentile(95%) = 1.99884 ms, percentile(99%) = 2.02445 ms
[12/15/2022-12:04:27] [I] D2H Latency: min = 0.0473633 ms, max = 0.0989113 ms, mean = 0.0885731 ms, median = 0.0892487 ms, percentile(90%) = 0.0930176 ms, percentile(95%) = 0.0947533 ms, percentile(99%) = 0.0989113 ms
[12/15/2022-12:04:27] [I] Total Host Walltime: 0.245042 s
[12/15/2022-12:04:27] [I] Total GPU Compute Time: 0.0992184 s
[12/15/2022-12:04:27] [W] * Throughput may be bound by host-to-device transfers for the inputs rather than GPU Compute and the GPU may be under-utilized.
[12/15/2022-12:04:27] [W]   Add --noDataTransfers flag to disable data transfers.
[12/15/2022-12:04:27] [I] Explanations of the performance metrics are printed in the verbose logs.
[12/15/2022-12:04:27] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8501] # trtexec --loadEngine=decoder_128.plan --warmUp=0 --duration=0 --iterations=50

So it only takes 1+ ms of GPU compute per inference. But when I use the Python backend to run inference as below:

        d_ins = []
        t = time.time()
        # Current token and decoding offset, cast to int32 on the GPU.
        d_ins.append(pb_utils.Tensor.from_dlpack("wd_tokens", to_dlpack(self.tokens[:, -1:].type(torch.int32).cuda())))
        d_ins.append(pb_utils.Tensor.from_dlpack("wd_offset", to_dlpack(torch.tensor([[index + 2]]).type(torch.int32).cuda())))
        # Hand the per-layer kv-cache tensors over via DLPack (zero-copy).
        for i in range(12):
            d_ins.append(pb_utils.Tensor.from_dlpack("layer_%s_self_key" % i, ca_cache[4 * i + 0].to_dlpack()))
            d_ins.append(pb_utils.Tensor.from_dlpack("layer_%s_self_value" % i, ca_cache[4 * i + 1].to_dlpack()))
            d_ins.append(pb_utils.Tensor.from_dlpack("layer_%s_cross_key" % i, ca_cache[4 * i + 2].to_dlpack()))
            d_ins.append(pb_utils.Tensor.from_dlpack("layer_%s_cross_value" % i, ca_cache[4 * i + 3].to_dlpack()))
        logging.error((time.time() - t) * 1000)  # input-preparation time (ms)
        d_req = pb_utils.InferenceRequest(model_name="whisper_decoder_128", requested_output_names=[], inputs=d_ins)
        t = time.time()
        d_rep = d_req.exec()  # synchronous BLS call into the TensorRT model
        logging.error((time.time() - t) * 1000)  # exec() time (ms)
ERROR:root:13.051986694335938

It takes 13+ ms to exec one decoder inference. I thought DLPack was zero-copy, so where does the extra cost of Python-backend inference come from? And how can I change my Python-backend code to reach the ~1 ms inference speed?

rmccorm4 commented 1 year ago

Hi @jike-algorithm-zhangxiao ,

On trtexec side:

it just cost 1+ms to infer

I believe you're only looking at the GPU compute time from the trtexec results but not the other latencies involved in the end-to-end process. I think this line may be a bit more of an apples-to-apples comparison:

[12/15/2022-12:04:27] [I] Average on 10 runs - GPU latency: 1.98892 ms - Host latency: 6.94839 ms (enqueue 1.55723 ms)

which looks to be about ~9-11 ms (not sure if enqueue is included in Host or not).

On tritonserver side:

It cost 13+ ms to exec a decoder inference

t = time.time()
d_rep = d_req.exec()
logging.error((time.time() - t) * 1000)

Off the top of my head I don't know exactly what else might be involved in the BLS pipeline besides model execution. I'm not sure whether any extra copies/gathers would be involved when keeping memory on the GPU. @Tabrizian, could you comment?

zxOnVacation commented 1 year ago

Hi @rmccorm4 In the log

[12/15/2022-12:04:27] [I] === Performance summary ===
[12/15/2022-12:04:27] [I] Throughput: 204.047 qps
[12/15/2022-12:04:27] [I] Latency: min = 6.80411 ms, max = 7.08138 ms, mean = 6.92449 ms, median = 6.91629 ms, percentile(90%) = 6.95837 ms, percentile(95%) = 7.01323 ms, percentile(99%) = 7.08138 ms
[12/15/2022-12:04:27] [I] Enqueue Time: min = 1.2683 ms, max = 2.39101 ms, mean = 1.40606 ms, median = 1.377 ms, percentile(90%) = 1.48555 ms, percentile(95%) = 1.55728 ms, percentile(99%) = 2.39101 ms
[12/15/2022-12:04:27] [I] H2D Latency: min = 4.79916 ms, max = 5.00237 ms, mean = 4.85155 ms, median = 4.84588 ms, percentile(90%) = 4.87872 ms, percentile(95%) = 4.94078 ms, percentile(99%) = 5.00237 ms
[12/15/2022-12:04:27] [I] GPU Compute Time: min = 1.87802 ms, max = 2.02445 ms, mean = 1.98437 ms, median = 1.98758 ms, percentile(90%) = 1.99373 ms, percentile(95%) = 1.99884 ms, percentile(99%) = 2.02445 ms
[12/15/2022-12:04:27] [I] D2H Latency: min = 0.0473633 ms, max = 0.0989113 ms, mean = 0.0885731 ms, median = 0.0892487 ms, percentile(90%) = 0.0930176 ms, percentile(95%) = 0.0947533 ms, percentile(99%) = 0.0989113 ms

we can see the ~9-11 ms includes the H2D latency, which alone costs quite a lot, about ~5-6 ms. I understand H2D latency to be data transfer between host and device, but on the tritonserver side the cache is already in device memory (because I set

parameters: { key: "FORCE_CPU_ONLY_INPUT_TENSORS"
              value: {string_value:"no"}}

in the Python model's config.pbtxt, and I use DLPack). So shouldn't the time cost inside the python-backend module be just the device compute time?
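(For context, that parameter sits at the top level of the Python model's config.pbtxt; the sketch below shows the placement. The model name and input are illustrative, not taken from this thread.)

```
name: "whisper_decoder_bls"        # hypothetical model name
backend: "python"
max_batch_size: 0

input [
  {
    name: "wd_tokens"              # illustrative input
    data_type: TYPE_INT32
    dims: [ 1, 1 ]
  }
]

# Let the Python backend receive input tensors in GPU memory
# instead of forcing a device-to-host copy first.
parameters: {
  key: "FORCE_CPU_ONLY_INPUT_TENSORS"
  value: { string_value: "no" }
}
```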

Tabrizian commented 1 year ago

Can you try running perf_analyzer on the TRT model directly and share the output? Is the 13 ms observed on all inferences, or only the first? There could be initial warmup time associated with the first few inferences, which could be skewing your results.
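(The warmup caveat is easy to check in the model's own timing code: collect per-call latencies and discard the first few before summarizing. A small illustrative sketch; the function name and sample values are made up, not from this thread.)

```python
import statistics

def steady_state_latency(latencies_ms, warmup=3):
    """Summarize latencies after dropping the first `warmup` samples,
    which may include one-off costs (lazy CUDA init, allocator growth)."""
    steady = latencies_ms[warmup:]
    return {
        "min": min(steady),
        "median": statistics.median(steady),
        "max": max(steady),
    }

# First calls are slow; only the steady-state tail is representative.
samples = [48.0, 21.5, 14.1, 13.1, 13.0, 13.2, 12.9]
print(steady_state_latency(samples))
```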

zxOnVacation commented 1 year ago

Oh, I just stopped using the kv-cache, since it's faster that way. Thanks for your reply.

yuanquderzi commented 1 year ago

> Oh, I just stopped using the kv-cache, since it's faster that way. Thanks for your reply.

How do you deploy the Whisper model with Triton?

zxOnVacation commented 1 year ago

I built it myself, layer by layer, using the API.

nlin5 commented 1 year ago

Which API did you use?