mit-han-lab / efficientvit

EfficientViT is a new family of efficient vision models for high-resolution vision tasks.
Apache License 2.0

About TensorRT Latency Measurement #52

Closed: johnyang-nv closed this issue 4 months ago

johnyang-nv commented 10 months ago

Hello,

I really appreciate the effort behind the paper and your novel method.

We tried to replicate your reported latency numbers on a Jetson Orin board, but our results differ substantially from those in the paper. With FP16 inference on the latest version of TensorRT, we measured:

- EfficientViT-B2: 15.84 ms (reported as 2.8 ms in the paper)
- EfficientViT-B3: 44.30 ms (reported as 4.4 ms in the paper)

Could you please share how you obtained the latencies of your models with TensorRT inference on the Orin platform?

johnyang-nv commented 8 months ago

Hello,

I'm just checking in to see if there have been any updates on the issue I raised previously. Your feedback or a status update would be greatly appreciated.

Thanks!

han-cai commented 8 months ago

Hi johnyang-nv,

We measured the latency on the Orin platform using the following command:

```
trtexec --separateProfileRun --iterations=100 --duration=0 --fp16 --onnx=b3_imagenet.onnx
```

ONNX file: b3_imagenet.onnx
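
For anyone reproducing the attachment, here is a minimal export sketch. The model below is a stand-in module so the snippet runs as-is; the actual EfficientViT-B3 classifier should be loaded per the repo's README, and the 224x224 input resolution is an assumption (match the resolution of the checkpoint you benchmark). Only `opset_version=11` is taken from the log below.

```python
import torch
import torch.nn as nn

# Stand-in module so this sketch runs end-to-end; replace it with the
# EfficientViT-B3 classifier loaded per the repo's README (hypothetical step).
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 1000),
).eval()

# Input resolution is an assumption; match the checkpoint you benchmark.
dummy = torch.randn(1, 3, 224, 224)

# opset_version=11 matches the "Opset version: 11" line in the attached log.
torch.onnx.export(model, dummy, "b3_imagenet.onnx", opset_version=11)
```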

Details are attached below:

```
[01/08/2024-15:39:39] [I] === Device Information ===
[01/08/2024-15:39:39] [I] Selected Device: Orin
[01/08/2024-15:39:39] [I] Compute Capability: 8.7
[01/08/2024-15:39:39] [I] SMs: 16
[01/08/2024-15:39:39] [I] Compute Clock Rate: 1.3 GHz
[01/08/2024-15:39:39] [I] Device Global Memory: 30592 MiB
[01/08/2024-15:39:39] [I] Shared Memory per SM: 164 KiB
[01/08/2024-15:39:39] [I] Memory Bus Width: 256 bits (ECC disabled)
[01/08/2024-15:39:39] [I] Memory Clock Rate: 1.3 GHz
[01/08/2024-15:39:39] [I]
[01/08/2024-15:39:39] [I] TensorRT version: 8.5.2
[01/08/2024-15:39:41] [I] [TRT] [MemUsageChange] Init CUDA: CPU +220, GPU +0, now: CPU 249, GPU 7625 (MiB)
[01/08/2024-15:39:42] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +302, GPU +285, now: CPU 574, GPU 7931 (MiB)
[01/08/2024-15:39:42] [I] Start parsing network model
[01/08/2024-15:39:42] [I] [TRT] ----------------------------------------------------------------
[01/08/2024-15:39:42] [I] [TRT] Input filename: b3_imagenet.onnx
[01/08/2024-15:39:42] [I] [TRT] ONNX IR version: 0.0.6
[01/08/2024-15:39:42] [I] [TRT] Opset version: 11
[01/08/2024-15:39:42] [I] [TRT] Producer name: pytorch
[01/08/2024-15:39:42] [I] [TRT] Producer version: 2.0.1
[01/08/2024-15:39:42] [I] [TRT] Domain:
[01/08/2024-15:39:42] [I] [TRT] Model version: 0
[01/08/2024-15:39:42] [I] [TRT] Doc string:
```

```
[01/08/2024-15:46:22] [I] === Trace details ===
[01/08/2024-15:46:22] [I] Trace averages of 10 runs:
[01/08/2024-15:46:22] [I] Average on 10 runs - GPU latency: 4.14433 ms - Host latency: 4.17995 ms (enqueue 2.04011 ms)
[01/08/2024-15:46:22] [I] Average on 10 runs - GPU latency: 4.14283 ms - Host latency: 4.17722 ms (enqueue 2.02904 ms)
[01/08/2024-15:46:22] [I] Average on 10 runs - GPU latency: 4.14048 ms - Host latency: 4.17951 ms (enqueue 1.99993 ms)
[01/08/2024-15:46:22] [I] Average on 10 runs - GPU latency: 4.14606 ms - Host latency: 4.18004 ms (enqueue 2.02749 ms)
[01/08/2024-15:46:22] [I] Average on 10 runs - GPU latency: 4.14613 ms - Host latency: 4.18188 ms (enqueue 2.02407 ms)
[01/08/2024-15:46:22] [I] Average on 10 runs - GPU latency: 4.14133 ms - Host latency: 4.17758 ms (enqueue 2.00409 ms)
[01/08/2024-15:46:22] [I] Average on 10 runs - GPU latency: 4.1424 ms - Host latency: 4.17748 ms (enqueue 2.03502 ms)
[01/08/2024-15:46:22] [I] Average on 10 runs - GPU latency: 4.14392 ms - Host latency: 4.17927 ms (enqueue 2.02808 ms)
[01/08/2024-15:46:22] [I] Average on 10 runs - GPU latency: 4.1422 ms - Host latency: 4.17714 ms (enqueue 2.01807 ms)
[01/08/2024-15:46:22] [I] Average on 10 runs - GPU latency: 4.1592 ms - Host latency: 4.19431 ms (enqueue 2.0354 ms)
[01/08/2024-15:46:22] [I]
[01/08/2024-15:46:22] [I] === Performance summary ===
[01/08/2024-15:46:22] [I] Throughput: 238.838 qps
[01/08/2024-15:46:22] [I] Latency: min = 4.14728 ms, max = 4.34027 ms, mean = 4.18044 ms, median = 4.1785 ms, percentile(90%) = 4.18988 ms, percentile(95%) = 4.19229 ms, percentile(99%) = 4.34027 ms
[01/08/2024-15:46:22] [I] Enqueue Time: min = 1.95618 ms, max = 2.06042 ms, mean = 2.02413 ms, median = 2.02873 ms, percentile(90%) = 2.04553 ms, percentile(95%) = 2.04871 ms, percentile(99%) = 2.06042 ms
[01/08/2024-15:46:22] [I] H2D Latency: min = 0.0269775 ms, max = 0.0407104 ms, mean = 0.0320795 ms, median = 0.0315399 ms, percentile(90%) = 0.0371094 ms, percentile(95%) = 0.0386047 ms, percentile(99%) = 0.0407104 ms
[01/08/2024-15:46:22] [I] GPU Compute Time: min = 4.11496 ms, max = 4.30731 ms, mean = 4.14489 ms, median = 4.14325 ms, percentile(90%) = 4.15424 ms, percentile(95%) = 4.1568 ms, percentile(99%) = 4.30731 ms
[01/08/2024-15:46:22] [I] D2H Latency: min = 0.00184631 ms, max = 0.00598145 ms, mean = 0.00347153 ms, median = 0.00289917 ms, percentile(90%) = 0.00527954 ms, percentile(95%) = 0.00567627 ms, percentile(99%) = 0.00598145 ms
[01/08/2024-15:46:22] [I] Total Host Walltime: 0.418694 s
[01/08/2024-15:46:22] [I] Total GPU Compute Time: 0.414489 s
[01/08/2024-15:46:22] [I] Explanations of the performance metrics are printed in the verbose logs.
[01/08/2024-15:46:22] [I] &&&& PASSED TensorRT.trtexec [TensorRT v8502] # trtexec --separateProfileRun --iterations=100 --duration=0 --fp16 --onnx=b3_imagenet.onnx
```
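
For convenience, a small parsing sketch (the helper name is hypothetical, and the regexes assume the trtexec log format shown above) to extract the headline numbers from a saved log:

```python
import re

def parse_trtexec_summary(path: str) -> dict:
    """Pull throughput and mean GPU compute time out of a saved trtexec log."""
    with open(path) as f:
        text = f.read()
    qps = float(re.search(r"Throughput: ([\d.]+) qps", text).group(1))
    gpu_mean_ms = float(
        re.search(
            r"GPU Compute Time: min = [\d.]+ ms, max = [\d.]+ ms, mean = ([\d.]+) ms",
            text,
        ).group(1)
    )
    return {"throughput_qps": qps, "gpu_mean_ms": gpu_mean_ms}

# With the log above: {'throughput_qps': 238.838, 'gpu_mean_ms': 4.14489}
```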

Hope this helps you find the problem.

Best,
Han

narsisn commented 4 months ago

Hello,

Thanks for your great work.

I was wondering if you could explain how you calculated the throughput. Thanks!

johnyang-nv commented 4 months ago

Install/build TensorRT on your target platform (e.g., Orin, RTX 3090) and run:

```
trtexec --separateProfileRun --iterations=100 --duration=0 --fp16 --onnx=b3_imagenet.onnx
```
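
As for how the throughput number itself comes about: the "Throughput" value in the trtexec summary is consistent with completed inferences divided by total host walltime. A quick arithmetic check using the numbers from the log attached earlier:

```python
# Numbers taken from the trtexec log in the earlier comment.
iterations = 100                 # --iterations=100 with --duration=0
total_host_walltime_s = 0.418694

throughput_qps = iterations / total_host_walltime_s
print(f"{throughput_qps:.3f} qps")  # ~238.838, matching the reported value
```

In other words, with a single stream and batch size 1, throughput here is effectively the inverse of the average end-to-end host time per query.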