ifzhang / ByteTrack

[ECCV 2022] ByteTrack: Multi-Object Tracking by Associating Every Detection Box

ONNX Model is slow #68

Open iamrajee opened 3 years ago

iamrajee commented 3 years ago

I'm getting very low fps when running the demo example given in the repository.

When I run python3 tools/demo_track.py video -f exps/example/mot/yolox_x_mix_det.py -c pretrained/bytetrack_x_mot17.pth.tar --fp16 --fuse --save_result, I get ~1 fps:

2021-11-11 16:43:00.515 | INFO     | __main__:main:298 - Args: Namespace(camid=0, ckpt='pretrained/bytetrack_x_mot17.pth.tar', conf=None, demo='video', device='gpu', exp_file='exps/example/mot/yolox_x_mix_det.py', experiment_name='yolox_x_mix_det', fp16=True, fuse=True, match_thresh=0.8, min_box_area=10, mot20=False, name=None, nms=None, path='./videos/palace.mp4', save_result=True, track_buffer=30, track_thresh=0.5, trt=False, tsize=None)
2021-11-11 16:43:01.281 | INFO     | __main__:main:308 - Model Summary: Params: 99.00M, Gflops: 791.73
2021-11-11 16:43:04.063 | INFO     | __main__:main:319 - loading checkpoint
2021-11-11 16:43:04.470 | INFO     | __main__:main:323 - loaded checkpoint done.
2021-11-11 16:43:04.471 | INFO     | __main__:main:326 -    Fusing model...
/home/rajendra/.local/lib/python3.8/site-packages/torch/nn/modules/module.py:561: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the gradient for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information.
  if param.grad is not None:
2021-11-11 16:43:05.146 | INFO     | __main__:imageflow_demo:240 - video save_path is ./YOLOX_outputs/yolox_x_mix_det/track_vis/2021_11_11_16_43_05/palace.mp4
2021-11-11 16:43:05.148 | INFO     | __main__:imageflow_demo:250 - Processing frame 0 (100000.00 fps)
2021-11-11 16:43:28.900 | INFO     | __main__:imageflow_demo:250 - Processing frame 20 (0.89 fps)
2021-11-11 16:43:52.675 | INFO     | __main__:imageflow_demo:250 - Processing frame 40 (0.89 fps)
2021-11-11 16:44:16.609 | INFO     | __main__:imageflow_demo:250 - Processing frame 60 (0.89 fps)
2021-11-11 16:44:40.757 | INFO     | __main__:imageflow_demo:250 - Processing frame 80 (0.89 fps)
2021-11-11 16:45:05.255 | INFO     | __main__:imageflow_demo:250 - Processing frame 100 (0.88 fps)
2021-11-11 16:45:29.943 | INFO     | __main__:imageflow_demo:250 - Processing frame 120 (0.88 fps)
2021-11-11 16:45:55.705 | INFO     | __main__:imageflow_demo:250 - Processing frame 140 (0.87 fps)
2021-11-11 16:46:21.886 | INFO     | __main__:imageflow_demo:250 - Processing frame 160 (0.86 fps)
2021-11-11 16:46:47.369 | INFO     | __main__:imageflow_demo:250 - Processing frame 180 (0.86 fps)
2021-11-11 16:47:13.220 | INFO     | __main__:imageflow_demo:250 - Processing frame 200 (0.85 fps)
2021-11-11 16:47:38.586 | INFO     | __main__:imageflow_demo:250 - Processing frame 220 (0.85 fps)
2021-11-11 16:48:04.182 | INFO     | __main__:imageflow_demo:250 - Processing frame 240 (0.85 fps)
2021-11-11 16:48:29.777 | INFO     | __main__:imageflow_demo:250 - Processing frame 260 (0.85 fps)
2021-11-11 16:48:55.547 | INFO     | __main__:imageflow_demo:250 - Processing frame 280 (0.85 fps)
2021-11-11 16:49:21.322 | INFO     | __main__:imageflow_demo:250 - Processing frame 300 (0.84 fps)
2021-11-11 16:49:47.657 | INFO     | __main__:imageflow_demo:250 - Processing frame 320 (0.84 fps)

I then tried the ONNX model, which slowed the pipeline down even further. I converted the model to ONNX with
python3 tools/export_onnx.py --output-name bytetrack_x.onnx -f exps/example/mot/yolox_x_mix_det.py -c pretrained/bytetrack_x_mot17.pth.tar

and then ran inference with
python3 onnx_inference.py --model ../../bytetrack_x.onnx --input_shape "800,1440"

I get ~0.32 fps:

2021-11-11 18:17:08.041 | INFO     | __main__:imageflow_demo:117 - video save_path is demo_output/palace.mp4
2021-11-11 18:17:08.042 | INFO     | __main__:imageflow_demo:127 - Processing frame 0 (100000.00 fps)
2021-11-11 18:18:15.912 | INFO     | __main__:imageflow_demo:127 - Processing frame 20 (0.30 fps)
2021-11-11 18:19:17.755 | INFO     | __main__:imageflow_demo:127 - Processing frame 40 (0.32 fps)
2021-11-11 18:20:20.536 | INFO     | __main__:imageflow_demo:127 - Processing frame 60 (0.32 fps)
2021-11-11 18:21:22.862 | INFO     | __main__:imageflow_demo:127 - Processing frame 80 (0.32 fps)
2021-11-11 18:22:32.452 | INFO     | __main__:imageflow_demo:127 - Processing frame 100 (0.32 fps)
2021-11-11 18:23:42.442 | INFO     | __main__:imageflow_demo:127 - Processing frame 120 (0.31 fps)
2021-11-11 18:24:49.624 | INFO     | __main__:imageflow_demo:127 - Processing frame 140 (0.31 fps)
2021-11-11 18:26:11.783 | INFO     | __main__:imageflow_demo:127 - Processing frame 160 (0.30 fps)
2021-11-11 18:27:25.753 | INFO     | __main__:imageflow_demo:127 - Processing frame 180 (0.30 fps)
2021-11-11 18:28:36.284 | INFO     | __main__:imageflow_demo:127 - Processing frame 200 (0.30 fps)
2021-11-11 18:29:47.741 | INFO     | __main__:imageflow_demo:127 - Processing frame 220 (0.30 fps)
2021-11-11 18:31:05.641 | INFO     | __main__:imageflow_demo:127 - Processing frame 240 (0.29 fps)

Any thoughts on how I can get real-time (~15 fps) performance? @AhmedKhaled945 Update: I found a similar issue here: #42
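
To narrow down where the time goes, a minimal timing sketch like the one below can measure the ONNX model alone, separate from video decoding, preprocessing, and tracking. It assumes the bytetrack_x.onnx exported above and an NCHW 1x3x800x1440 input; adjust the path and shape if your export differs.

import time

import numpy as np
import onnxruntime as ort

# Rough, standalone timing of the ONNX model only, to separate model
# inference time from video decoding, preprocessing, and tracking overhead.
# Assumes the bytetrack_x.onnx exported above with a 1x3x800x1440 input.
sess = ort.InferenceSession("bytetrack_x.onnx")
input_name = sess.get_inputs()[0].name
dummy = np.random.rand(1, 3, 800, 1440).astype(np.float32)

# Warm-up run, then time a few iterations.
sess.run(None, {input_name: dummy})
n_runs = 10
start = time.time()
for _ in range(n_runs):
    sess.run(None, {input_name: dummy})
print(f"ONNX inference only: {n_runs / (time.time() - start):.2f} fps")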

iamrajee commented 3 years ago

I can confirm that on my system other Darknet YOLO versions run much faster using CUDA and cuDNN. Although device='gpu' is set, I still get low fps. Could it be that the YOLOX model is unable to utilize the system GPU to its full potential? Can cuDNN be used to make it faster in any way? Any suggestions on which parameters I could change to use the full potential of the GPU, CUDA, and cuDNN and make ByteTrack faster with the given YOLOX model? Thank you :)
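
A quick way to sanity-check this, assuming nothing beyond a standard PyTorch install, is to confirm that PyTorch itself sees the GPU and cuDNN before touching the demo script:

import torch

# Basic visibility check: if this prints False, the demo silently falls
# back to CPU, and ~1 fps for YOLOX-X would not be surprising.
print("CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("CUDA version (PyTorch built with):", torch.version.cuda)
    print("cuDNN enabled:", torch.backends.cudnn.enabled)
    print("cuDNN version:", torch.backends.cudnn.version())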

AhmedKhaled945 commented 2 years ago

You can try the same methodology on Google Colab, following the same procedure as on your local machine. If the FPS there is reasonable, then it is either a build problem or a GPU-utilization problem that is system related, perhaps some bindings or sources are missing, so I would check against a more trusted environment in terms of builds, like Colab.

soumajm commented 2 years ago

Can you please check the output of the following?

import onnxruntime as ort
print(ort.get_device())

If you have a GPU but inference is still running on the CPU, it is likely because your onnxruntime version is incompatible with your CUDA and cuDNN versions. Please check the compatibility matrix linked in the ONNX Runtime documentation.
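
As a related sketch, assuming onnxruntime-gpu is installed and using the bytetrack_x.onnx path from earlier in this thread, you can also list the available execution providers and request the CUDA provider explicitly when building the session:

import onnxruntime as ort

# List the providers this onnxruntime build can use; "CUDAExecutionProvider"
# must appear here for GPU inference (it is only present with onnxruntime-gpu).
print(ort.get_available_providers())

# Request the CUDA provider explicitly, falling back to CPU if unavailable.
# Path assumes the bytetrack_x.onnx exported earlier in this thread.
sess = ort.InferenceSession(
    "bytetrack_x.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Confirm which providers the session actually ended up with.
print(sess.get_providers())

If CUDAExecutionProvider is missing from the first list, the session will silently fall back to CPU, which would match the ~0.3 fps numbers reported above.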