lhwcv / mlsd_pytorch

PyTorch implementation of "M-LSD: Towards Light-weight and Real-time Line Segment Detection"
Apache License 2.0

Is there any way to reduce the GPU memory usage and enhance the inference speed? #19

Open · JinraeKim opened this issue 2 years ago

JinraeKim commented 2 years ago

M-LSD's pred_lines takes longer than I expected: about 6 Hz including other processing (M-LSD-tiny alone only seems to reach about 10 Hz).

It also takes about 2 GB of GPU memory.

Is there a way to reduce the GPU memory usage and improve the inference speed (with TensorRT, etc.)?

Please give me some advice, as I'm not an expert in this area.

Thanks!
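Before any conversion, two quick wins in plain PyTorch are worth checking: run under `torch.inference_mode()` and cast the model to FP16. A minimal sketch, assuming the model class and checkpoint name shipped with this repo and the ported 4-channel input convention (RGB plus a ones channel):

```python
# Minimal latency/memory sketch; class, checkpoint, and input layout are assumptions.
import torch
from models.mbv2_mlsd_tiny import MobileV2_MLSD_Tiny

model = MobileV2_MLSD_Tiny().cuda().eval()
model.load_state_dict(torch.load('models/mlsd_tiny_512_fp32.pth',
                                 map_location='cuda'), strict=True)
model.half()  # FP16 weights and activations roughly halve memory use

# Assumed input: the ported weights take RGB plus an extra ones channel
# (4 channels); models trained with this repo's training code use 3.
x = torch.randn(1, 4, 512, 512, device='cuda', dtype=torch.half)

with torch.inference_mode():  # no autograd bookkeeping: less memory, lower latency
    out = model(x)
torch.cuda.synchronize()
```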

lhwcv commented 2 years ago

You can try the TensorRT version by @rhysdg: https://github.com/lhwcv/mlsd_pytorch#benchmarks
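For context, the usual route is PyTorch to ONNX to a TensorRT engine. A hedged export sketch (model class, checkpoint name, and the 4-channel input convention are assumptions based on the repo layout):

```python
# Step 1 of a TensorRT conversion: export the model to ONNX.
import torch
from models.mbv2_mlsd_large import MobileV2_MLSD_Large

model = MobileV2_MLSD_Large().eval()
model.load_state_dict(torch.load('models/mlsd_large_512_fp32.pth',
                                 map_location='cpu'))

dummy = torch.randn(1, 4, 512, 512)  # assumed layout; 3 channels for custom-trained models
torch.onnx.export(model, dummy, 'mlsd_large_512.onnx',
                  input_names=['input'], output_names=['output'],
                  opset_version=11)
```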

JinraeKim commented 2 years ago

> You can try the TensorRT version by @rhysdg: https://github.com/lhwcv/mlsd_pytorch#benchmarks

Thanks for sharing the link.

I'm not familiar with it. Would TensorRT reduce the memory usage and improve the inference speed at the same time?

rhysdg commented 1 year ago

@JinraeKim @lhwcv Apologies for the late reply, busy times! For sure, the main aim with TensorRT is to reduce latency, and therefore to increase inference speed pretty significantly with minimal reduction in quality at FP16. Given a successful conversion you should also see a significant reduction in memory allocation overhead.
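To make the FP16 point concrete, here is a sketch of building an FP16 engine from an exported ONNX file with the TensorRT Python API (file names are the assumed ones from the export sketch above; the exact API surface varies a little across TensorRT versions):

```python
# Step 2: parse the ONNX file and build a serialized FP16 TensorRT engine.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open('mlsd_large_512.onnx', 'rb') as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError('ONNX parse failed')

config = builder.create_builder_config()
if builder.platform_has_fast_fp16:
    config.set_flag(trt.BuilderFlag.FP16)  # FP16 kernels: lower latency, smaller engine
config.max_workspace_size = 1 << 30       # 1 GiB of build scratch space

engine = builder.build_engine(network, config)
with open('mlsd_large_512_fp16.plan', 'wb') as f:
    f.write(engine.serialize())
```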

It's worth bearing in mind that the setup I have here was developed for Jetson series devices, although my understanding is that it plays nicely with Nvidia's NGC PyTorch docker container. I'm hoping to start bringing in a TensorRT Python API / PyCuda version shortly that should work across a wider range of devices. What were you hoping to deploy with, @JinraeKim?
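For readers who want to try this before that lands, a rough sketch of such a TensorRT Python API / PyCuda runner (engine file name assumed from the build sketch above; a single input binding and a single output binding are assumed):

```python
# Deserialize an engine and run one inference with PyCuda-managed buffers.
import numpy as np
import pycuda.autoinit  # noqa: F401  (creates the CUDA context)
import pycuda.driver as cuda
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open('mlsd_large_512_fp16.plan', 'rb') as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# Page-locked host buffer plus device buffer for each binding.
buffers = []
for binding in engine:
    shape = engine.get_binding_shape(binding)
    dtype = trt.nptype(engine.get_binding_dtype(binding))
    host = cuda.pagelocked_empty(trt.volume(shape), dtype)
    buffers.append((host, cuda.mem_alloc(host.nbytes)))

(h_in, d_in), (h_out, d_out) = buffers  # assumes exactly one input, one output
h_in[:] = np.random.rand(h_in.size).astype(h_in.dtype)  # stand-in for a real image

stream = cuda.Stream()
cuda.memcpy_htod_async(d_in, h_in, stream)
context.execute_async_v2([int(d_in), int(d_out)], stream.handle)
cuda.memcpy_dtoh_async(h_out, d_out, stream)
stream.synchronize()
```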

JinraeKim commented 1 year ago

@rhysdg Thank you for the detailed explanation! Yeah, I'm looking to deploy on Nvidia Jetson as well, plus my personal laptops for practice.

That gave me some really nice insight! Thank you again!

rhysdg commented 1 year ago

@JinraeKim I'm working on a more robust tool over at trt-devel that adds the ability to convert custom trained models with three-channel inputs as per the training code, and drops the result into a folder named according to the experiment. This will eventually become a PR, but I'm hoping to do a little more testing with the ONNX conversion when I get a chance. For now the tool works if you need it for a custom training run, and I can confirm that the results are fantastic with @lhwcv's training script plus some added aggressive pixel-level augmentations!

After that's done I'll work on a straight TensorRT conversion tool with wider device support, and also post-training quantization for the ONNX representation!
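As a taste of what ONNX post-training quantization looks like, onnxruntime ships a one-call dynamic quantizer (file names assumed; static quantization with a calibration set usually suits convolutional models better, but it needs more setup):

```python
# Simplest ONNX post-training quantization: 8-bit weights via onnxruntime.
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic('mlsd_large_512.onnx',        # FP32 model from the export step
                 'mlsd_large_512_int8.onnx',   # quantized output
                 weight_type=QuantType.QInt8)
```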

rhysdg commented 1 year ago

Ah yes, and I've yet to update the documentation accordingly, but adding the `--custom experiment.pth` arg with your checkpoint dropped into `./models/experiment.pth` will result in a sped-up representation at `./models/experiment/mlsd_large/tiny__512_trtfp16.pth`.