wang-xinyu / tensorrtx

Implementation of popular deep learning networks with TensorRT network definition API
MIT License

GPU memory usage increases #689

Closed JonyJiang123 closed 2 years ago

JonyJiang123 commented 3 years ago

Env

Yolov5

When I tried to accelerate yolov5 with TensorRT, these were my steps:

  1. Converted my own trained best.pt to best.wts
  2. Generated the .engine file with sudo ./yolov5 -s yolov5s.wts yolov5s.engine s
  3. Tested with yolov5_trt.py. It runs correctly and FPS roughly doubles from 50 to 100, but GPU memory usage more than doubles. Before TensorRT: image After TensorRT: image

Xinyu, could you please take a look?

wang-xinyu commented 3 years ago

Try ./yolov5 -d instead of .py.

JonyJiang123 commented 3 years ago

I just tried ./yolov5 -d and GPU memory is indeed only about 1 GB. Why does trt.py use twice as much?

wang-xinyu commented 3 years ago

You can try to remove torchvision.nms in .py.
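For reference, one way to try this is to replace the torchvision.ops.nms call in yolov5_trt.py with a CPU-only NMS so that torch/torchvision never have to initialize CUDA. A minimal NumPy sketch, not code from the repo; the function name and default threshold are placeholders:

import numpy as np

def nms_numpy(boxes, scores, iou_thres=0.45):
    # boxes: (N, 4) array of x1, y1, x2, y2; scores: (N,)
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]                 # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # intersection of the highest-scoring box with the remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # drop boxes that overlap the kept box too much
        order = order[np.where(iou <= iou_thres)[0] + 1]
    return np.array(keep, dtype=np.int64)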

JonyJiang123 commented 3 years ago

Just tried it; same as before, it still uses 4 GB of GPU memory.

JonyJiang123 commented 3 years ago

GPU memory usage at this point in the code:

import pycuda.driver as cuda
import tensorrt as trt

def __init__(self, engine_file_path):
    self.ctx = cuda.Device(0).make_context()   # create a dedicated CUDA context on GPU 0
    stream = cuda.Stream()
    TRT_LOGGER = trt.Logger(trt.Logger.INFO)
    runtime = trt.Runtime(TRT_LOGGER)
    with open(engine_file_path, "rb") as f:
        engine_data = f.read()                 # read the serialized engine from disk

Memory usage has already gone up right after this loads: image
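To narrow down which of these lines actually grabs device memory, one option (a rough sketch, not part of the repo; the engine path below is only an example) is to query pycuda's mem_get_info() after each step:

import pycuda.driver as cuda
import tensorrt as trt

cuda.init()

def report(label):
    free, total = cuda.mem_get_info()              # bytes of free / total device memory
    print(f"{label}: {(total - free) / 2**20:.0f} MiB in use")

ctx = cuda.Device(0).make_context()                # creating the context itself reserves device memory
report("after make_context")

runtime = trt.Runtime(trt.Logger(trt.Logger.INFO))
report("after trt.Runtime")

with open("yolov5s.engine", "rb") as f:            # example path, use your own engine file
    engine = runtime.deserialize_cuda_engine(f.read())
report("after deserialize_cuda_engine")

context = engine.create_execution_context()        # activation buffers are allocated here
report("after create_execution_context")

ctx.pop()                                          # release the context when finished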

wang-xinyu commented 3 years ago

You can try tensorrt 8.0.

JonyJiang123 commented 3 years ago

Tested TensorRT 8.0.1.6; the problem is exactly the same. GPU memory without TensorRT: 1500 MB; with ./yolov5 -d: 1100 MB; with yolov5_trt.py: 4300 MB.

wang-xinyu commented 3 years ago

No idea. Many others see almost the same memory cost in C++ and Python. You can try another machine or open a thread on NVIDIA DevTalk.

MTDzi commented 3 years ago

I have the exact same issue, but on a Jetson Nano 4GB, which simply doesn't have enough memory to run the yolov5_trt.py script. I didn't even notice the problem on a different machine since it has much more memory; maybe that's why no one reported this issue before?

@wang-xinyu a penny for your thoughts: might this have something to do with the .so plugin used here? I just noticed that the RetinaNet example also uses a plugin (here), but currently I don't have the option of checking whether it eats up as much memory as well.

Question: might this be caused by ctypes loading the whole shared library, with a lot of redundant sub-libraries, which the C++ version does not do?
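One way to test that hypothesis is to load only the plugin library with ctypes in an otherwise empty Python process and compare GPU memory before and after via nvidia-smi. A rough sketch; the libmyplugins.so path follows the usual tensorrtx build layout and may differ on your machine:

import ctypes
import subprocess

def used_mib():
    # Total GPU memory in use, in MiB, as reported by nvidia-smi
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"])
    return int(out.decode().splitlines()[0])

print("before ctypes.CDLL:", used_mib(), "MiB")
ctypes.CDLL("build/libmyplugins.so")   # plugin path from the build step; adjust if yours differs
print("after  ctypes.CDLL:", used_mib(), "MiB")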

wang-xinyu commented 3 years ago

@MTDzi You can verify your thoughts by removing yololayer plugin in createEngine() in yolov5.cpp.

MTDzi commented 3 years ago

@wang-xinyu I wasn't sure how to do that (I couldn't find a createEngine function; I assumed you meant build_engine), so this is my attempt:

I replaced the following in yolov5.cpp:

auto yolo = addYoLoLayer(network, weightMap, det0, det1, det2);
yolo->getOutput(0)->setName(OUTPUT_BLOB_NAME);
network->markOutput(*yolo->getOutput(0));

with

network->markOutput(*det0->getOutput(0));

and built the .engine file, then commented out the following line in yolov5_trt.py:

# ctypes.CDLL(PLUGIN_LIBRARY)

and ran python yolov5_trt.py.

But the memory consumption is the same: I'm looking at an increase from 1.5GB to 3.5GB (so that's ~2GB worth of RAM).

When running the C++ version I get an increase of ~800MB.


If this is not what you were asking for, please give me a hint.

wang-xinyu commented 3 years ago

That's right, removing the yololayer is what I meant.

So it seems the issue is not from the plugin.

I think the TensorRT Python bindings or other Python packages are probably causing the memory increase.
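A quick way to check that is to print the GPU memory in use after each import in a fresh interpreter and see which one makes it jump. A minimal sketch of that experiment (not from the repo):

import subprocess

def used_mib():
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"])
    return int(out.decode().splitlines()[0])

print("baseline              :", used_mib(), "MiB")

import tensorrt                      # noqa: E402
print("after import tensorrt :", used_mib(), "MiB")

import pycuda.autoinit               # noqa: E402,F401  creates a CUDA context on import
print("after pycuda.autoinit :", used_mib(), "MiB")

import torch                         # noqa: E402  only relevant if the script still uses torchvision.ops.nms
torch.zeros(1).cuda()                # forces torch to create its own CUDA context
print("after torch CUDA init :", used_mib(), "MiB")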

MTDzi commented 3 years ago

You might be right, I found this thread on NVIDIA's forum. Also, this is an interesting lead where they suggest using cgroups to limit the amount of available memory, which would, I guess, force the CUDA runtime to free up some of it.

And, well, that would explain why the C++ version needs roughly 800MB. But what I don't get is why the Python version needs so much more.

I'll give the cgroups approach a try and let you know how it went.

MTDzi commented 3 years ago

Follow-up: the cgroups solution didn't help; I see only slightly lower memory consumption (3.3GB).

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

tufuwq commented 4 months ago

I'm running into this problem too. Using multiprocessing to run the inference in a separate process may solve it.
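As an illustration of that idea, here is a minimal sketch (not the repo's code) that runs the TensorRT part in a child process so that all GPU memory it holds is released when the process exits. The YoLov5TRT class name, the infer()/destroy() calls, and the file paths are assumptions based on yolov5_trt.py and may differ in your version:

import multiprocessing as mp

def infer_worker(engine_path, image_paths, result_queue):
    # Import CUDA/TensorRT only inside the child so the parent never touches the GPU
    from yolov5_trt import YoLov5TRT           # class name assumed from yolov5_trt.py
    wrapper = YoLov5TRT(engine_path)
    try:
        for path in image_paths:
            result_queue.put(wrapper.infer([path]))   # assumed call signature
    finally:
        wrapper.destroy()                      # assumed to pop the CUDA context created in __init__

if __name__ == "__main__":
    mp.set_start_method("spawn")               # fresh interpreter, no inherited CUDA state
    queue = mp.Queue()
    proc = mp.Process(target=infer_worker,
                      args=("build/yolov5s.engine", ["zidane.jpg"], queue))
    proc.start()
    print(queue.get())                         # collect the result from the child
    proc.join()                                # GPU memory is freed once the child exits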