ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Why is there a memory overflow when the total gpu memory is 12g and each batch uses 4.5g #13395

Open heyxhh opened 3 weeks ago

heyxhh commented 3 weeks ago

Search before asking

Question

Hello, I tried to train YOLOv5 with the config "imgsz=640, batch_size=16". The first two training epochs ran without problems, but a GPU out-of-memory error occurred while loading data in the third epoch. The total GPU memory is 12 GB and each batch uses only about 4.5 GB. What could be the reason for this? The details of the problem can be seen in the screenshot.

[screenshot]

Additional

No response

UltralyticsAssistant commented 3 weeks ago

👋 Hello @heyxhh, thank you for reaching out and sharing your question about YOLOv5 🚀!

GPU memory overflow can be a tricky issue to debug. To help us better understand and assist you, could you please provide a minimum reproducible example? This would include any configurations you are using and, if possible, a snippet of your code. Also, let us know if there are any specific error messages you are encountering, as shown in your screenshot.

In the meantime, you might want to verify that your environment meets the requirements. YOLOv5 is optimized to run in a variety of environments, and keeping your CUDA drivers and PyTorch installation up to date can sometimes resolve memory issues.
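A quick way to capture the environment details requested above is a short version-check snippet (a minimal sketch, not YOLOv5-specific code):

```python
import torch

# Report the versions and hardware that matter for GPU memory issues
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, {props.total_memory / 1024**3:.1f} GiB total")
```

Pasting this output into the issue alongside the traceback makes it much easier to rule out driver or version mismatches.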

Remember, this is an automated response 🤖, and an Ultralytics engineer will review your issue soon to provide more personalized assistance. Thank you for your patience and understanding!

pderrenger commented 1 week ago

@heyxhh it sounds like there might be a memory leak, i.e. tensors gradually accumulating in GPU memory across epochs. Please ensure you're using the latest version of YOLOv5 and PyTorch, as updates often include important fixes. Additionally, try reducing the batch size or using gradient checkpointing to manage memory usage more effectively.
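One common source of this kind of gradual growth in any PyTorch training loop is accumulating a loss tensor, with its autograd graph still attached, instead of a plain Python number. A minimal CPU-runnable sketch of the pattern and its fix (the model and data below are hypothetical placeholders, not YOLOv5 code):

```python
import torch

# Toy model and optimizer standing in for a real training setup
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

running_loss = 0.0
for step in range(5):
    x = torch.randn(8, 10)
    y = torch.randn(8, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Leak pattern: `running_loss += loss` would keep every step's
    # autograd graph alive, so memory grows each iteration.
    # `.item()` stores only a detached Python float instead.
    running_loss += loss.item()

print(f"mean loss: {running_loss / 5:.4f}")
```

If memory climbs epoch over epoch rather than step over step, it is also worth checking custom callbacks or logging code that stores tensors in a list without calling `.detach()` or `.item()`.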