ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

GPU memory can not be fully used #2101

Closed: lsd1994 closed this issue 3 years ago

lsd1994 commented 3 years ago

I am training on my custom dataset with an RTX 2070 (8 GB). The images are 1920 x 1080. When I use:

python train.py --data data.yaml --hyp v5s_hyp.yaml --cfg yolov5s.yaml --weights yolov5s.pt --img-size 1664 --rect --batch-size 2

it runs normally and uses 2.69 GB of GPU memory during training. But when I increase the image size to 1696:

python train.py --data data.yaml --hyp v5s_hyp.yaml --cfg yolov5s.yaml --weights yolov5s.pt --img-size 1696 --rect --batch-size 2

the result is:

CUDA out of memory. Tried to allocate 2.96 GiB (GPU 0; 8.00 GiB total capacity; 1.34 GiB already allocated; 3.58 GiB free; 2.75 GiB reserved in total by PyTorch)

So I want to know why GPU memory usage differs so much between image sizes 1664 and 1696. Thanks.
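For reference, the two image sizes differ very little in pixel count. The sketch below estimates the padded input shape for a 1920 x 1080 image under --rect. It assumes the long side is scaled to --img-size and both sides are rounded up to a multiple of the model stride (32); the helper name rect_shape is hypothetical and this is not the repository's actual code.

```python
def rect_shape(img_w, img_h, img_size, stride=32):
    """Estimate the padded (height, width) for rectangular training.

    Assumption: the long side is scaled to img_size and each side is
    rounded up to the next multiple of stride. Integer arithmetic is
    used to avoid floating-point rounding surprises.
    """
    long_side = max(img_w, img_h)
    # Ceiling division: number of stride-sized units needed per side.
    h_units = -(-(img_h * img_size) // (long_side * stride))
    w_units = -(-(img_w * img_size) // (long_side * stride))
    return h_units * stride, w_units * stride


shapes = {size: rect_shape(1920, 1080, size) for size in (1664, 1696)}
pixels = {size: h * w for size, (h, w) in shapes.items()}

for size in shapes:
    print(size, shapes[size], pixels[size])

# Relative growth in pixels per image when going from 1664 to 1696.
growth = pixels[1696] / pixels[1664] - 1
print(f"pixel growth: {growth:.1%}")
```

Under these assumptions both sizes pad to a height of 960, and the larger size has only about 2% more pixels per image, so the input size alone would not be expected to cause a large jump in steady-state memory.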

github-actions[bot] commented 3 years ago

👋 Hello @lsd1994, thank you for your interest in 🚀 YOLOv5! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://www.ultralytics.com or email Glenn Jocher at glenn.jocher@ultralytics.com.

Requirements

Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.7. To install run:

$ pip install -r requirements.txt

wudashuo commented 3 years ago

GPU usage is not constant; it changes during the training process. From what I have observed, GPU usage is usually higher during pre-processing. Maybe training itself doesn't use that much memory, and it's the processing of the high-resolution images that occupies a lot of GPU memory? BTW, I wonder why you set img-size so high with a batch-size of only 2? It would be very slow. If you want a higher mAP, maybe you can use yolov5m/l/x rather than raising the img-size while using yolov5s.

lsd1994 commented 3 years ago

@wudashuo Hi, thanks for the quick reply.

> GPU usage is not constant; it changes during the training process. From what I have observed, GPU usage is usually higher during pre-processing. Maybe training itself doesn't use that much memory, and it's the processing of the high-resolution images that occupies a lot of GPU memory?

I see. When I check GPU-Z, GPU memory usage is highest during pre-processing and then drops back to normal.

> BTW, I wonder why you set img-size so high with a batch-size of only 2? It would be very slow. If you want a higher mAP, maybe you can use yolov5m/l/x rather than raising the img-size while using yolov5s.

Thank you for the advice. For now I am just testing different models on my computer.

lsd1994 commented 3 years ago

It seems to be a PyCharm problem. I had been running the command in PyCharm, and after restarting PyCharm I can run:

python train.py --data data.yaml --hyp v5s_hyp.yaml --cfg yolov5s.yaml --weights yolov5s.pt --img-size 1920 --rect --batch-size 2

and it uses 3.5 GB of GPU memory during training.
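The numbers in the out-of-memory message from the first post are consistent with something outside the training process holding GPU memory. As a rough sketch, the snippet below parses that message and subtracts the reported free memory and PyTorch's reserved memory from the total capacity. It assumes that "free" is the device-wide free memory and "reserved" is what this process's PyTorch allocator holds; the helper name parse_oom is hypothetical.

```python
import re

# Error text copied from the first post in this thread.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 2.96 GiB (GPU 0; 8.00 GiB total "
    "capacity; 1.34 GiB already allocated; 3.58 GiB free; 2.75 GiB reserved "
    "in total by PyTorch)"
)


def parse_oom(message):
    """Extract the GiB figures from a PyTorch CUDA OOM message of this form."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) GiB",
        "total": r"([\d.]+) GiB total capacity",
        "allocated": r"([\d.]+) GiB already allocated",
        "free": r"([\d.]+) GiB free",
        "reserved": r"([\d.]+) GiB reserved",
    }
    return {
        name: float(re.search(pattern, message).group(1))
        for name, pattern in patterns.items()
    }


stats = parse_oom(OOM_MESSAGE)

# Memory that is neither free nor reserved by this process's allocator.
# Assumption: this is held by the CUDA context, the display, or other processes.
outside = round(stats["total"] - stats["free"] - stats["reserved"], 2)

print(stats)
print("GiB held outside this PyTorch allocator:", outside)
```

Under these assumptions about 1.67 GiB was in use outside the training process's allocator, and the failed request (2.96 GiB) was smaller than the reported free memory (3.58 GiB). That would fit memory being held or fragmented by something else on the GPU, possibly a leftover process started from the IDE, though the thread does not confirm the exact cause.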