ultralytics / ultralytics

CUDA out of memory when training #784

Closed PureHing closed 1 year ago

PureHing commented 1 year ago

Search before asking

Question

Hi, when training any YOLOv5su model, GPU memory usage keeps climbing as training continues.

Command / Train Log:

```
$ yolo task=detect mode=train model=yolov5su.pt data=./data/mydata.yaml batch=-1 epochs=100 imgsz=960 device='3' rect=true
Ultralytics YOLOv8.0.25 🚀 Python-3.10.9 torch-1.13.1+cu117 CUDA:3 (NVIDIA GeForce RTX 3080 Ti, 12045MiB)
yolo/engine/trainer: task=detect, mode=train, model=yolov5su.pt, data=./data/crowdhuman.yaml, epochs=100, patience=50, batch=-1, imgsz=960, save=True, cache=False, device=3, workers=8, project=None, name=None, exist_ok=False, pretrained=False, optimizer=SGD, verbose=True, seed=0, deterministic=True, single_cls=False, image_weights=False, rect=True, cos_lr=False, close_mosaic=10, resume=False, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, show=False, save_txt=False, save_conf=False, save_crop=False, hide_labels=False, hide_conf=False, vid_stride=1, line_thickness=3, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, boxes=True, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=False, opset=None, workspace=4, nms=False, lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, fl_gamma=0.0, label_smoothing=0.0, nbs=64, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0, cfg=None, v5loader=False, save_dir=runs/detect/train4
Overriding model.yaml nc=80 with nc=2

                 from  n    params  module                                  arguments
  0                -1  1      3520  ultralytics.nn.modules.Conv             [3, 32, 6, 2, 2]
  1                -1  1     18560  ultralytics.nn.modules.Conv             [32, 64, 3, 2]
  2                -1  1     18816  ultralytics.nn.modules.C3               [64, 64, 1]
  3                -1  1     73984  ultralytics.nn.modules.Conv             [64, 128, 3, 2]
  4                -1  2    115712  ultralytics.nn.modules.C3               [128, 128, 2]
  5                -1  1    295424  ultralytics.nn.modules.Conv             [128, 256, 3, 2]
  6                -1  3    625152  ultralytics.nn.modules.C3               [256, 256, 3]
  7                -1  1   1180672  ultralytics.nn.modules.Conv             [256, 512, 3, 2]
  8                -1  1   1182720  ultralytics.nn.modules.C3               [512, 512, 1]
  9                -1  1    656896  ultralytics.nn.modules.SPPF             [512, 512, 5]
 10                -1  1    131584  ultralytics.nn.modules.Conv             [512, 256, 1, 1]
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 12           [-1, 6]  1         0  ultralytics.nn.modules.Concat           [1]
 13                -1  1    361984  ultralytics.nn.modules.C3               [512, 256, 1, False]
 14                -1  1     33024  ultralytics.nn.modules.Conv             [256, 128, 1, 1]
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 16           [-1, 4]  1         0  ultralytics.nn.modules.Concat           [1]
 17                -1  1     90880  ultralytics.nn.modules.C3               [256, 128, 1, False]
 18                -1  1    147712  ultralytics.nn.modules.Conv             [128, 128, 3, 2]
 19          [-1, 14]  1         0  ultralytics.nn.modules.Concat           [1]
 20                -1  1    296448  ultralytics.nn.modules.C3               [256, 256, 1, False]
 21                -1  1    590336  ultralytics.nn.modules.Conv             [256, 256, 3, 2]
 22          [-1, 10]  1         0  ultralytics.nn.modules.Concat           [1]
 23                -1  1   1182720  ultralytics.nn.modules.C3               [512, 512, 1, False]
 24      [17, 20, 23]  1   2116822  ultralytics.nn.modules.Detect           [2, [128, 256, 512]]
YOLOv5s summary: 262 layers, 9122966 parameters, 9122950 gradients, 24.0 GFLOPs

Transferred 421/427 items from pretrained weights
AutoBatch: Computing optimal batch size for imgsz=960
AutoBatch: CUDA:0 (NVIDIA GeForce RTX 3080 Ti) 11.76G total, 0.07G reserved, 0.07G allocated, 11.62G free
      Params      GFLOPs  GPU_mem (GB)  forward (ms)  backward (ms)  input             output
     9122966        54.1         0.698         184.3         35.74   (1, 3, 960, 960)  list
     9122966       108.2         1.168         68.04         34.16   (2, 3, 960, 960)  list
     9122966       216.4         2.017         51.43         41.66   (4, 3, 960, 960)  list
     9122966       432.8         3.660         116.2         37.55   (8, 3, 960, 960)  list
     9122966       865.6         7.403         97.85         75.22   (16, 3, 960, 960) list
AutoBatch: Using batch-size 17 for CUDA:0 7.93G/11.76G (67%) ✅
optimizer: SGD(lr=0.01) with parameter groups 69 weight(decay=0.0), 76 weight(decay=0.00053125), 75 bias
WARNING ⚠️ 'rect=True' is incompatible with DataLoader shuffle, setting shuffle=False
......
Image sizes 960 train, 960 val
Using 8 dataloader workers
Logging results to runs/detect/train4
Starting training for 100 epochs...

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
      1/100       8.6G      1.759      3.058      1.441       1165        960:   2%|▏ | 18/883 [00:29<23:24, 1.62s/it]
Traceback (most recent call last):
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.12 GiB (GPU 0; 11.76 GiB total capacity; 3.92 GiB already allocated; 1.19 GiB free; 7.10 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Sentry is attempting to send 2 pending error messages
```
pip list:

```
Package Version
absl-py 1.4.0
asttokens 2.2.1
backcall 0.2.0
cachetools 5.3.0
certifi 2022.12.7
charset-normalizer 3.0.1
contourpy 1.0.7
cycler 0.11.0
decorator 5.1.1
executing 1.2.0
fonttools 4.38.0
google-auth 2.16.0
google-auth-oauthlib 0.4.6
grpcio 1.51.1
idna 3.4
ipython 8.9.0
jedi 0.18.2
kiwisolver 1.4.4
Markdown 3.4.1
markdown-it-py 2.1.0
MarkupSafe 2.1.2
matplotlib 3.6.3
matplotlib-inline 0.1.6
mdurl 0.1.2
numpy 1.24.1
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
oauthlib 3.2.2
onnx 1.13.0
onnx-simplifier 0.4.13
opencv-python 4.7.0.68
packaging 23.0
pandas 1.5.3
parso 0.8.3
pexpect 4.8.0
pickleshare 0.7.5
Pillow 9.4.0
pip 22.3.1
prompt-toolkit 3.0.36
protobuf 3.20.3
psutil 5.9.4
ptyprocess 0.7.0
pure-eval 0.2.2
pyasn1 0.4.8
pyasn1-modules 0.2.8
Pygments 2.14.0
pyparsing 3.0.9
python-dateutil 2.8.2
pytz 2022.7.1
PyYAML 6.0
requests 2.28.2
requests-oauthlib 1.3.1
rich 13.3.1
rsa 4.9
scipy 1.10.0
seaborn 0.12.2
sentry-sdk 1.14.0
setuptools 65.6.3
six 1.16.0
stack-data 0.6.2
tensorboard 2.11.2
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
thop 0.1.1.post2209072238
torch 1.13.1
torchvision 0.14.1
tqdm 4.64.1
traitlets 5.9.0
typing_extensions 4.4.0
ultralytics 8.0.25
urllib3 1.26.14
wcwidth 0.2.6
Werkzeug 2.2.2
wheel 0.37.1
```

Additional

No response

AyushExel commented 1 year ago

@PureHing hey, thanks for reporting! Can you try setting your batch size to a lower value to confirm that CUDA memory keeps increasing and causes OOM? Try a batch size of 8 or 12 for only a few epochs to confirm. Thanks!
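For example, something along these lines via the Python API (a minimal sketch reusing the dataset YAML, image size, and device from your log; the CLI equivalent is simply batch=8 instead of batch=-1):

```python
from ultralytics import YOLO

# Rough sketch: same settings as the failing run, but a fixed small batch size
# and only a few epochs, just to check whether GPU memory still climbs.
model = YOLO("yolov5su.pt")
model.train(
    data="./data/mydata.yaml",  # dataset YAML from the original command
    imgsz=960,
    epochs=3,    # a few epochs are enough to observe the memory trend
    batch=8,     # fixed batch size instead of batch=-1 (AutoBatch)
    device=3,
    rect=True,
)
```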

PureHing commented 1 year ago

@AyushExel With batch=2, the same problem occurs.

AyushExel commented 1 year ago

Okay then looks like a memory leak. What OS are you on?

PureHing commented 1 year ago

@AyushExel OS: Linux server 5.4.0-26-generic #30-Ubuntu SMP Mon Apr 20 16:58:30 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
CUDA info: Ultralytics YOLOv8.0.25 🚀 Python-3.10.9 torch-1.13.1+cu117 CUDA:3 (NVIDIA GeForce RTX 3080 Ti, 12045MiB)

I used batch=-1 to automatically calculate the maximum batch size, which came out to 17. Why does even batch=2 not work?

JustasBart commented 1 year ago

@PureHing Can you confirm that you don't have a huge number of images in your training set, such that it's genuinely just running out of memory? Also try it with batch=1 and see if it still runs out...

PureHing commented 1 year ago

@JustasBart With yolov5su.yaml, imgsz=960, and rect training, how much memory is theoretically required?

> @PureHing Can you confirm that you don't have a huge number of images in your training set, such that it's genuinely just running out of memory? Also try it with batch=1 and see if it still runs out...

The images are from the CrowdHuman dataset.

JustasBart commented 1 year ago

@PureHing Hmm... The one last thing I could suggest would be to reboot your machine and try again... But that sounds rather trivial and will likely change nothing... I don't really have anything else on this so I hope you work this out before too long :rocket:

PureHing commented 1 year ago

@JustasBart With workers=8 and batch=2 it can run, but it still feels abnormal.

JustasBart commented 1 year ago

@PureHing I would always set the batch size to -1 to auto-find the maximum batch size, but sometimes, especially on the validation step of my first or second epoch, it would run out of GPU memory. What I do in that case is remove a single batch from the theoretical maximum: in your case, if it can do 17, try 16 or 15, and that should be enough to get over it.

Good luck! :rocket:

PureHing commented 1 year ago

@JustasBart

> try 16 or 15

With batch=16 or 15, it is still unable to run normally.

w013nad commented 1 year ago

Are you running COCO or a custom dataset? If it's a custom set, how many targets are in the images?

PureHing commented 1 year ago

@AyushExel The CrowdHuman dataset, training with 15,000 images.

w013nad commented 1 year ago

@PureHing Going through those annotations, at least one of those images has over 460 targets. Try removing any images with more than 50 targets and it should work better. YOLOv8 creates a separate set of gradients for each target during the loss function. The more targets you have, the more memory it will take.

If you do not wish to do this, I would recommend reverting to either scaled-yolov4, yolov5, or some other architecture.

Edit: Looking further, it looks like 2100 of the images have more than 50 targets.

```python
# Count how many images in the CrowdHuman train annotations have more than 50
# targets. Each line of annotation_train.odgt describes one image, and every
# box entry carries a "tag" field, so counting "tag" occurrences per line
# approximates the number of targets in that image.
max_count = 0
count = 0
with open('annotation_train.odgt', 'r') as f:
    for line in f:
        n_targets = line.count('tag')
        max_count = max(max_count, n_targets)
        if n_targets > 50:
            count += 1
print(f'Number of images with more than 50 targets: {count}')
print(f'Max number of targets in an image: {max_count}')
```

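If you go the filtering route, here is a rough sketch of moving aside the crowded label files (hypothetical paths; it assumes the annotations have already been converted to YOLO txt format with one box per line, so adjust to your own layout):

```python
from pathlib import Path

# Hypothetical layout: labels/train/xxx.txt next to images/train/xxx.jpg,
# one "class cx cy w h" line per box. Paths are placeholders.
labels_dir = Path("datasets/crowdhuman/labels/train")
skip_dir = Path("datasets/crowdhuman/labels/train_skipped")
skip_dir.mkdir(parents=True, exist_ok=True)

moved = 0
for label_file in labels_dir.glob("*.txt"):
    n_boxes = sum(1 for line in label_file.read_text().splitlines() if line.strip())
    if n_boxes > 50:
        # Move the label file aside; move or remove the matching image too if
        # you want those samples excluded entirely rather than treated as
        # background images.
        label_file.rename(skip_dir / label_file.name)
        moved += 1
print(f"Moved {moved} crowded label files out of the training set")
```
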
Ss-shuang123 commented 1 year ago

@w013nad Thanks. I have the same problem, and I think this is the right answer, since it works fine when I use YOLOv5. But how should I understand "YOLOv8 creates a separate set of gradients for each target during the loss function"? Looking forward to your reply!

Burhan-Q commented 1 year ago

Any reason not to call torch.cuda.empty_cache() at the end of each epoch? On an initial test run (ongoing), training has progressed further than it previously did before hitting the OOM error, and epoch time is shorter by ~30 seconds.
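A rough sketch of what I mean, assuming the Ultralytics add_callback hook and the on_train_epoch_end event are available in your version (model and dataset here are just placeholders):

```python
import torch
from ultralytics import YOLO

def clear_cuda_cache(trainer):
    # Release cached, unused CUDA blocks back to the driver after each training
    # epoch; this does not free memory held by live tensors.
    torch.cuda.empty_cache()

model = YOLO("yolov8n.pt")                      # placeholder model
model.add_callback("on_train_epoch_end", clear_cuda_cache)
model.train(data="coco128.yaml", epochs=100, imgsz=640)  # placeholder settings
```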

glenn-jocher commented 1 year ago

@Burhan-Q Interesting idea. Can you please submit a PR with your proposed change and before-and-after results (speed and CUDA usage)?

Burhan-Q commented 1 year ago

I will see if I can get a benchmark setup that I can share results from in the next week or so, happy to submit a PR if it's worthwhile

glenn-jocher commented 1 year ago

@Burhan-Q great thanks!

github-actions[bot] commented 1 year ago

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐