ultralytics / ultralytics

Ultralytics YOLO11 πŸš€
https://docs.ultralytics.com
GNU Affero General Public License v3.0

CUDA out of memory error when training a custom YOLOv8 model in WSL Ubuntu 22.04 despite sufficient video memory #16862

Open RUIHANGxing opened 1 month ago

RUIHANGxing commented 1 month ago

Search before asking

Question

My machine has an 11th-gen Intel i7 CPU, an RTX 3090 GPU with 24 GB of VRAM, and 32 GB of RAM, running Windows 22H2. I installed WSL Ubuntu 22.04 for model training, with 24 GB of RAM and 7 GB of swap allocated to WSL; CUDA 12.1, Python 3.10.14, PyTorch 2.3.1. The dataset is a very small safety-helmet dataset. During training, YOLO's workers option can only be set to at most 1 and batch to at most 16, otherwise an error is reported:

RuntimeError: CUDA error: device-side assert triggered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

However, when I checked GPU memory usage, it was not high and there was clearly enough memory available. The error points to dataloader.py and pin_memory.py, which are responsible for loading the dataset. After some research I suspected that loading the dataset was exhausting RAM or VRAM, but monitoring both showed neither was heavily used and there was plenty of headroom. Through testing I found that with workers=0 training is basically normal, apart from a drop in speed, and an OOM error only appears once I push the batch size to 128, at which point VRAM really is exhausted. I'd appreciate help with this; I'd like to be able to use the workers setting in WSL to improve my training speed. Thank you.
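
For reference, a minimal sketch of the two configurations described above (the specific values are only illustrative of what I tested):

from ultralytics import YOLO

model = YOLO("/home/ruihangxing/cache/yoloRe/yolov8n.pt")

# workers=0: trains normally (only slower); a genuine OOM only appears around batch=128
model.train(data="mydataset.yaml", epochs=32, workers=0, batch=64)

# workers>=1 with a larger batch: fails early with the CUDA error shown below
# model.train(data="mydataset.yaml", epochs=32, workers=8, batch=64)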

Complete error message:

Ultralytics 8.3.7 πŸš€ Python-3.10.14 torch-2.3.1+cu121 CUDA:0 (NVIDIA GeForce RTX 3090, 24576MiB)
engine/trainer: task=detect, mode=train, model=/home/ruihangxing/cache/yoloRe/yolov8n.pt, data=mydataset.yaml, epochs=32, time=None, patience=100, batch=64, imgsz=640, save=True, save_period=-1, cache=None, device=0, workers=8, project=None, name=train36, exist_ok=False, pretrained=True, optimizer=auto, verbose=True, seed=0, deterministic=True, single_cls=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, amp=True, fraction=1.0, profile=False, freeze=None, multi_scale=False, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, vid_stride=1, stream_buffer=False, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, embed=None, show=False, save_frames=False, save_txt=False, save_conf=False, save_crop=False, show_labels=True, show_conf=True, show_boxes=True, line_width=None, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=True, opset=None, workspace=4, nms=False, lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, pose=12.0, kobj=1.0, label_smoothing=0.0, nbs=64, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, bgr=0.0, mosaic=1.0, mixup=0.0, copy_paste=0.0, copy_paste_mode=flip, auto_augment=randaugment, erasing=0.4, crop_fraction=1.0, cfg=None, tracker=botsort.yaml, save_dir=/home/ruihangxing/yolov8_test/ultralytics/runs/detect/train36
Overriding model.yaml nc=80 with nc=10

               from  n    params  module                                       arguments                     

0 -1 1 464 ultralytics.nn.modules.conv.Conv [3, 16, 3, 2]
1 -1 1 4672 ultralytics.nn.modules.conv.Conv [16, 32, 3, 2]
2 -1 1 7360 ultralytics.nn.modules.block.C2f [32, 32, 1, True]
3 -1 1 18560 ultralytics.nn.modules.conv.Conv [32, 64, 3, 2]
4 -1 2 49664 ultralytics.nn.modules.block.C2f [64, 64, 2, True]
5 -1 1 73984 ultralytics.nn.modules.conv.Conv [64, 128, 3, 2]
6 -1 2 197632 ultralytics.nn.modules.block.C2f [128, 128, 2, True]
7 -1 1 295424 ultralytics.nn.modules.conv.Conv [128, 256, 3, 2]
8 -1 1 460288 ultralytics.nn.modules.block.C2f [256, 256, 1, True]
9 -1 1 164608 ultralytics.nn.modules.block.SPPF [256, 256, 5]
10 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
11 [-1, 6] 1 0 ultralytics.nn.modules.conv.Concat [1]
12 -1 1 148224 ultralytics.nn.modules.block.C2f [384, 128, 1]
13 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
14 [-1, 4] 1 0 ultralytics.nn.modules.conv.Concat [1]
15 -1 1 37248 ultralytics.nn.modules.block.C2f [192, 64, 1]
16 -1 1 36992 ultralytics.nn.modules.conv.Conv [64, 64, 3, 2]
17 [-1, 12] 1 0 ultralytics.nn.modules.conv.Concat [1]
18 -1 1 123648 ultralytics.nn.modules.block.C2f [192, 128, 1]
19 -1 1 147712 ultralytics.nn.modules.conv.Conv [128, 128, 3, 2]
20 [-1, 9] 1 0 ultralytics.nn.modules.conv.Concat [1]
21 -1 1 493056 ultralytics.nn.modules.block.C2f [384, 256, 1]
22 [15, 18, 21] 1 432622 ultralytics.nn.modules.head.Detect [10, [64, 128, 256]]
Model summary: 249 layers, 2,692,158 parameters, 2,692,142 gradients, 7.0 GFLOPs

Transferred 313/391 items from pretrained weights
TensorBoard: Start with 'tensorboard --logdir /home/ruihangxing/yolov8_test/ultralytics/runs/detect/train36', view at http://localhost:6006/
Freezing layer 'model.22.dfl.conv.weight'
AMP: running Automatic Mixed Precision (AMP) checks with YOLO11n...
AMP: checks passed βœ…
train: Scanning /home/ruihangxing/cache/yoloRe/ultralytics/Dataset/css-data/train/labels.cache... 2605 images, 6 backgrounds, 0
train: WARNING ⚠️ /home/ruihangxing/cache/yoloRe/ultralytics/Dataset/css-data/train/images/004720_jpg.rf.afc486560a4004c7cfd67910af31a29c.jpg: 1 duplicate labels removed
train: WARNING ⚠️ /home/ruihangxing/cache/yoloRe/ultralytics/Dataset/css-data/train/images/construction-813-_jpg.rf.b085952261fd98f2e76b8065de149b5f.jpg: 1 duplicate labels removed
val: Scanning /home/ruihangxing/cache/yoloRe/ultralytics/Dataset/css-data/valid/labels.cache... 114 images, 10 backgrounds, 0 c
Plotting labels to /home/ruihangxing/yolov8_test/ultralytics/runs/detect/train36/labels.jpg...
optimizer: 'optimizer=auto' found, ignoring 'lr0=0.01' and 'momentum=0.937' and determining best 'optimizer', 'lr0' and 'momentum' automatically...
optimizer: AdamW(lr=0.000714, momentum=0.9) with parameter groups 63 weight(decay=0.0), 70 weight(decay=0.0005), 69 bias(decay=0.0)
TensorBoard: model graph visualization added βœ…
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to /home/ruihangxing/yolov8_test/ultralytics/runs/detect/train36
Starting training for 32 epochs...

  Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
   1/32      10.8G      1.574      3.962      1.574       1584        640:   7%|β–‹         | 3/41 [00:01<00:23,  1.59it/s]

Traceback (most recent call last):
  File "/home/ruihangxing/cache/yoloRe/ultralytics/main.py", line 11, in <module>
    model.train(data="mydataset.yaml",epochs=32)
  File "/home/ruihangxing/cache/yoloRe/ultralytics/ultralytics/engine/model.py", line 804, in train
    self.trainer.train()
  File "/home/ruihangxing/cache/yoloRe/ultralytics/ultralytics/engine/trainer.py", line 207, in train
    self._do_train(world_size)
  File "/home/ruihangxing/cache/yoloRe/ultralytics/ultralytics/engine/trainer.py", line 367, in _do_train
    for i, batch in pbar:
  File "/home/ruihangxing/miniconda3/envs/yolov8/lib/python3.10/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/home/ruihangxing/cache/yoloRe/ultralytics/ultralytics/data/build.py", line 48, in __iter__
    yield next(self.iterator)
  File "/home/ruihangxing/miniconda3/envs/yolov8/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/home/ruihangxing/miniconda3/envs/yolov8/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
    return self._process_data(data)
  File "/home/ruihangxing/miniconda3/envs/yolov8/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
    data.reraise()
  File "/home/ruihangxing/miniconda3/envs/yolov8/lib/python3.10/site-packages/torch/_utils.py", line 705, in reraise
    raise exception
RuntimeError: Caught RuntimeError in pin memory thread for device 0.
Original Traceback (most recent call last):
  File "/home/ruihangxing/miniconda3/envs/yolov8/lib/python3.10/site-packages/torch/utils/data/_utils/pin_memory.py", line 37, in do_one_step
    data = pin_memory(data, device)
  File "/home/ruihangxing/miniconda3/envs/yolov8/lib/python3.10/site-packages/torch/utils/data/_utils/pin_memory.py", line 68, in pin_memory
    clone.update({k: pin_memory(sample, device) for k, sample in data.items()})
  File "/home/ruihangxing/miniconda3/envs/yolov8/lib/python3.10/site-packages/torch/utils/data/_utils/pin_memory.py", line 68, in <dictcomp>
    clone.update({k: pin_memory(sample, device) for k, sample in data.items()})
  File "/home/ruihangxing/miniconda3/envs/yolov8/lib/python3.10/site-packages/torch/utils/data/_utils/pin_memory.py", line 58, in pin_memory
    return data.pin_memory(device)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Aborted

My training script:

import torch
import cv2
from ultralytics import YOLO
import os

# os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
# os.environ['CUDA_VISIBLE_DEVICES'] = '0'
# torch.backends.cudnn.enabled = False

model = YOLO("/home/ruihangxing/cache/yoloRe/yolov8n.pt")
model.train(data="mydataset.yaml", epochs=32)

My dataset YAML file:

train: /home/ruihangxing/cache/yoloRe/ultralytics/Dataset/css-data/train/images/
val: /home/ruihangxing/cache/yoloRe/ultralytics/Dataset/css-data/valid/images/
test: /home/ruihangxing/cache/yoloRe/ultralytics/Dataset/css-data/test/images/
nc: 10
names: [Hardhat, Mask, NO-Hardhat, NO-Mask, NO-Safety-Vest, Person, Safety-Cone, Safety-Vest, Machinery, Vehicle]

Additional

I set workers in default.yaml.
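
For reference, the same settings can also be passed directly to model.train() instead of editing default.yaml; a minimal sketch (the values are illustrative):

from ultralytics import YOLO

model = YOLO("/home/ruihangxing/cache/yoloRe/yolov8n.pt")
# arguments passed to train() override the corresponding defaults from default.yaml
model.train(data="mydataset.yaml", epochs=32, workers=1, batch=32)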

UltralyticsAssistant commented 1 month ago

πŸ‘‹ Hello @RUIHANGxing, thank you for reaching out to Ultralytics πŸš€! This is an automated response, and an Ultralytics engineer will assist you soon.

We recommend checking out our Docs for guidance. Specifically, explore Python and CLI for various usage examples.

As you're experiencing a πŸ› Bug, we kindly ask you to provide a minimum reproducible example to help us address the issue swiftly.

In case your issue is related to custom training, ensure you’re following our Tips for Best Training Results and please include comprehensive information like dataset examples and logs.

For community support, join us on Discord 🎧, engage in deeper conversations on Discourse, or dive into discussions on our Subreddit.

Upgrade

Ensure you are using the latest version of ultralytics and have met all requirements, particularly in a Python>=3.8 environment with PyTorch>=1.8:

pip install -U ultralytics

Environments

You can run YOLO in any of the following verified environments:

Status

Check the current status of our Ultralytics CI tests here: Ultralytics CI

This badge indicates if all tests are passing, ensuring correct operation across platforms.

Thank you for your patience! 😊

RUIHANGxing commented 1 month ago

When I set workers=1, training works up to batch=32; anything larger fails, even though GPU memory usage is under 10 GB at that point. With workers=1 and batch=64, I can only complete 4 epochs of training, and peak GPU memory usage is only 13 GB.

RUIHANGxing commented 1 month ago

I am just a rookie with YOLO, so I cannot tell whether this is a CUDA problem, a WSL problem, or a YOLO problem. With workers=0, however, I can train up to batch=128; at that point GPU memory really is fully occupied, so the OOM error there is expected. The rest I cannot explain, so I speculate it is a YOLO problem? I could not find a workable solution online.

Y-T-G commented 1 month ago

Why use WSL? ultralytics works with Windows natively

RUIHANGxing commented 1 month ago

Why use WSL? ultralytics works with Windows natively

The computer is a shared lab machine, so I cannot install Linux on it directly. I also cannot access Docker Hub, so I use WSL to isolate my environment from other users'. My goal is to deploy the model on some Linux development boards, so I want to keep the whole environment as consistent as possible.

glenn-jocher commented 1 month ago

Using WSL for consistency across environments makes sense. To address your memory issue, try reducing the image size or using mixed precision with amp=True. This can help manage GPU memory usage more effectively.
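
For example, a minimal sketch of the suggestion above (imgsz=480 and batch=32 are illustrative values, not a recommendation):

from ultralytics import YOLO

model = YOLO("yolov8n.pt")
# smaller training images and AMP reduce GPU memory needed per batch
model.train(data="mydataset.yaml", epochs=32, imgsz=480, batch=32, amp=True)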

Y-T-G commented 1 month ago

Training in a Windows environment and deploying in a Linux environment, or vice versa, shouldn't be problematic at all.

If you are concerned about training speed and performance, you should run the training in the native environment rather than in a virtualized one, which inevitably has some performance degradation.

RUIHANGxing commented 1 month ago

Using WSL for consistency across environments makes sense. To address your memory issue, try reducing the image size or using mixed precision with amp=True. This can help manage GPU memory usage more effectively.

I have tested with amp=True and also tried reducing the batch size. I have now pinpointed the issue to the workers setting. With workers=0 everything works fine. With workers=1 I can run with a batch size of 32, but if I also set cache=True it quickly throws an error. Without workers, my training speed drops significantly to about 2 it/s. The WSL documentation indicates there is not much difference between WSL Linux and native Linux. I cannot identify the exact cause, as I am monitoring memory and GPU usage in real time and there is still plenty of available space.

glenn-jocher commented 4 weeks ago

It seems like the issue might be related to how WSL handles multiprocessing. You could try setting pin_memory=False in your DataLoader, which might help with the memory errors. Additionally, ensure your NVIDIA drivers and WSL are fully updated.
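
For example, a sketch under the assumption that your Ultralytics version reads a PIN_MEMORY environment variable in ultralytics/data/build.py; if it does not, you can edit pin_memory=False in the build_dataloader() call there instead:

import os

# assumption: recent Ultralytics versions check this variable when building the dataloader;
# it must be set before ultralytics is imported
os.environ["PIN_MEMORY"] = "False"

from ultralytics import YOLO

model = YOLO("/home/ruihangxing/cache/yoloRe/yolov8n.pt")
model.train(data="mydataset.yaml", epochs=32, workers=8, batch=64)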

Y-T-G commented 4 weeks ago

WSL2 issue with pinned memory.

https://github.com/microsoft/WSL/issues/8447

RUIHANGxing commented 4 weeks ago

WSL2 issue with pinned memory.

microsoft/WSL#8447

Thank you very much. This does look like a WSL problem, but I did not find a solution to my case in that WSL issue. Maybe I should do the training on native Linux as you suggested, or just keep using WSL for simple environment isolation on the shared lab computer. Thank you very much.

glenn-jocher commented 4 weeks ago

You're welcome! Training on native Linux might indeed resolve the issue. If you need further assistance, feel free to reach out.