Open RUIHANGxing opened 1 month ago
Hello @RUIHANGxing, thank you for reaching out to Ultralytics! This is an automated response, and an Ultralytics engineer will assist you soon.
We recommend checking out our Docs for guidance. Specifically, explore Python and CLI for various usage examples.
As you're experiencing a Bug, we kindly ask you to provide a minimum reproducible example to help us address the issue swiftly.
In case your issue is related to custom training, ensure you're following our Tips for Best Training Results and please include comprehensive information like dataset examples and logs.
For community support, join us on Discord, engage in deeper conversations on Discourse, or dive into discussions on our Subreddit.
Ensure you are using the latest version of ultralytics and have met all requirements, particularly in a Python>=3.8 environment with PyTorch>=1.8:
pip install -U ultralytics
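A quick way to confirm the installed versions and CUDA visibility is the environment report exported as checks() in recent ultralytics releases; if your version lacks it, printing torch.__version__ and torch.cuda.is_available() covers the essentials:

import ultralytics

# Prints ultralytics/Python/torch versions and the CUDA devices PyTorch can see.
ultralytics.checks()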
Thank you for your patience!
When I set workers=1, training works up to batch=32; anything larger fails, even though GPU memory usage stays below 10 GB at that point. With workers=1 and batch=64 I can only get through 4 epochs before the error, and peak GPU memory usage is only 13 GB.
I am still new to YOLO, so I cannot tell for sure whether this is a CUDA, WSL, or YOLO problem. With workers=0 I can train up to batch=128; at that point the GPU memory really is full, and the resulting OOM error makes sense. The other failures I cannot explain, so I suspect it may be a YOLO issue? I could not find a workable solution online.
Why use WSL? ultralytics works with Windows natively.
Why use WSL? ultralytics works with Windows natively.

The computer is a shared lab machine, so I cannot install Linux on it directly. I also cannot access Docker Hub, so I use WSL to isolate my environment from other users'. My goal is to deploy the model on Linux development boards, so I want to keep the whole environment as consistent as possible.
Using WSL for consistency across environments makes sense. To address your memory issue, try reducing the image size or using mixed precision with amp=True. This can help manage GPU memory usage more effectively.
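A minimal sketch of that suggestion (the imgsz and batch values below are illustrative starting points, not tuned recommendations):

from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # assuming the same pretrained weights as in your script
# Smaller images and AMP both lower the GPU memory needed per batch.
model.train(data="mydataset.yaml", epochs=32, imgsz=512, amp=True, batch=16)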
Training in a Windows environment and deploying in a Linux environment, or vice versa, shouldn't be problematic at all.
If you are concerned about training speed and performance, then you should be running the training in the native environment instead of an environment that's virtualized and inevitably has performance degradation.
Using WSL for consistency across environments makes sense. To address your memory issue, try reducing the image size or using mixed precision with amp=True. This can help manage GPU memory usage more effectively.

I have tested with amp=True and also tried reducing the batch size. I have now pinpointed the issue to the workers setting. With workers=0 everything works fine. With workers=1 I can run batch=32, but if I also set cache=True it quickly throws an error. Without workers my training speed drops significantly, to about 2 it/s. The WSL documentation suggests there is little difference between WSL Linux and native Linux. I cannot identify the exact cause: I am monitoring RAM and GPU usage in real time, and there is still plenty of headroom.
It seems like the issue might be related to how WSL handles multiprocessing. You could try setting pin_memory=False in your DataLoader, which might help with the memory errors. Additionally, ensure your NVIDIA drivers and WSL are fully updated.
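Note that pin_memory is not exposed as a train() argument, so acting on this means either setting the PIN_MEMORY environment variable that some ultralytics versions read in ultralytics/data/build.py, or editing pin_memory=False into the DataLoader built there; both depend on the installed version, so treat the sketch below as an assumption to verify against your copy of build.py:

import os

# Assumption: the installed ultralytics version checks a PIN_MEMORY environment
# variable when constructing its DataLoader (ultralytics/data/build.py). If it
# does not, the fallback is changing pin_memory=False in that file directly.
os.environ["PIN_MEMORY"] = "False"  # must be set before the dataloader is built

from ultralytics import YOLO

model = YOLO("yolov8n.pt")
model.train(data="mydataset.yaml", epochs=32, workers=1, batch=32)

If the crashes also correlate with cache=True, cache="disk" is another train() option worth trying, since it avoids holding the decoded dataset in WSL's RAM.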
WSL2 issue with pinned memory.
WSL2 issue with pinned memory.
Thank you very much. This may indeed be a WSL problem, but I did not find a solution to my case in that WSL issue. Perhaps I should do the training on native Linux as you suggested, and just use WSL for simple environment isolation on the shared lab computer. Thanks again.
You're welcome! Training on native Linux might indeed resolve the issue. If you need further assistance, feel free to reach out.
Search before asking
Question
My machine has an 11th-gen Intel i7 CPU, an RTX 3090 with 24 GB of VRAM, 32 GB of RAM, and Windows 22H2. I installed WSL Ubuntu 22.04 for model training, with 24 GB of RAM and 7 GB of swap assigned to WSL, CUDA 12.1, Python 3.10.14, and PyTorch 2.3.1. The dataset is a very small safety-helmet dataset. During training, YOLO's workers option can be set to at most 1 and batch to at most 16; anything higher raises an error:
RuntimeError: CUDA error: device-side assert triggered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
However, checking GPU memory usage shows it is not high, and system RAM is clearly sufficient. The error points to dataloader.py and pin_memory.py, which are responsible for loading the dataset, so I suspected that loading the dataset exhausted RAM or VRAM. But after monitoring both during training, neither RAM nor VRAM is heavily occupied; there is still plenty of headroom. Through testing I found that with workers=0 training is basically normal apart from being slower, and only at batch=128 does VRAM genuinely run out and produce an OOM error. I would appreciate help solving this; I would like to be able to use the workers setting in WSL to improve my training speed. Thank you.
Complete error message:
Ultralytics 8.3.7 🚀 Python-3.10.14 torch-2.3.1+cu121 CUDA:0 (NVIDIA GeForce RTX 3090, 24576MiB)
engine/trainer: task=detect, mode=train, model=/home/ruihangxing/cache/yoloRe/yolov8n.pt, data=mydataset.yaml, epochs=32, time=None, patience=100, batch=64, imgsz=640, save=True, save_period=-1, cache=None, device=0, workers=8, project=None, name=train36, exist_ok=False, pretrained=True, optimizer=auto, verbose=True, seed=0, deterministic=True, single_cls=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, amp=True, fraction=1.0, profile=False, freeze=None, multi_scale=False, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, vid_stride=1, stream_buffer=False, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, embed=None, show=False, save_frames=False, save_txt=False, save_conf=False, save_crop=False, show_labels=True, show_conf=True, show_boxes=True, line_width=None, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=True, opset=None, workspace=4, nms=False, lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, pose=12.0, kobj=1.0, label_smoothing=0.0, nbs=64, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, bgr=0.0, mosaic=1.0, mixup=0.0, copy_paste=0.0, copy_paste_mode=flip, auto_augment=randaugment, erasing=0.4, crop_fraction=1.0, cfg=None, tracker=botsort.yaml, save_dir=/home/ruihangxing/yolov8_test/ultralytics/runs/detect/train36
Overriding model.yaml nc=80 with nc=10
      from  n  params  module  arguments
0 -1 1 464 ultralytics.nn.modules.conv.Conv [3, 16, 3, 2]
1 -1 1 4672 ultralytics.nn.modules.conv.Conv [16, 32, 3, 2]
2 -1 1 7360 ultralytics.nn.modules.block.C2f [32, 32, 1, True]
3 -1 1 18560 ultralytics.nn.modules.conv.Conv [32, 64, 3, 2]
4 -1 2 49664 ultralytics.nn.modules.block.C2f [64, 64, 2, True]
5 -1 1 73984 ultralytics.nn.modules.conv.Conv [64, 128, 3, 2]
6 -1 2 197632 ultralytics.nn.modules.block.C2f [128, 128, 2, True]
7 -1 1 295424 ultralytics.nn.modules.conv.Conv [128, 256, 3, 2]
8 -1 1 460288 ultralytics.nn.modules.block.C2f [256, 256, 1, True]
9 -1 1 164608 ultralytics.nn.modules.block.SPPF [256, 256, 5]
10 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
11 [-1, 6] 1 0 ultralytics.nn.modules.conv.Concat [1]
12 -1 1 148224 ultralytics.nn.modules.block.C2f [384, 128, 1]
13 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
14 [-1, 4] 1 0 ultralytics.nn.modules.conv.Concat [1]
15 -1 1 37248 ultralytics.nn.modules.block.C2f [192, 64, 1]
16 -1 1 36992 ultralytics.nn.modules.conv.Conv [64, 64, 3, 2]
17 [-1, 12] 1 0 ultralytics.nn.modules.conv.Concat [1]
18 -1 1 123648 ultralytics.nn.modules.block.C2f [192, 128, 1]
19 -1 1 147712 ultralytics.nn.modules.conv.Conv [128, 128, 3, 2]
20 [-1, 9] 1 0 ultralytics.nn.modules.conv.Concat [1]
21 -1 1 493056 ultralytics.nn.modules.block.C2f [384, 256, 1]
22 [15, 18, 21] 1 432622 ultralytics.nn.modules.head.Detect [10, [64, 128, 256]]
Model summary: 249 layers, 2,692,158 parameters, 2,692,142 gradients, 7.0 GFLOPs
Transferred 313/391 items from pretrained weights
TensorBoard: Start with 'tensorboard --logdir /home/ruihangxing/yolov8_test/ultralytics/runs/detect/train36', view at http://localhost:6006/
Freezing layer 'model.22.dfl.conv.weight'
AMP: running Automatic Mixed Precision (AMP) checks with YOLO11n...
AMP: checks passed ✅
train: Scanning /home/ruihangxing/cache/yoloRe/ultralytics/Dataset/css-data/train/labels.cache... 2605 images, 6 backgrounds, 0
train: WARNING ⚠️ /home/ruihangxing/cache/yoloRe/ultralytics/Dataset/css-data/train/images/004720_jpg.rf.afc486560a4004c7cfd67910af31a29c.jpg: 1 duplicate labels removed
train: WARNING ⚠️ /home/ruihangxing/cache/yoloRe/ultralytics/Dataset/css-data/train/images/construction-813-_jpg.rf.b085952261fd98f2e76b8065de149b5f.jpg: 1 duplicate labels removed
val: Scanning /home/ruihangxing/cache/yoloRe/ultralytics/Dataset/css-data/valid/labels.cache... 114 images, 10 backgrounds, 0 c
Plotting labels to /home/ruihangxing/yolov8_test/ultralytics/runs/detect/train36/labels.jpg...
optimizer: 'optimizer=auto' found, ignoring 'lr0=0.01' and 'momentum=0.937' and determining best 'optimizer', 'lr0' and 'momentum' automatically...
optimizer: AdamW(lr=0.000714, momentum=0.9) with parameter groups 63 weight(decay=0.0), 70 weight(decay=0.0005), 69 bias(decay=0.0)
TensorBoard: model graph visualization added ✅
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to /home/ruihangxing/yolov8_test/ultralytics/runs/detect/train36
Starting training for 32 epochs...
Traceback (most recent call last):
File "/home/ruihangxing/cache/yoloRe/ultralytics/main.py", line 11, in <module>
model.train(data="mydataset.yaml",epochs=32)
File "/home/ruihangxing/cache/yoloRe/ultralytics/ultralytics/engine/model.py", line 804, in train
self.trainer.train()
File "/home/ruihangxing/cache/yoloRe/ultralytics/ultralytics/engine/trainer.py", line 207, in train
self._do_train(world_size)
File "/home/ruihangxing/cache/yoloRe/ultralytics/ultralytics/engine/trainer.py", line 367, in _do_train
for i, batch in pbar:
File "/home/ruihangxing/miniconda3/envs/yolov8/lib/python3.10/site-packages/tqdm/std.py", line 1181, in iter
for obj in iterable:
File "/home/ruihangxing/cache/yoloRe/ultralytics/ultralytics/data/build.py", line 48, in iter
yield next(self.iterator)
File "/home/ruihangxing/miniconda3/envs/yolov8/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in next
data = self._next_data()
File "/home/ruihangxing/miniconda3/envs/yolov8/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
return self._process_data(data)
File "/home/ruihangxing/miniconda3/envs/yolov8/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
data.reraise()
File "/home/ruihangxing/miniconda3/envs/yolov8/lib/python3.10/site-packages/torch/_utils.py", line 705, in reraise
raise exception
RuntimeError: Caught RuntimeError in pin memory thread for device 0.
Original Traceback (most recent call last):
File "/home/ruihangxing/miniconda3/envs/yolov8/lib/python3.10/site-packages/torch/utils/data/_utils/pin_memory.py", line 37, in do_one_step
data = pin_memory(data, device)
File "/home/ruihangxing/miniconda3/envs/yolov8/lib/python3.10/site-packages/torch/utils/data/_utils/pin_memory.py", line 68, in pin_memory
clone.update({k: pin_memory(sample, device) for k, sample in data.items()})
File "/home/ruihangxing/miniconda3/envs/yolov8/lib/python3.10/site-packages/torch/utils/data/_utils/pin_memory.py", line 68, in
clone.update({k: pin_memory(sample, device) for k, sample in data.items()})
File "/home/ruihangxing/miniconda3/envs/yolov8/lib/python3.10/site-packages/torch/utils/data/_utils/pin_memory.py", line 58, in pin_memory
return data.pin_memory(device)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Aborted
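The failure is raised in the pin-memory thread, i.e. while allocating page-locked host RAM for the batch, not while putting tensors on the GPU, which would explain why VRAM still looks mostly free when it happens. To check how much pinned memory WSL2 will actually grant, a rough standalone probe (purely a diagnostic sketch, not part of the Ultralytics API) could look like this:

import torch

# Allocate pinned (page-locked) host memory in 256 MiB chunks until the CUDA
# driver refuses, to get a feel for where the limit sits under WSL2.
chunks = []
try:
    for i in range(64):  # up to ~16 GiB
        chunks.append(torch.empty(256, 1024, 1024, dtype=torch.uint8).pin_memory())
        print(f"pinned ~{(i + 1) * 256} MiB")
except RuntimeError as e:
    print(f"pin_memory failed after ~{len(chunks) * 256} MiB: {e}")
finally:
    del chunks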
My training script:
import torch
import cv2
from ultralytics import YOLO
import os
# os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
# os.environ['CUDA_VISIBLE_DEVICES'] = '0'
# torch.backends.cudnn.enabled=False
model = YOLO("/home/ruihangxing/cache/yoloRe/yolov8n.pt")
model.train(data="mydataset.yaml",epochs=32)
My dataset YAML file:
train: /home/ruihangxing/cache/yoloRe/ultralytics/Dataset/css-data/train/images/
val: /home/ruihangxing/cache/yoloRe/ultralytics/Dataset/css-data/valid/images/
test: /home/ruihangxing/cache/yoloRe/ultralytics/Dataset/css-data/test/images/
nc: 10
names: [Hardhat,Mask,NO-Hardhat,NO-Mask,NO-Safety-Vest,Person,Safety-Cone,Safety-Vest,Machinery,Vehicle]
Additional
I set workers in default.yaml.
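For reference, workers (like batch, cache, and the other settings above) can also be passed directly to train() instead of edited globally in default.yaml, which makes it easier to sweep values while debugging; a minimal sketch:

from ultralytics import YOLO

model = YOLO("/home/ruihangxing/cache/yoloRe/yolov8n.pt")
# Per-run override of workers/batch rather than a global change in default.yaml.
model.train(data="mydataset.yaml", epochs=32, workers=0, batch=64)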