ultralytics / yolov3

YOLOv3 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Issue with training YOLOv3-tiny from scratch #2200

Closed by jackfaubshner 7 months ago

jackfaubshner commented 7 months ago

Bug

Hello,

I am trying to train YOLOv3-tiny from scratch for a small research project, but I have run into a strange issue.

I have made no changes to the code whatsoever.

I first tried it on a training workstation I have, which has the following specs:

CPU: AMD Ryzen Threadripper PRO 3955WX 16-Cores
GPU: 3 x NVIDIA RTX A6000 49140MiB
RAM: 256GB
OS: Ubuntu 20.04.4 LTS
Python: 3.8.10
CUDA Version: 12.0
torch: 2.2.1

I installed the packages from requirements.txt to make sure everything is up to date.

I cloned the repository, made no changes to it, and ran the following command. This is its output:

Test1@lambda-quad:~/Test1/PyTorch/yolov3$ python3 train.py --data coco.yaml --epochs 300 --weight '' --cfg yolov3-tiny.yaml --batch-size 128
train: weights=, cfg=yolov3-tiny.yaml, data=coco.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=300, batch_size=128, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
remote: Enumerating objects: 3, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 3 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (3/3), 2.40 KiB | 2.40 MiB/s, done.
From https://github.com/ultralytics/yolov5
 * [new branch]        snyk-fix-4e8d678da4d79a191be19b54afdad920 -> ultralytics/snyk-fix-4e8d678da4d79a191be19b54afdad920
github: ⚠️ YOLOv3 is out of date by 2779 commits. Use 'git pull ultralytics master' or 'git clone https://github.com/ultralytics/yolov5' to update.
YOLOv3 🚀 v9.6.0-168-gcff02836 Python-3.8.10 torch-2.2.1+cu121 CUDA:0 (NVIDIA RTX A6000, 48674MiB)

hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
Comet: run 'pip install comet_ml' to automatically track and visualize YOLOv3 🚀 runs in Comet
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/

                 from  n    params  module                                  arguments
  0                -1  1       464  models.common.Conv                      [3, 16, 3, 1]
  1                -1  1         0  torch.nn.modules.pooling.MaxPool2d      [2, 2, 0]
  2                -1  1      4672  models.common.Conv                      [16, 32, 3, 1]
  3                -1  1         0  torch.nn.modules.pooling.MaxPool2d      [2, 2, 0]
  4                -1  1     18560  models.common.Conv                      [32, 64, 3, 1]
  5                -1  1         0  torch.nn.modules.pooling.MaxPool2d      [2, 2, 0]
  6                -1  1     73984  models.common.Conv                      [64, 128, 3, 1]
  7                -1  1         0  torch.nn.modules.pooling.MaxPool2d      [2, 2, 0]
  8                -1  1    295424  models.common.Conv                      [128, 256, 3, 1]
  9                -1  1         0  torch.nn.modules.pooling.MaxPool2d      [2, 2, 0]
 10                -1  1   1180672  models.common.Conv                      [256, 512, 3, 1]
 11                -1  1         0  torch.nn.modules.padding.ZeroPad2d      [[0, 1, 0, 1]]
 12                -1  1         0  torch.nn.modules.pooling.MaxPool2d      [2, 1, 0]
 13                -1  1   4720640  models.common.Conv                      [512, 1024, 3, 1]
 14                -1  1    262656  models.common.Conv                      [1024, 256, 1, 1]
 15                -1  1   1180672  models.common.Conv                      [256, 512, 3, 1]
 16                -2  1     33024  models.common.Conv                      [256, 128, 1, 1]
 17                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 18           [-1, 8]  1         0  models.common.Concat                    [1]
 19                -1  1    885248  models.common.Conv                      [384, 256, 3, 1]
 20          [19, 15]  1    196350  models.yolo.Detect                      [80, [[10, 14, 23, 27, 37, 58], [81, 82, 135, 169, 344, 319]], [256, 512]]
yolov3-tiny summary: 49 layers, 8852366 parameters, 8852366 gradients, 13.3 GFLOPs

Fusing layers...
Segmentation fault (core dumped)

So I looked around the issues section and saw a few people suggest running train.py without any parameters to see how it behaves with the defaults. Below is what happened:

Test1@lambda-quad:~/Test1/PyTorch/Temp/yolov3$ python3 train.py
train: weights=yolov3-tiny.pt, cfg=, data=data/coco128.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=100, batch_size=16, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: ⚠️ YOLOv3 is out of date by 2779 commits. Use 'git pull ultralytics master' or 'git clone https://github.com/ultralytics/yolov5' to update.
YOLOv3 🚀 v9.6.0-168-gcff02836 Python-3.8.10 torch-2.2.1+cu121 CUDA:0 (NVIDIA RTX A6000, 48674MiB)

hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
Comet: run 'pip install comet_ml' to automatically track and visualize YOLOv3 🚀 runs in Comet
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/

                 from  n    params  module                                  arguments
  0                -1  1       464  models.common.Conv                      [3, 16, 3, 1]
  1                -1  1         0  torch.nn.modules.pooling.MaxPool2d      [2, 2, 0]
  2                -1  1      4672  models.common.Conv                      [16, 32, 3, 1]
  3                -1  1         0  torch.nn.modules.pooling.MaxPool2d      [2, 2, 0]
  4                -1  1     18560  models.common.Conv                      [32, 64, 3, 1]
  5                -1  1         0  torch.nn.modules.pooling.MaxPool2d      [2, 2, 0]
  6                -1  1     73984  models.common.Conv                      [64, 128, 3, 1]
  7                -1  1         0  torch.nn.modules.pooling.MaxPool2d      [2, 2, 0]
  8                -1  1    295424  models.common.Conv                      [128, 256, 3, 1]
  9                -1  1         0  torch.nn.modules.pooling.MaxPool2d      [2, 2, 0]
 10                -1  1   1180672  models.common.Conv                      [256, 512, 3, 1]
 11                -1  1         0  torch.nn.modules.padding.ZeroPad2d      [[0, 1, 0, 1]]
 12                -1  1         0  torch.nn.modules.pooling.MaxPool2d      [2, 1, 0]
 13                -1  1   4720640  models.common.Conv                      [512, 1024, 3, 1]
 14                -1  1    262656  models.common.Conv                      [1024, 256, 1, 1]
 15                -1  1   1180672  models.common.Conv                      [256, 512, 3, 1]
 16                -2  1     33024  models.common.Conv                      [256, 128, 1, 1]
 17                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 18           [-1, 8]  1         0  models.common.Concat                    [1]
 19                -1  1    885248  models.common.Conv                      [384, 256, 3, 1]
 20          [19, 15]  1    196350  models.yolo.Detect                      [80, [[10, 14, 23, 27, 37, 58], [81, 82, 135, 169, 344, 319]], [256, 512]]
Model summary: 49 layers, 8852366 parameters, 8852366 gradients, 13.3 GFLOPs

Transferred 71/71 items from yolov3-tiny.pt
Fusing layers...
Segmentation fault (core dumped)

Then I thought it might be an issue with the workstation, so I did a fresh install of Ubuntu 22.04.4 LTS on a laptop (CPU only, no GPU), cloned the repo, and this time first ran train.py without any parameters. It started training the model (I don't have the output of this run as I did not save it). I cancelled it with Ctrl + C after 10 minutes.

Laptop Specs:
CPU: Intel i5-5200U
GPU: None
RAM: 8GB
OS: Ubuntu 22.04.4 LTS
Python: 3.10.12
torch: 2.2.2 (as reported in the log below)

Again, I installed the packages from requirements.txt to make sure everything is up to date.

Then I ran "python3 train.py --data coco.yaml --epochs 300 --weight '' --cfg yolov3-tiny.yaml --batch-size 128" and it crashed again, but this time it just said "Killed". Below is the output:

test@test-Flex-3-1570:~/temp/yolov3$ python3 train.py --data coco.yaml --epochs 300 --weights '' --cfg yolov3-tiny.yaml  --batch-size 128
train: weights=, cfg=yolov3-tiny.yaml, data=coco.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=300, batch_size=128, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: ⚠️ YOLOv3 is out of date by 2779 commits. Use 'git pull ultralytics master' or 'git clone https://github.com/ultralytics/yolov5' to update.
YOLOv3 🚀 v9.6.0-168-gcff02836 Python-3.10.12 torch-2.2.2+cu121 CPU

hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
Comet: run 'pip install comet_ml' to automatically track and visualize YOLOv3 🚀 runs in Comet
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/

Dataset not found ⚠️, missing paths ['/home/test/temp/datasets/coco/val2017.txt']
Downloading https://github.com/ultralytics/yolov5/releases/download/v1.0/coco2017labels.zip to /home/test/temp/datasets/coco2017labels.zip...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 46.4M/46.4M [00:05<00:00, 9.36MB/s]
Unzipping /home/test/temp/datasets/coco2017labels.zip...
Downloading http://images.cocodataset.org/zips/train2017.zip to /home/test/temp/datasets/coco/images/train2017.zip...
Downloading http://images.cocodataset.org/zips/val2017.zip to /home/test/temp/datasets/coco/images/val2017.zip...
Downloading http://images.cocodataset.org/zips/test2017.zip to /home/test/temp/datasets/coco/images/test2017.zip...
Unzipping /home/test/temp/datasets/coco/images/val2017.zip...
Unzipping /home/test/temp/datasets/coco/images/test2017.zip...
Unzipping /home/test/temp/datasets/coco/images/train2017.zip...
Dataset download success ✅ (2836.0s), saved to /home/test/temp/datasets

                 from  n    params  module                                  arguments                     
  0                -1  1       464  models.common.Conv                      [3, 16, 3, 1]                 
  1                -1  1         0  torch.nn.modules.pooling.MaxPool2d      [2, 2, 0]                     
  2                -1  1      4672  models.common.Conv                      [16, 32, 3, 1]                
  3                -1  1         0  torch.nn.modules.pooling.MaxPool2d      [2, 2, 0]                     
  4                -1  1     18560  models.common.Conv                      [32, 64, 3, 1]                
  5                -1  1         0  torch.nn.modules.pooling.MaxPool2d      [2, 2, 0]                     
  6                -1  1     73984  models.common.Conv                      [64, 128, 3, 1]               
  7                -1  1         0  torch.nn.modules.pooling.MaxPool2d      [2, 2, 0]                     
  8                -1  1    295424  models.common.Conv                      [128, 256, 3, 1]              
  9                -1  1         0  torch.nn.modules.pooling.MaxPool2d      [2, 2, 0]                     
 10                -1  1   1180672  models.common.Conv                      [256, 512, 3, 1]              
 11                -1  1         0  torch.nn.modules.padding.ZeroPad2d      [[0, 1, 0, 1]]                
 12                -1  1         0  torch.nn.modules.pooling.MaxPool2d      [2, 1, 0]                     
 13                -1  1   4720640  models.common.Conv                      [512, 1024, 3, 1]             
 14                -1  1    262656  models.common.Conv                      [1024, 256, 1, 1]             
 15                -1  1   1180672  models.common.Conv                      [256, 512, 3, 1]              
 16                -2  1     33024  models.common.Conv                      [256, 128, 1, 1]              
 17                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 18           [-1, 8]  1         0  models.common.Concat                    [1]                           
 19                -1  1    885248  models.common.Conv                      [384, 256, 3, 1]              
 20          [19, 15]  1    196350  models.yolo.Detect                      [80, [[10, 14, 23, 27, 37, 58], [81, 82, 135, 169, 344, 319]], [256, 512]]
yolov3-tiny summary: 49 layers, 8852366 parameters, 8852366 gradients, 13.3 GFLOPs

optimizer: SGD(lr=0.01) with parameter groups 11 weight(decay=0.0), 13 weight(decay=0.001), 13 bias
train: Scanning /home/test/temp/datasets/coco/train2017... 117266 images, 1021 backgrounds, 0 corrupt: 100%|██████████| 118287/118287 [01:14<00:00, 1597.25it/s]
train: WARNING ⚠️ /home/test/temp/datasets/coco/images/train2017/000000099844.jpg: 2 duplicate labels removed
train: WARNING ⚠️ /home/test/temp/datasets/coco/images/train2017/000000201706.jpg: 1 duplicate labels removed
train: WARNING ⚠️ /home/test/temp/datasets/coco/images/train2017/000000214087.jpg: 1 duplicate labels removed
train: WARNING ⚠️ /home/test/temp/datasets/coco/images/train2017/000000522365.jpg: 1 duplicate labels removed
train: New cache created: /home/test/temp/datasets/coco/train2017.cache
val: Scanning /home/test/temp/datasets/coco/val2017... 4952 images, 48 backgrounds, 0 corrupt: 100%|██████████| 5000/5000 [00:08<00:00, 586.57it/s]
val: New cache created: /home/test/temp/datasets/coco/val2017.cache

AutoAnchor: 2.93 anchors/target, 0.992 Best Possible Recall (BPR). Current anchors are a good fit to dataset ✅
Plotting labels to runs/train/exp2/labels.jpg... 
Image sizes 640 train, 640 val
Using 4 dataloader workers
Logging results to runs/train/exp2
Starting training for 300 epochs...

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
  0%|          | 0/925 [00:00<?, ?it/s]Killed

Then I ran train.py without parameters (which automatically defaulted to YOLOv3-tiny) and the following happened:

test@test-Flex-3-1570:~/temp/yolov3$ python3 train.py 
train: weights=yolov3-tiny.pt, cfg=, data=data/coco128.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=100, batch_size=16, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: ⚠️ YOLOv3 is out of date by 2779 commits. Use 'git pull ultralytics master' or 'git clone https://github.com/ultralytics/yolov5' to update.
YOLOv3 🚀 v9.6.0-168-gcff02836 Python-3.10.12 torch-2.2.2+cu121 CPU

hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
Comet: run 'pip install comet_ml' to automatically track and visualize YOLOv3 🚀 runs in Comet
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/

                 from  n    params  module                                  arguments                     
  0                -1  1       464  models.common.Conv                      [3, 16, 3, 1]                 
  1                -1  1         0  torch.nn.modules.pooling.MaxPool2d      [2, 2, 0]                     
  2                -1  1      4672  models.common.Conv                      [16, 32, 3, 1]                
  3                -1  1         0  torch.nn.modules.pooling.MaxPool2d      [2, 2, 0]                     
  4                -1  1     18560  models.common.Conv                      [32, 64, 3, 1]                
  5                -1  1         0  torch.nn.modules.pooling.MaxPool2d      [2, 2, 0]                     
  6                -1  1     73984  models.common.Conv                      [64, 128, 3, 1]               
  7                -1  1         0  torch.nn.modules.pooling.MaxPool2d      [2, 2, 0]                     
  8                -1  1    295424  models.common.Conv                      [128, 256, 3, 1]              
  9                -1  1         0  torch.nn.modules.pooling.MaxPool2d      [2, 2, 0]                     
 10                -1  1   1180672  models.common.Conv                      [256, 512, 3, 1]              
 11                -1  1         0  torch.nn.modules.padding.ZeroPad2d      [[0, 1, 0, 1]]                
 12                -1  1         0  torch.nn.modules.pooling.MaxPool2d      [2, 1, 0]                     
 13                -1  1   4720640  models.common.Conv                      [512, 1024, 3, 1]             
 14                -1  1    262656  models.common.Conv                      [1024, 256, 1, 1]             
 15                -1  1   1180672  models.common.Conv                      [256, 512, 3, 1]              
 16                -2  1     33024  models.common.Conv                      [256, 128, 1, 1]              
 17                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 18           [-1, 8]  1         0  models.common.Concat                    [1]                           
 19                -1  1    885248  models.common.Conv                      [384, 256, 3, 1]              
 20          [19, 15]  1    196350  models.yolo.Detect                      [80, [[10, 14, 23, 27, 37, 58], [81, 82, 135, 169, 344, 319]], [256, 512]]
Model summary: 49 layers, 8852366 parameters, 8852366 gradients, 13.3 GFLOPs

Transferred 71/71 items from yolov3-tiny.pt
optimizer: SGD(lr=0.01) with parameter groups 11 weight(decay=0.0), 13 weight(decay=0.0005), 13 bias
train: Scanning /home/test/temp/datasets/coco128/labels/train2017.cache... 126 images, 2 backgrounds, 0 corrupt: 100%|██████████| 128/128 [00:00<?, ?it/s]
val: Scanning /home/test/temp/datasets/coco128/labels/train2017.cache... 126 images, 2 backgrounds, 0 corrupt: 100%|██████████| 128/128 [00:00<?, ?it/s]

AutoAnchor: 2.86 anchors/target, 0.988 Best Possible Recall (BPR). Current anchors are a good fit to dataset ✅
Plotting labels to runs/train/exp3/labels.jpg... 
Image sizes 640 train, 640 val
Using 4 dataloader workers
Logging results to runs/train/exp3
Starting training for 100 epochs...

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
       0/99         0G    0.06398     0.1862    0.02532        197        640:  12%|█▎        | 1/8 [00:13<01:36, 13.72s/it]Killed

I'm not sure what is wrong, but it feels like this is a YOLOv3-tiny training issue?

Apologies for not making a pull request; I don't want to mess things up.

github-actions[bot] commented 7 months ago

👋 Hello @jackfaubshner, thank you for your interest in YOLOv3 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide a minimum reproducible example to help us debug it.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results.

Requirements

Python>=3.7.0 with all requirements.txt installed including PyTorch>=1.7. To get started:

git clone https://github.com/ultralytics/yolov3  # clone
cd yolov3
pip install -r requirements.txt  # install

Environments

YOLOv3 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

YOLOv3 CI

If this badge is green, all YOLOv3 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv3 training, validation, inference, export and benchmarks on macOS, Windows, and Ubuntu every 24 hours and on every commit.

Introducing YOLOv8 🚀

We're excited to announce the launch of our latest state-of-the-art (SOTA) object detection model for 2023 - YOLOv8 🚀!

Designed to be fast, accurate, and easy to use, YOLOv8 is an ideal choice for a wide range of object detection, image segmentation and image classification tasks. With YOLOv8, you'll be able to quickly and accurately detect objects in real-time, streamline your workflows, and achieve new levels of accuracy in your projects.

Check out our YOLOv8 Docs for details and get started with:

pip install ultralytics

glenn-jocher commented 7 months ago

@jackfaubshner hello!

Thank you for the detailed issue description. It looks like you're encountering two separate problems when training YOLOv3-tiny: segmentation faults on your GPU workstation and the "Killed" message on your CPU-only laptop.

  1. Segmentation Faults: These usually point to the environment rather than the model itself. Make sure your PyTorch build, CUDA runtime, and GPU driver are compatible. Reducing the batch size is also worth a try.

  2. "Killed" Message: This typically means the process ran out of memory, which is likely on a system with limited resources like your 8GB CPU-only laptop. Training needs a considerable amount of RAM, and when a large batch size exhausts it, the OS terminates the process. Try reducing the --batch-size (e.g., to 16 or 32) and see if that solves the issue.
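To give a feel for why batch size matters so much here, below is a rough back-of-the-envelope sketch, not a measurement. It assumes float32 tensors (4 bytes per element) and that layer 0 from the model summary (Conv [3, 16, 3, 1], stride 1) keeps the 640 x 640 resolution. The helper name tensor_bytes is made up for this illustration and is not part of the repository.

```python
def tensor_bytes(batch, channels, height, width, bytes_per_element=4):
    """Size in bytes of a dense (batch, channels, height, width) tensor."""
    return batch * channels * height * width * bytes_per_element


MIB = 1024 ** 2

for batch in (128, 32, 16):
    # Input images: 3 channels at the default imgsz=640.
    inputs = tensor_bytes(batch, 3, 640, 640)
    # Layer 0 output: 16 channels; assumed to stay 640 x 640 (stride 1, "same" padding).
    layer0 = tensor_bytes(batch, 16, 640, 640)
    print(f"batch {batch:>3}: input {inputs / MIB:.0f} MiB, "
          f"layer-0 output {layer0 / MIB:.0f} MiB")
```

Under these assumptions a batch of 128 needs about 600 MiB for the input tensor and over 3 GiB for the first layer's output alone, before any other activations, gradients, or dataloader workers are counted. That is already a large share of an 8GB machine.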

Lastly, it's essential to keep your repository up to date, as mentioned in your logs. Though the message points towards cloning YOLOv5, it's just about ensuring your YOLOv3 version is current. For detailed investigations and advanced troubleshooting, consult the documentation at https://docs.ultralytics.com.

Keep in mind, the YOLO community and we at Ultralytics are here to help, and we appreciate your contribution to making YOLOv3 better! 🚀

jackfaubshner commented 7 months ago

Thank you kind sir, I was able to get it to work on my old laptop, not that I am going to train on it, just wanna check if the code works. It would probably take more time to train on that laptop than the heat death of the universe

Also, yes, the issue with the workstation is probably with CUDA. It reports CUDA 12.0, and as far as I can tell there is no PyTorch build for that exact version (the installed wheel is cu121).
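For anyone checking the same thing, here is a minimal sketch that prints which CUDA version the installed PyTorch wheel was built against. The helper name describe_environment is hypothetical and not part of the repository; the only assumption is that PyTorch may or may not be importable. Note that the "CUDA Version" shown by nvidia-smi is the highest version the driver supports, which is a different number from the CUDA runtime bundled with the wheel.

```python
import platform


def describe_environment():
    """Collect the version details that matter for a PyTorch/CUDA mismatch."""
    info = {
        "python": platform.python_version(),
        "torch": None,
        "torch_cuda_build": None,  # CUDA version the wheel was compiled against
        "cuda_available": False,
    }
    try:
        import torch
    except ImportError:
        # PyTorch is not installed; report what we have.
        return info
    info["torch"] = torch.__version__
    info["torch_cuda_build"] = torch.version.cuda  # None for CPU-only builds
    info["cuda_available"] = torch.cuda.is_available()
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If torch_cuda_build is newer than what the driver supports, or cuda_available comes back False on a machine with GPUs, that would be a hint that the problem is in the environment rather than in train.py.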

I'm gonna close this issue but is there any parameter I can add to the command below to train it on CPU only? Cause I don't think I can change the CUDA version on this workstation as other people are using it.

python3 train.py --data coco.yaml --epochs 300 --weight '' --cfg yolov3-tiny.yaml --batch-size 128

glenn-jocher commented 7 months ago

@jackfaubshner, great to hear you got it working on your laptop, even if just for a test! Regarding training on the CPU, you can indeed run your training on a CPU by specifying the device. Just add --device cpu to your command like so:

python3 train.py --data coco.yaml --epochs 300 --weights '' --cfg yolov3-tiny.yaml --batch-size 128 --device cpu

This tells the script to ignore any GPUs and run the training process on the CPU only. Keep in mind, as you've probably guessed, training on a CPU is significantly slower than on GPUs. 😊
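As a rough illustration of how slow that can be, here is a back-of-the-envelope estimate using only numbers from the laptop logs earlier in this thread: 118287 COCO training images, 300 epochs, and 13.72 s/it at batch size 16. It assumes throughput stays constant and ignores validation; the single logged iteration is also the first one, so the real figure may differ considerably. The function names are made up for this sketch.

```python
import math


def batches_per_epoch(num_images, batch_size):
    """Number of iterations the progress bar shows for one epoch."""
    return math.ceil(num_images / batch_size)


def estimated_days(num_images, epochs, seconds_per_batch, batch_size):
    """Total training time in days, assuming a constant time per image."""
    seconds_per_image = seconds_per_batch / batch_size
    return num_images * epochs * seconds_per_image / 86_400


# Matches the "0/925" and "1/8" progress bars in the logs above.
print(batches_per_epoch(118287, 128), batches_per_epoch(128, 16))

# Full COCO for 300 epochs at the laptop's logged speed.
print(f"{estimated_days(118287, 300, 13.72, 16):.0f} days")
```

Under these assumptions the laptop would need roughly a year for the full run, so CPU training is mainly useful for checking that the pipeline works.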

Should you have any more questions or run into issues, feel free to ask. Happy training! 🚀