ultralytics / yolov3

YOLOv3 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Issue in dataloader when training YOLOv3 #2075

Closed: sadimanna closed this issue 1 year ago

sadimanna commented 1 year ago

YOLOv3 Component

Training

Bug

In my attempt to train yolov3 on coco128 I ran into this issue:

Command: python ./yolov3/train.py --data ./yolov3/data/coco128.yaml --epochs 30 --weights '' --cfg ./yolov3/models/yolov3.yaml --batch-size -1 --workers 0

 Params      GFLOPs  GPU_mem (GB)  forward (ms) backward (ms)                   input                  output
    61949149       156.6         1.330         30.58         15.63        (1, 3, 640, 640)                    list
    61949149       313.2         1.988         17.61         17.31        (2, 3, 640, 640)                    list
    61949149       626.5         3.320         32.43         28.04        (4, 3, 640, 640)                    list
    61949149        1253         6.008         61.82         53.21        (8, 3, 640, 640)                    list
    61949149        2506        11.260           123         102.1       (16, 3, 640, 640)                    list
train: weights='', cfg=./yolov3/models/yolov3.yaml, data=./yolov3/data/coco128.yaml, hyp=yolov3\data\hyps\hyp.scratch-low.yaml, epochs=30, batch_size=-1, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=0, project=yolov3\runs\train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github:  YOLOv3 is out of date by 2662 commits. Use `git pull ultralytics master` or `git clone https://github.com/ultralytics/yolov5` to update.
YOLOv3  v9.6.0-86-g9a05787d Python-3.9.7 torch-1.10.1 CUDA:0 (NVIDIA RTX A5000, 24564MiB)

hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
ClearML: run 'pip install clearml' to automatically track, visualize and remotely train YOLOv3  in ClearML
Comet: run 'pip install comet_ml' to automatically track and visualize YOLOv3  runs in Comet
TensorBoard: Start with 'tensorboard --logdir yolov3\runs\train', view at http://localhost:6006/

                 from  n    params  module                                  arguments                     
  0                -1  1       928  models.common.Conv                      [3, 32, 3, 1]                 
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]                
  2                -1  1     20672  models.common.Bottleneck                [64, 64]                      
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]               
  4                -1  2    164608  models.common.Bottleneck                [128, 128]                    
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]              
  6                -1  8   2627584  models.common.Bottleneck                [256, 256]                    
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]              
  8                -1  8  10498048  models.common.Bottleneck                [512, 512]                    
  9                -1  1   4720640  models.common.Conv                      [512, 1024, 3, 2]             
 10                -1  4  20983808  models.common.Bottleneck                [1024, 1024]                  
 11                -1  1   5245952  models.common.Bottleneck                [1024, 1024, False]           
 12                -1  1    525312  models.common.Conv                      [1024, 512, 1, 1]             
 13                -1  1   4720640  models.common.Conv                      [512, 1024, 3, 1]             
 14                -1  1    525312  models.common.Conv                      [1024, 512, 1, 1]             
 15                -1  1   4720640  models.common.Conv                      [512, 1024, 3, 1]             
 16                -2  1    131584  models.common.Conv                      [512, 256, 1, 1]              
 17                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 18           [-1, 8]  1         0  models.common.Concat                    [1]                           
 19                -1  1   1377792  models.common.Bottleneck                [768, 512, False]             
 20                -1  1   1312256  models.common.Bottleneck                [512, 512, False]             
 21                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]              
 22                -1  1   1180672  models.common.Conv                      [256, 512, 3, 1]              
 23                -2  1     33024  models.common.Conv                      [256, 128, 1, 1]              
 24                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 25           [-1, 6]  1         0  models.common.Concat                    [1]                           
 26                -1  1    344832  models.common.Bottleneck                [384, 256, False]             
 27                -1  2    656896  models.common.Bottleneck                [256, 256, False]             
 28      [27, 22, 15]  1    457725  models.yolo.Detect                      [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [256, 512, 1024]]
yolov3 summary: 262 layers, 61949149 parameters, 61949149 gradients, 156.6 GFLOPs

AMP: checks passed 
AutoBatch: Computing optimal batch size for --imgsz 640
AutoBatch: CUDA:0 (NVIDIA RTX A5000) 23.99G total, 0.62G reserved, 0.46G allocated, 22.90G free
AutoBatch: Using batch-size 26 for CUDA:0 18.99G/23.99G (79%) 
optimizer: SGD(lr=0.01) with parameter groups 72 weight(decay=0.0), 75 weight(decay=0.00040625000000000004), 75 bias

train: Scanning C:\Users\ISI_UTS\Siladittya\MIDA2023\datasets\coco128\labels\train2017.cache... 126 images, 2 backgrounds, 0 corrupt: 100%|##########| 128/128 [00:00<?, ?it/s]

val: Scanning C:\Users\ISI_UTS\Siladittya\MIDA2023\datasets\coco128\labels\train2017.cache... 126 images, 2 backgrounds, 0 corrupt: 100%|##########| 128/128 [00:00<?, ?it/s]

AutoAnchor: 4.27 anchors/target, 0.994 Best Possible Recall (BPR). Current anchors are a good fit to dataset 
Plotting labels to yolov3\runs\train\exp5\labels.jpg... 
Image sizes 640 train, 640 val
Using 0 dataloader workers
Logging results to yolov3\runs\train\exp5
Starting training for 30 epochs...

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size

  0%|          | 0/5 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "C:\Users\ISI_UTS\Siladittya\MIDA2023\yolov3\train.py", line 637, in <module>
    main(opt)
  File "C:\Users\ISI_UTS\Siladittya\MIDA2023\yolov3\train.py", line 530, in main
    train(opt.hyp, opt, device, callbacks)
  File "C:\Users\ISI_UTS\Siladittya\MIDA2023\yolov3\train.py", line 285, in train
    for i, (imgs, targets, paths, _) in pbar:  # batch -------------------------------------------------------------
  File "C:\Users\ISI_UTS\anaconda3\lib\site-packages\tqdm\std.py", line 1178, in __iter__
    for obj in iterable:
  File "C:\Users\ISI_UTS\Siladittya\MIDA2023\yolov3\utils\dataloaders.py", line 172, in __iter__
    yield next(self.iterator)
  File "C:\Users\ISI_UTS\anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 521, in __next__
    data = self._next_data()
  File "C:\Users\ISI_UTS\anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 560, in _next_data
    index = self._next_index()  # may raise StopIteration
  File "C:\Users\ISI_UTS\anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 512, in _next_index
    return next(self._sampler_iter)  # may raise StopIteration
  File "C:\Users\ISI_UTS\Siladittya\MIDA2023\yolov3\utils\dataloaders.py", line 187, in __iter__
    yield from iter(self.sampler)
  File "C:\Users\ISI_UTS\anaconda3\lib\site-packages\torch\utils\data\sampler.py", line 229, in __iter__
    for idx in self.sampler:
  File "C:\Users\ISI_UTS\anaconda3\lib\site-packages\torch\utils\data\sampler.py", line 126, in __iter__
    yield from torch.randperm(n, generator=generator, device='cuda').tolist()
RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'

I tried passing device='cuda' to generator = torch.Generator() (line 143 of utils/dataloaders.py), but then I started getting a different error:

train: Scanning C:\Users\ISI_UTS\Siladittya\MIDA2023\datasets\coco128\labels\train2017.cache... 126 images, 2 backgrounds, 0 corrupt: 100%|##########| 128/128 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "C:\Users\ISI_UTS\Siladittya\MIDA2023\yolov3\train.py", line 637, in <module>
    main(opt)
  File "C:\Users\ISI_UTS\Siladittya\MIDA2023\yolov3\train.py", line 530, in main
    train(opt.hyp, opt, device, callbacks)
  File "C:\Users\ISI_UTS\Siladittya\MIDA2023\yolov3\train.py", line 189, in train
    train_loader, dataset = create_dataloader(train_path,
  File "C:\Users\ISI_UTS\Siladittya\MIDA2023\yolov3\utils\dataloaders.py", line 145, in create_dataloader
    return loader(dataset,
  File "C:\Users\ISI_UTS\Siladittya\MIDA2023\yolov3\utils\dataloaders.py", line 165, in __init__
    self.iterator = super().__iter__()
  File "C:\Users\ISI_UTS\anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 359, in __iter__
    return self._get_iterator()
  File "C:\Users\ISI_UTS\anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 302, in _get_iterator
    return _SingleProcessDataLoaderIter(self)
  File "C:\Users\ISI_UTS\anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 552, in __init__
    super(_SingleProcessDataLoaderIter, self).__init__(loader)
  File "C:\Users\ISI_UTS\anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 498, in __init__
    self._base_seed = torch.empty((), dtype=torch.int64).random_(generator=loader.generator).item()
RuntimeError: Expected a 'cpu' device type for generator but found 'cuda'
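
The two tracebacks hit the same constraint from opposite sides: a generator passed to a random op has to live on the device that op runs on, while the DataLoader draws its base seed on the CPU. Note also that the first traceback shows randperm being called with device='cuda' inside torch's own sampler.py; as far as I know, stock PyTorch does not pass a device there, so that line may have been edited locally. Below is a minimal sketch of the constraint, assuming only stock PyTorch. make_generator is a hypothetical helper for illustration, not part of the YOLOv3 code.

```python
import torch


def make_generator(device: str = "cpu", seed: int = 0) -> torch.Generator:
    """Hypothetical helper: build a seeded generator on the given device."""
    gen = torch.Generator(device=device)
    gen.manual_seed(seed)
    return gen


# A CPU generator works with an op that produces a CPU tensor.
cpu_gen = make_generator("cpu")
perm = torch.randperm(8, generator=cpu_gen)

# The DataLoader iterator draws its base seed like this (see the second
# traceback), so the generator given to the DataLoader is used on the CPU.
base_seed = torch.empty((), dtype=torch.int64).random_(generator=cpu_gen).item()

# Mixing devices is what the first traceback shows. Only attempted when a
# GPU is present; the error, if raised, is printed rather than propagated.
if torch.cuda.is_available():
    try:
        torch.randperm(8, generator=cpu_gen, device="cuda")
    except RuntimeError as err:
        print("device mismatch:", err)
```

If this reading is right, moving the generator to 'cuda' only shifts the mismatch from the sampler to the base-seed draw, which matches the second error.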

Environment

Minimal Reproducible Example

python ./yolov3/train.py --data ./yolov3/data/coco128.yaml --epochs 30 --weights '' --cfg ./yolov3/models/yolov3.yaml  --batch-size -1 --workers 0


github-actions[bot] commented 1 year ago

👋 Hello @sadimanna, thank you for your interest in YOLOv3 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide a minimum reproducible example to help us debug it.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results.

Requirements

Python>=3.7.0 with all requirements.txt installed including PyTorch>=1.7. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Environments

YOLOv3 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

YOLOv3 CI

If this badge is green, all YOLOv3 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv3 training, validation, inference, export and benchmarks on macOS, Windows, and Ubuntu every 24 hours and on every commit.

Introducing YOLOv8 🚀

We're excited to announce the launch of our latest state-of-the-art (SOTA) object detection model for 2023 - YOLOv8 🚀!

Designed to be fast, accurate, and easy to use, YOLOv8 is an ideal choice for a wide range of object detection, image segmentation and image classification tasks. With YOLOv8, you'll be able to quickly and accurately detect objects in real-time, streamline your workflows, and achieve new levels of accuracy in your projects.

Check out our YOLOv8 Docs for details and get started with:

pip install ultralytics

sadimanna commented 1 year ago

I am trying to train a YOLOv3, not a YOLOv5.

I get the same error when using the code from the ultralytics/yolov5 repo.

A thread on Stack Overflow says that turning shuffling off (setting shuffle=False) solves the issue, but that affects performance.

Is there any other way around this issue?
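
For context on why shuffle=False sidesteps the error: with a stock torch.utils.data.DataLoader, shuffle=False selects a SequentialSampler, which draws no random numbers and so never touches the generator, whereas shuffle=True selects a RandomSampler. A small sketch of that difference, assuming the stock DataLoader and a toy TensorDataset rather than the repo's own loader and dataset classes:

```python
import torch
from torch.utils.data import (
    DataLoader,
    RandomSampler,
    SequentialSampler,
    TensorDataset,
)

# Toy dataset standing in for the real one (assumption for illustration).
dataset = TensorDataset(torch.arange(6))

ordered = DataLoader(dataset, batch_size=2, shuffle=False)
shuffled = DataLoader(dataset, batch_size=2, shuffle=True)

# shuffle=False -> SequentialSampler: indices come out in order, no RNG used.
print(type(ordered.sampler).__name__)
# shuffle=True -> RandomSampler: indices come from a random permutation.
print(type(shuffled.sampler).__name__)

# Each batch is a one-element list because the dataset wraps a single tensor.
ordered_items = [int(x) for (batch,) in ordered for x in batch]
```

The cost is that every epoch then sees the images in the same order, which is the performance concern raised above.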

glenn-jocher commented 1 year ago

@sadimanna hello,

Thank you for your question. To fix this issue, we recommend setting torch.backends.cudnn.enabled to True. Another solution is to set the batch size to 1, or to use a smaller dataset for training.

Please let us know if you have any other questions or concerns.

Best,

sadimanna commented 1 year ago

Hi @glenn-jocher

Setting torch.backends.cudnn.enabled = True in train.py did not work, unfortunately. Also, reducing the batch-size to 1 gives the same error.

I set generator = None and commented out the following line from dataloaders.py and now it is working fine, with torch.backends.cudnn.enabled still set to True.

P.S.: I have a single GPU system. Does that make any difference?
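
The generator = None workaround can be sketched as follows, assuming a stock torch.utils.data.DataLoader rather than the repo's own loader class. With no explicit generator, the shuffle order is seeded from PyTorch's global RNG, so calling torch.manual_seed beforehand should keep runs repeatable. shuffled_order is a hypothetical helper for illustration only.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def shuffled_order(seed: int) -> list:
    """Hypothetical helper: return one epoch's sample order for a given seed."""
    # Seed the global RNG; with generator=None the sampler derives its
    # own seed from it.
    torch.manual_seed(seed)
    loader = DataLoader(
        TensorDataset(torch.arange(16)),
        batch_size=4,
        shuffle=True,
        generator=None,  # no explicit generator, as in the workaround
    )
    return [int(x) for (batch,) in loader for x in batch]


first = shuffled_order(0)
second = shuffled_order(0)
print(first == second)
```

This keeps shuffling enabled; what is given up is a dedicated generator that is seeded independently of the global RNG.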

glenn-jocher commented 1 year ago

@sadimanna I would try starting from one of our examples, e.g. our Colab notebook. Once that works for you, you can train on your own data instead of COCO128. See https://colab.research.google.com/github/ultralytics/yolov3/blob/master/tutorial.ipynb?hl=en

github-actions[bot] commented 1 year ago

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐