ultralytics / yolov3

YOLOv3 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Runtime error when training YOLOv3 #2002

Closed davendramaharaj1 closed 1 year ago

davendramaharaj1 commented 1 year ago

YOLOv3 Component

Training

Bug

Hi there,

I am trying to train yolov3 on a custom dataset locally. I am using my own RTX 3060 for training. I ran train.py with the following command:

!python3 train.py --data data/roboflow.data --epochs 150 --weights=<path_to_weights_dir>/yolov3-spp-ultralytics.pt

I have verified that the paths to my training and validation data are correct. On my Jupyter notebook, I see the following printed to the console:

Namespace(epochs=150, batch_size=16, accumulate=4, cfg='cfg/yolov3-spp.cfg', data='data/roboflow.data', multi_scale=False, img_size=[416], rect=False, resume=False, nosave=False, notest=False, evolve=False, bucket='', cache_images=False, weights='<path_to_weights_dir>/yolov3-spp-ultralytics.pt', name='', device='', adam=False, single_cls=False, var=None)
Using CUDA device0 _CudaDeviceProperties(name='NVIDIA GeForce RTX 3060', total_memory=12030MB)

WARNING: smart bias initialization failure.
WARNING: smart bias initialization failure.
WARNING: smart bias initialization failure.
Model Summary: 225 layers, 6.29987e+07 parameters, 6.29987e+07 gradients
Caching labels (13299 found, 0 missing, 391 empty, 0 duplicate, for 13690 images
Caching labels (3302 found, 0 missing, 120 empty, 0 duplicate, for 3422 images):
Using 8 dataloader workers
Starting training for 150 epochs...

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
  0%|                                                   | 0/856 [00:00<?, ?it/s]/home/$USER/.local/lib/python3.10/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3190.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
  0%|                                                   | 0/856 [00:03<?, ?it/s]
Traceback (most recent call last):
  File "<path_to_yolo>/yolov3/train.py", line 433, in <module>
    train()  # train normally
  File "<path_to_yolo>/yolov3-pytorch/yolov3/train.py", line 275, in train
    loss, loss_items = compute_loss(pred, targets, model)
  File "<path_to_yolo>/yolov3-pytorch/yolov3/utils/utils.py", line 378, in compute_loss
    tcls, tbox, indices, anchor_vec = build_targets(model, targets)
  File "<path_to_yolo>/yolov3/utils/utils.py", line 474, in build_targets
    a = a[j]
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

When I debugged line 474 of utils/utils.py, I found that j was on 'cuda' while a was on 'cpu', which is the mismatch the RuntimeError reports. I am not sure how to resolve this. Please help as soon as possible.

Thanks!

[Attached images: j_gpu, a_cpu]

github-actions[bot] commented 1 year ago

👋 Hello @davendramaharaj1, thank you for your interest in YOLOv3 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we cannot help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://ultralytics.com or email Glenn Jocher at glenn.jocher@ultralytics.com.

Requirements

Python>=3.6.0 with all requirements.txt installed including PyTorch>=1.7. To get started:

$ git clone https://github.com/ultralytics/yolov3
$ cd yolov3
$ pip install -r requirements.txt

Status

CI CPU testing

If this badge is green, all YOLOv3 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv3 training (train.py), validation (val.py), inference (detect.py) and export (export.py) on macOS, Windows, and Ubuntu every 24 hours and on every commit.

github-actions[bot] commented 1 year ago

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv3 🚀 and Vision AI ⭐!

glenn-jocher commented 10 months ago

@davendramaharaj1 thanks for reaching out. The error message seems to indicate a device mismatch. It appears that the index tensor j is on the 'cuda' device, while the indexed tensor a is on the 'cpu' device, causing the RuntimeError. You can resolve this by ensuring that both j and a are on the same device before running the operation. Feel free to make this adjustment in the code and let me know if you encounter any further issues.
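
The adjustment described above can be sketched as follows. This is a minimal illustration, not the repository's code: `index_on_same_device` is a hypothetical helper, and the tensors `a` and `j` are small stand-ins for the ones at line 474 of utils/utils.py, whose construction is not shown in this thread. The idea is simply to move the indexed tensor to the device of the index tensor before indexing.

```python
import torch


def index_on_same_device(a: torch.Tensor, j: torch.Tensor) -> torch.Tensor:
    """Index `a` with `j` after moving `a` to the device that `j` lives on.

    Hypothetical helper for illustration; it is not part of the YOLOv3 repo.
    """
    if a.device != j.device:
        a = a.to(j.device)  # e.g. cpu -> cuda:0, matching the mask
    return a[j]


# Use the GPU when one is present, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

a = torch.arange(6)  # by default, torch.arange creates its result on the CPU
j = torch.tensor([True, False, True, False, True, False], device=device)  # boolean mask

out = index_on_same_device(a, j)
print(out.tolist(), out.device)
```

In `build_targets`, the equivalent change would presumably be `a = a.to(j.device)[j]`, or creating `a` on the right device in the first place by passing `device=` when it is built. Both are assumptions about code that is not quoted here, so check them against your local copy of utils/utils.py.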

The YOLOv3 community and the Ultralytics team are always available to assist you.