YOLOv7 Notebook: train.py RuntimeError

mo-traor3-ai commented 1 year ago

Search before asking

[X] I have searched the Roboflow Notebooks issues and found no similar bug report.

Notebook name

YOLOv7 PyTorch Object Detection: https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/train-yolov7-object-detection-on-custom-data.ipynb

My version: https://colab.research.google.com/drive/1Ky1iXpECpx8HJAIvnAXNhnqVy3rFixOi?usp=sharing

`1BugReport_Mtraore_Training YOLOv7 on Custom Data` is the title of my version.

Bug

The YOLOv7 notebook does not train with the --device flag set to 0, 1, 2, or 3; it only trains when the flag is set to 'cpu'.

This is a problem because with --device 'cpu' the batch processing time becomes very long (close to an hour), and training is meant to run, and is most efficient, on a GPU. The --device flag is supposed to recognize the CUDA device. In this case the correct setting is in fact --device 0, but that setting is what causes the train.py command to fail.

Here's a Loom video of the process that walks through the lines of code the RuntimeError points to: https://www.loom.com/share/393ec489c1d54698ba9626c7e1be897b

Traceback (most recent call last):
  File "train.py", line 616, in <module>
    train(hyp, opt, device, tb_writer)
  File "train.py", line 363, in train
    loss, loss_items = compute_loss_ota(pred, targets.to(device), imgs)  # loss scaled by batch_size
  File "/content/yolov7/utils/loss.py", line 585, in __call__
    bs, as_, gjs, gis, targets, anchors = self.build_targets(p, targets, imgs)
  File "/content/yolov7/utils/loss.py", line 759, in build_targets
    from_which_layer = from_which_layer[fg_mask_inboxes]
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

Environment

Google Colab Pro

NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2
A100-SXM4-40GB

Minimal Reproducible Example

# run this cell to begin training
%cd /content/yolov7
!python train.py --batch 16 --epochs 55 --data /content/yolov7/Face-Detection-15/data.yaml --weights 'yolov7_training.pt' --device 0

Additional

This is the error (and what was logged just before the error):

/usr/local/lib/python3.8/dist-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3190.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Model Summary: 415 layers, 37196556 parameters, 37196556 gradients, 105.1 GFLOPS

Transferred 557/566 items from yolov7_training.pt
Scaled weight_decay = 0.0005
Optimizer groups: 95 .bias, 95 conv.weight, 98 other
train: Scanning 'Face-Detection-15/train/labels.cache' images and labels... 2871 found, 0 missing, 539 empty, 0 corrupted: 100% 2871/2871 [00:00<?, ?it/s]
val: Scanning 'Face-Detection-15/valid/labels.cache' images and labels... 267 found, 0 missing, 50 empty, 0 corrupted: 100% 267/267 [00:00<?, ?it/s]

autoanchor: Analyzing anchors... anchors/target = 4.67, Best Possible Recall (BPR) = 0.9981
Image sizes 640 train, 640 test
Using 8 dataloader workers
Logging results to runs/train/exp4
Starting training for 55 epochs...

     Epoch   gpu_mem       box       obj       cls     total    labels  img_size
  0% 0/180 [00:07<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 616, in <module>
    train(hyp, opt, device, tb_writer)
  File "train.py", line 363, in train
    loss, loss_items = compute_loss_ota(pred, targets.to(device), imgs)  # loss scaled by batch_size
  File "/content/yolov7/utils/loss.py", line 585, in __call__
    bs, as_, gjs, gis, targets, anchors = self.build_targets(p, targets, imgs)
  File "/content/yolov7/utils/loss.py", line 759, in build_targets
    from_which_layer = from_which_layer[fg_mask_inboxes]
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

Are you willing to submit a PR?

[ ] Yes I'd like to help by submitting a PR!

mo-traor3-ai commented 1 year ago

I sent a PR to YOLOv7 for this. The issue was in line 685 of loss.py, plus a missing line after line 756 of loss.py.

https://github.com/WongKinYiu/yolov7/pull/1283

More from the forum and Stack Overflow on this:

# replace line 685 of utils/loss.py with:
from_which_layer.append((torch.ones(size=(len(b),)) * i).to('cuda'))

# also add this line after line 756 of utils/loss.py:
fg_mask_inboxes = fg_mask_inboxes.to(torch.device('cuda'))

After testing train.py with the suggested fixes to loss.py, my training runs smoothly.
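
The two edits above hard-code 'cuda'. A variant that follows the device of the tensors involved might also keep CPU runs working. The block below is only a minimal sketch of that idea, not the code from the PR: `build_layer_ids` and `select_foreground` are hypothetical helper names that do not exist in YOLOv7, and the sizes are made up for illustration.

```python
import torch


def build_layer_ids(num_matches_per_layer, device):
    """Create one tensor of layer ids per detection layer, directly on `device`.

    Hypothetical stand-in for the `from_which_layer.append(...)` line in loss.py.
    """
    return [
        torch.full((n,), float(i), device=device)
        for i, n in enumerate(num_matches_per_layer)
    ]


def select_foreground(from_which_layer, fg_mask_inboxes):
    """Index with a boolean mask after moving it to the indexed tensor's device."""
    return from_which_layer[fg_mask_inboxes.to(from_which_layer.device)]


# Use the GPU when one is visible, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Two layers with 2 and 3 matches (illustrative numbers only).
layer_ids = torch.cat(build_layer_ids([2, 3], device))

# The mask is created on the CPU on purpose, to mimic the mismatch.
mask = torch.tensor([True, False, True, False, True])

selected = select_foreground(layer_ids, mask)
```

Because the mask is moved to whatever device the indexed tensor already lives on, the same code path runs with --device 0 and with --device 'cpu'.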

SkalskiP commented 1 year ago

@mo-traor3-ai it seems that at the moment you are convinced that the error does not lie on our side. Let me know about the fate of this PR.

mo-traor3-ai commented 1 year ago

@SkalskiP yes, it definitely seems to stem from the YOLOv7 repo itself rather than our side. I will be sure to keep you up to date.

Leaving the issue open for now, until it is resolved, so that others can see it too.

Jacobsolawetz commented 1 year ago

@mo-traor3-ai @SkalskiP the fix is to pin a downgraded torch version! That helped @mmcquade11 fix this one on his SageMaker notebook.

Newer versions of torch initialize torch.zeros on the cpu by default --> epic fail haha

SkalskiP commented 1 year ago

@Jacobsolawetz oh! Do we know if they are going to fix that behavior? Or will they introduce a breaking change and stick with it in the future?

I took a look at the SageMaker notebook. Is this the line we are talking about?

!pip install torch==1.12.1 torchvision==0.13.1 --ignore-installed

Rohan-Python commented 1 year ago

@Jacobsolawetz you mean the lower version of torch fixed this runtime error?

mo-traor3-ai commented 1 year ago

@Rohan-Python Yes, the lower version of torch fixes the error. The YOLOv7 repo has unresolved issues with the newer torch versions, due to the torch.zeros issue mentioned.

I sent a pull request to the YOLOv7 repo to fully address the issue, but we will have to wait until they accept and merge it. For now, either the lower version of torch or directly editing the files as noted above will do the trick.

https://github.com/WongKinYiu/yolov7/pull/1283
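
For anyone unsure whether their environment needs the downgrade, the installed version can be compared against the pin before training starts. This is a rough sketch under one assumption: that 1.12.1, the version pinned in this thread, is the known-good release. Whether every later release is affected is not confirmed here. `KNOWN_GOOD`, `parse_version`, and `is_newer_than_known_good` are hypothetical names for illustration.

```python
import re

# Assumption: 1.12.1 is the torch version pinned in this thread as working.
KNOWN_GOOD = (1, 12, 1)


def parse_version(version):
    """Turn a string like '1.13.1+cu116' into (1, 13, 1), ignoring build suffixes."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def is_newer_than_known_good(version):
    """Return True when `version` is later than the pinned known-good release."""
    return parse_version(version) > KNOWN_GOOD
```

In a notebook cell this could be called as `is_newer_than_known_good(torch.__version__)` after `import torch`; a True result suggests applying the pin or the loss.py edits before running train.py.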

SkalskiP commented 1 year ago

Here is another issue breaking YOLOv7 in Google Colab: https://github.com/WongKinYiu/yolov7/issues/1280. This time it is related to the latest NumPy release.

SkalskiP commented 1 year ago

It doesn't look like the YOLOv7 team is working on solving those issues. I forked their repository and fixed both issues here. We will use that code in our YOLOv7 tutorial notebook until they add those changes to their codebase.

SkalskiP commented 1 year ago

@mo-traor3-ai and @Rohan-Python please check if the current code in the YOLOv7 notebook works for you.
