Closed: mo-traor3-ai closed this issue 1 year ago
I sent a PR to YOLOv7 for this. Found the issue was on line 685 of loss.py, plus a missing line after line 756 in loss.py.
More from the forum and Stack Overflow on this:
# replace line 685 in utils/loss.py with:
from_which_layer.append((torch.ones(size=(len(b),)) * i).to('cuda'))
# also add this line after line 756:
fg_mask_inboxes = fg_mask_inboxes.to(torch.device('cuda'))
After testing train.py with the suggested fixes to loss.py, my training runs smoothly.
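As a sketch of why the patch works (assuming the usual "Expected all tensors to be on the same device" RuntimeError): torch factory functions like torch.ones create tensors on the CPU unless told otherwise, so the per-layer index tensor ends up on the CPU while the rest of the loss tensors live on the GPU. The hypothetical helper below is not YOLOv7 code; it just shows that creating the tensor directly on the target device is equivalent to the .to('cuda') patch above:

```python
import torch

# Hypothetical minimal illustration of the device mismatch patched above.
# torch.ones() defaults to the CPU, so mixing its result with CUDA tensors
# raises "Expected all tensors to be on the same device".
def layer_index_tensor(b, i, device):
    # Equivalent to the patched line, but creates the tensor directly
    # on the target device instead of calling .to('cuda') afterwards.
    return torch.ones(size=(len(b),), device=device) * i

b = [0, 1, 2, 3]  # stand-in for the matched targets in build_targets
dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")
out = layer_index_tensor(b, 2, dev)
print(out.tolist())  # [2.0, 2.0, 2.0, 2.0]
```

Passing device= at creation avoids an extra CPU allocation plus copy, which is why it is generally preferred over a trailing .to(...).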
@mo-traor3-ai it seems that at the moment you are convinced that the error does not lie on our side. Let me know about the fate of this PR.
@SkalskiP yes it definitely seems to stem from YOLOv7 repo itself, rather than our side. I will be sure to keep you up to date.
Leaving the Issue open for now for visibility for others too until resolved.
@mo-traor3-ai @SkalskiP the fix is to downgrade the torch version! Helped @mmcquade11 fix this one on his SageMaker notebook.
Newer versions of torch initialize torch.zeros on the CPU by default --> epic fail haha
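A quick way to see the default-device behavior @Jacobsolawetz describes (a sketch, not YOLOv7 code): torch factory functions allocate on the CPU even when CUDA is available, so any code that mixes their output with GPU tensors has to move or create them on the right device explicitly.

```python
import torch

# torch factory functions (zeros, ones, empty, ...) allocate on the CPU
# by default, even when a CUDA device is available:
x = torch.zeros(3)
print(x.device)  # cpu

# Passing device= explicitly (or calling .to(...)) is what the loss.py
# patches in this thread do to keep every tensor on the same device:
dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")
y = torch.zeros(3, device=dev)
```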
@Jacobsolawetz oh! Do we know if they are going to fix that behavior? Or will they introduce a breaking change and stick with it in the future?
I took a look at the SageMaker notebook. Is this the line we are talking about?
!pip install torch==1.12.1 torchvision==0.13.1 --ignore-installed
@Jacobsolawetz you mean the lower version of torch fixed this runtime error?
@Rohan-Python Yes, the lower version of torch fixes the error. The YOLOv7 repo has unresolved issues with the newer torch versions, due to the torch.zeros issue mentioned.
I sent a pull request to the YOLOv7 repo to fully address the issue, but we will have to wait until they accept and merge it. For now, downgrading torch, or directly editing the files as noted above, will do the trick.
Here is another issue breaking YOLOv7 in Google Colab: https://github.com/WongKinYiu/yolov7/issues/1280. This time it is related to the latest numpy.
Search before asking
Notebook name
YOLOv7 PyTorch Object Detection: https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/train-yolov7-object-detection-on-custom-data.ipynb
My version: https://colab.research.google.com/drive/1Ky1iXpECpx8HJAIvnAXNhnqVy3rFixOi?usp=sharing
Bug
The YOLOv7 notebook does not train with the --device flag set to 0, 1, 2, or 3, but only when it is set to 'cpu'. This is a problem because with --device 'cpu' the batch processing time becomes very long (close to an hour), and the training is meant to run, and is most efficient, on a GPU. The --device flag is supposed to recognize the CUDA device. In this case it is in fact --device 0, but this is what causes the train.py command to fail when you run it.

Here's a Loom video of the process, walking through the lines of code the RuntimeError points out: https://www.loom.com/share/393ec489c1d54698ba9626c7e1be897b

Environment
Google Colab Pro
NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2
A100-SXM4-40GB
Minimal Reproducible Example
Additional
This is the error (and what was logged just before the error):
Are you willing to submit a PR?