ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Different crashes on linux VM and windows PC when running the Train Custom Data tutorial for yolo5 #2108

Closed davesargrad closed 3 years ago

davesargrad commented 3 years ago

❔Question

I am running the "Train Custom Data" tutorial, for the first time. I see a "terminated" condition as depicted below.

I assume this is not expected. Can you help me isolate the issue? I am running this on a Fedora VM. Although no GPU is available (as the screenshot below indicates), I would expect the training session to still run on the CPU.

  1. Could the lack of a GPU be causing what appears to be a premature termination?
  2. What output should I see on a successful run?

Additional context

image

image

image

github-actions[bot] commented 3 years ago

👋 Hello @davesargrad, thank you for your interest in 🚀 YOLOv5! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://www.ultralytics.com or email Glenn Jocher at glenn.jocher@ultralytics.com.

Requirements

Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.7. To install run:

$ pip install -r requirements.txt

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), testing (test.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu every 24 hours and on every commit.

davesargrad commented 3 years ago

image

davesargrad commented 3 years ago

image

davesargrad commented 3 years ago

It seems to crash somewhere within the call to model (I dug in, and it's in forward_once). I'll try to isolate it exactly.

image

davesargrad commented 3 years ago

It crashes at line 37... it hits this line probably about 20 times (after the call to model, at line 289 above) before crashing.

image

davesargrad commented 3 years ago

Gonna take a break for now.. hopefully you guys will have something to suggest.. this last screenshot includes a debug counter that shows a consistent crash after the 28th or 29th call to run the model

By the way, thank you for all the great work on this component. I am excited to learn more from it, and from you guys.

image

davesargrad commented 3 years ago

Hi. Though I would like to get this working on my linux VM, I'm trying as well on my windows box. There I get a completely different error as follows:

image

davesargrad commented 3 years ago

I managed to get the windows version running. I had to increase the size of my paging file for some reason.

I'd still very much like to get the linux VM version also working.

image
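
The thread never establishes why the Linux run terminates. Since the Windows run only worked after the paging file was enlarged, memory pressure is one candidate worth ruling out. The following is a minimal sketch of the kind of debug counter described above, extended to log peak memory before each call; `count_calls` and `peak_rss_mb` are hypothetical helper names, not part of YOLOv5, and the `resource` module is only available on Unix-like systems.

```python
import functools
import re
import sys

try:
    import resource  # Unix only; missing on Windows
except ImportError:
    resource = None


def peak_rss_mb():
    """Return this process's peak resident set size in MiB, or None if unavailable."""
    if resource is None:
        return None
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def count_calls(fn, log=print):
    """Wrap fn so each call logs its ordinal and the peak memory so far.

    The message is logged BEFORE fn runs, so the last line printed
    identifies the call during which the process died.
    """
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        log(f"call {wrapper.calls}: peak RSS = {peak_rss_mb()} MiB")
        return fn(*args, **kwargs)

    wrapper.calls = 0
    return wrapper
```

Assuming a module object named `m`, wrapping its forward method with `m.forward = count_calls(m.forward)` would print one line per call. A peak that climbs steadily until the process is terminated would point toward memory exhaustion; a flat peak would point elsewhere.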

davesargrad commented 3 years ago

I do see a GPU related error on the run on windows. (cudart64_101.dll not found)

image

glenn-jocher commented 3 years ago

@davesargrad TensorBoard seems to produce the cudart error on all operating systems. We investigated, but no fix is currently available. It doesn't seem to have any detrimental effect, so I would just ignore it. You can also see it in the Colab tutorial here: https://colab.research.google.com/github/ultralytics/yolov5/blob/master/tutorial.ipynb#scrollTo=1NcFxRcFdJ_O&line=1&uniqifier=1

About your OS issues, YOLOv5 is supported on all 3 main operating systems. I don't have time to help you diagnose, but I'll paste our default environmental issue reply below.

It appears you may have environment problems. Please ensure you meet all dependency requirements if you are attempting to run YOLOv5 locally. If in doubt, create a new virtual Python 3.8 environment, clone the latest repo (code changes daily), and pip install -r requirements.txt again. We also highly recommend using one of our verified environments below.

Requirements

Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.7. To install run:

$ pip install -r requirements.txt
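
A quick way to confirm the two stated minimums (Python 3.8 and torch 1.7) before digging further is sketched below. It uses only the standard library; `version_tuple`, `meets_minimum` and `check_environment` are hypothetical helper names, and the version parsing is deliberately simple, so treat the result as a hint rather than a full dependency check.

```python
import re
import sys
from importlib import metadata  # standard library from Python 3.8 onward


def version_tuple(version):
    """Turn a string such as '1.7.1+cu101' into (1, 7, 1)."""
    core = version.split("+")[0]  # drop a local build suffix like '+cu101'
    parts = []
    for piece in core.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if the installed version is at least the minimum version."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_environment():
    """Report whether Python >= 3.8 and torch >= 1.7 are present."""
    report = {"python_ok": sys.version_info >= (3, 8)}
    try:
        report["torch_version"] = metadata.version("torch")
    except metadata.PackageNotFoundError:
        report["torch_version"] = None
    report["torch_ok"] = (
        report["torch_version"] is not None
        and meets_minimum(report["torch_version"], "1.7")
    )
    return report


if __name__ == "__main__":
    print(check_environment())
```

If the import of `importlib.metadata` itself fails, the interpreter is older than 3.8, which already answers the first question.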

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are passing. These tests evaluate proper operation of basic YOLOv5 functionality, including training (train.py), testing (test.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu.

glenn-jocher commented 3 years ago

@davesargrad BTW, line 37 there in common.py is just the forward function of the basic Conv() module, which is more or less the bread and butter of the CNN architecture; there's nothing wrong with the code there. You clearly have the latest code (green check mark), so it's likely that something in your environment is preventing correct operation. Windows is always a difficult environment, unfortunately (I code on a Mac and then train on Ubuntu instances). You can run the same tutorial steps in Colab to see the correct response: https://colab.research.google.com/github/ultralytics/yolov5/blob/master/tutorial.ipynb

davesargrad commented 3 years ago

@glenn-jocher Ty very much for your kind response. I will continue to dig, however for now I seem to be making progress on a windows platform.

Can you tell me why the following detect run is not actually detecting anything? Here I am using the bus.jpg and zidane.jpg images that ship with the YOLOv5 release. I am comparing my output to the output on the Colab tutorial documentation page. They are slightly different, though my 224-layer model with 7266973 parameters matches the expected model summary.

image

What I expect to see is this:

Fusing layers... 
Model Summary: 224 layers, 7266973 parameters, 0 gradients, 17.0 GFLOPS
image 1/2 /content/yolov5/data/images/bus.jpg: 640x480 4 persons, 1 buss, 1 skateboards, Done. (0.011s)
image 2/2 /content/yolov5/data/images/zidane.jpg: 384x640 2 persons, 2 ties, Done. (0.011s)
Results saved to runs/detect/exp..

In the resulting runs/detect/exp directory, I see the two images, but they do not include the bounding boxes with a (person, bus, skateboard) identifier.

The NaNs (nan) that come out of the inference don't seem good.

image

The 25 sequentials processed inside forward_once produce either None or tensors that include some nan values: image

These same tensors also include some seemingly good data: image

tensor([[[[     nan,      nan,      nan,  ...,      nan,      nan,      nan],
          [     nan,      nan,      nan,  ...,      nan,      nan,      nan],
          [     nan,      nan,      nan,  ...,      nan,      nan,      nan],
          ...,
          [     nan,      nan,      nan,  ...,      nan,      nan,      nan],
          [     nan,      nan,      nan,  ...,      nan,      nan,      nan],
          [     nan,      nan,      nan,  ...,      nan,      nan,      nan]],

         [[ 0.50391,  0.50391,  0.50391,  ...,  0.50391,  0.50391,  0.50391],
          [ 0.50391,  0.50391,  0.50391,  ...,  0.50391,  0.50391,  0.50391],
          [ 0.50391,  0.50391,  0.50391,  ...,  0.50391,  0.50391,  0.50391],
          ...,
          [ 0.50391,  0.50391,  0.50391,  ...,  0.50391,  0.50391,  0.50391],
          [ 0.50391,  0.50391,  0.50391,  ...,  0.50391,  0.50391,  0.50391],
          [ 0.50391,  0.50391,  0.50391,  ...,  0.50391,  0.50391,  0.50391]],

         [[     nan,      nan,      nan,  ...,      nan,      nan,      nan],
          [     nan,      nan,      nan,  ...,      nan,      nan,      nan],
          [     nan,      nan,      nan,  ...,      nan,      nan,      nan],
          ...,
          [     nan,      nan,      nan,  ...,      nan,      nan,      nan],
          [     nan,      nan,      nan,  ...,      nan,      nan,      nan],
          [     nan,      nan,      nan,  ...,      nan,      nan,      nan]],

         ...,

         [[ 1.90039,  1.90039,  1.90039,  ...,  1.90039,  1.90039,  1.90039],
          [ 1.90039,  1.90039,  1.90039,  ...,  1.90039,  1.90039,  1.90039],
          [ 1.90039,  1.90039,  1.90039,  ...,  1.90039,  1.90039,  1.90039],
          ...,
          [ 1.90039,  1.90039,  1.90039,  ...,  1.90039,  1.90039,  1.90039],
          [ 1.90039,  1.90039,  1.90039,  ...,  1.90039,  1.90039,  1.90039],
          [ 1.90039,  1.90039,  1.90039,  ...,  1.90039,  1.90039,  1.90039]],

         [[     nan,      nan,      nan,  ...,      nan,      nan,      nan],
          [     nan,      nan,      nan,  ...,      nan,      nan,      nan],
          [     nan,      nan,      nan,  ...,      nan,      nan,      nan],
          ...,
          [     nan,      nan,      nan,  ...,      nan,      nan,      nan],
          [     nan,      nan,      nan,  ...,      nan,      nan,      nan],
          [     nan,      nan,      nan,  ...,      nan,      nan,      nan]],

         [[-0.13367, -0.13367, -0.13367,  ..., -0.13367, -0.13367, -0.13367],
          [-0.13367, -0.13367, -0.13367,  ..., -0.13367, -0.13367, -0.13367],
          [-0.13367, -0.13367, -0.13367,  ..., -0.13367, -0.13367, -0.13367],
          ...,
          [-0.13367, -0.13367, -0.13367,  ..., -0.13367, -0.13367, -0.13367],
          [-0.13367, -0.13367, -0.13367,  ..., -0.13367, -0.13367, -0.13367],
          [-0.13367, -0.13367, -0.13367,  ..., -0.13367, -0.13367, -0.13367]]]], device='cuda:0', dtype=torch
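
In the dump above, whole channels appear to be either entirely nan or entirely finite. A minimal sketch for confirming that pattern is shown below. It assumes the feature map has first been converted to nested Python lists (for a PyTorch tensor `x`, something like `x[0].cpu().tolist()`); `summarize_channels` and its helpers are hypothetical names, not YOLOv5 functions.

```python
import math


def any_nan(values):
    """True if any leaf in a (possibly nested) list is NaN."""
    if isinstance(values, list):
        return any(any_nan(v) for v in values)
    return isinstance(values, float) and math.isnan(values)


def all_nan(values):
    """True if every leaf in a non-empty (possibly nested) list is NaN."""
    if isinstance(values, list):
        return len(values) > 0 and all(all_nan(v) for v in values)
    return isinstance(values, float) and math.isnan(values)


def summarize_channels(feature_map):
    """Group channel indices by NaN content.

    feature_map is a nested list shaped [channels][height][width].
    """
    summary = {"all_nan": [], "some_nan": [], "clean": []}
    for index, channel in enumerate(feature_map):
        if all_nan(channel):
            summary["all_nan"].append(index)
        elif any_nan(channel):
            summary["some_nan"].append(index)
        else:
            summary["clean"].append(index)
    return summary
```

Running this on the output of each layer in turn would show the first layer at which nan values appear, which narrows down where the computation goes wrong.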

glenn-jocher commented 3 years ago

@davesargrad the NaNs must be the source of the problem. I think this is connected to CUDA version incompatibilities on Windows. There are a few open issues on this, e.g. https://github.com/ultralytics/yolov5/issues/1625. I think one user said downgrading to CUDA 10.1 solved this for him.

I'm assuming detect.py --device cpu works correctly?
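
A side-by-side comparison of the two devices can be scripted. The sketch below only builds the two command lines. `detect_command` is a hypothetical helper, and the `--source` and `--weights` values are assumptions chosen to match the sample images and the model summary quoted earlier in the thread.

```python
import sys


def detect_command(device, source="data/images", weights="yolov5s.pt"):
    """Build the argv for one detect.py run.

    device is 'cpu' or a CUDA device index such as '0'.
    The source and weights defaults are assumptions, not taken from the thread.
    """
    return [
        sys.executable, "detect.py",
        "--source", source,
        "--weights", weights,
        "--device", str(device),
    ]


if __name__ == "__main__":
    # Print both commands; each list can be passed to subprocess.run()
    # from inside the cloned yolov5 directory.
    for device in ("cpu", "0"):
        print(" ".join(detect_command(device)))
```

If the CPU run reports detections and the GPU run reports none on the same images, the model and weights are fine and the problem lies in the CUDA side of the environment.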

glenn-jocher commented 3 years ago

The Windows CI tests (on CPU) show correct detections here, both for custom trained models and official auto-downloaded models. These tests run every 24 hours. https://github.com/ultralytics/yolov5/runs/1803125015?check_suite_focus=true

Screen Shot 2021-02-01 at 12 33 45 PM

davesargrad commented 3 years ago

@glenn-jocher as you thought, when I use the CPU I get proper identification.

image

image

Thanks very much. Assuming that I continue to isolate environment problems, I will be sure to post what I learn here. Hopefully that will help contribute to your growing knowledge base.

wudashuo commented 3 years ago

Yes, there is a problem with detection on Windows, and I raised an issue about it. It seems that something is wrong with the Windows build of PyTorch 1.7, while the Linux version works fine. You can find more details in my issue #1625.

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.