I am getting nan and no predictions at all.

LightCannon commented 2 years ago

Search before asking

[X] I have searched the YOLOv5 issues and discussions and found no similar questions.

Question

Hello everyone, I am new to YOLOv5 and I have a problem whose cause I cannot figure out. I am training on a custom dataset (trying a low number of epochs first), but the box and obj losses are NaN. Also, no detections appear on the validation images.

I have used this command to train: python train.py --img 412 --batch 2 --epochs 2 --data people.yaml --cfg models\yolov5s.yaml --name pm1 --workers 6

There is another issue discussing the same problem. However, the comments there point to environment problems, and I still cannot figure out what is wrong with mine. Here is my environment:

Windows 10, 16 GB RAM
NVIDIA GeForce GTX 1660 Ti, 6144MiB
CUDA 11.3
Python 3.8
torch==1.10.0
torchaudio==0.10.0
torchvision==0.11.1

and I am working on this dataset: https://github.com/ucuapps/top-view-multi-person-tracking

I would appreciate any help fixing this problem and getting training to work. Thanks.

github-actions[bot] commented 2 years ago

👋 Hello @LightCannon, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://ultralytics.com or email Glenn Jocher at glenn.jocher@ultralytics.com.

Requirements

Python>=3.6.0 with all requirements.txt installed including PyTorch>=1.7. To get started:

$ git clone https://github.com/ultralytics/yolov5
$ cd yolov5
$ pip install -r requirements.txt

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Google Colab and Kaggle notebooks with free GPU
Google Cloud Deep Learning VM. See GCP Quickstart Guide
Amazon Deep Learning AMI. See AWS Quickstart Guide
Docker Image. See Docker Quickstart Guide

Status

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), validation (val.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu every 24 hours and on every commit.

glenn-jocher commented 2 years ago

@LightCannon this might be a Windows/conda/CUDA 11 bug in PyTorch that has been mentioned in some other issues, in which case downgrading to CUDA 10 would solve this.

Or you may have some problems with your dataset labels. Check your mosaic jpgs to ensure your labels are correct and follow the instructions here: https://docs.ultralytics.com/yolov5/tutorials/train_custom_data
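
One way to rule out label problems before training is to scan the label files. The sketch below assumes the usual YOLOv5 label format (one object per line: class x_center y_center width height, with coordinates normalized to 0-1). The helper name check_label_line is made up for illustration and is not part of YOLOv5.

```python
import math


def check_label_line(line, num_classes):
    """Return a list of problems found in one YOLO-format label line.

    Hypothetical helper, not part of YOLOv5. Assumes the format
    `class x_center y_center width height` with normalized coordinates.
    """
    parts = line.split()
    if len(parts) != 5:
        return [f"expected 5 fields, got {len(parts)}"]
    try:
        cls = int(parts[0])
        coords = [float(p) for p in parts[1:]]
    except ValueError:
        return ["could not parse an integer class and four floats"]

    problems = []
    if not 0 <= cls < num_classes:
        problems.append(f"class {cls} outside 0..{num_classes - 1}")
    if any(math.isnan(c) or math.isinf(c) for c in coords):
        problems.append("NaN or inf coordinate")
    elif any(c < 0.0 or c > 1.0 for c in coords):
        problems.append("coordinate outside [0, 1]")
    elif coords[2] == 0.0 or coords[3] == 0.0:
        problems.append("zero width or height")
    return problems
```

Running this over every line of every .txt file in the labels directory gives a quick list of suspicious files. An empty result does not prove the labels are right, it only rules out the most obvious formatting errors.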

LightCannon commented 2 years ago

I have downgraded to CUDA 10.2 and you are right, this is a bug from CUDA 11.3 and everything works now with CUDA 10.2. Thanks for your help.

Zengyf-CVer commented 2 years ago

@LightCannon Since I have not seen a screenshot of your virtual environment, I am guessing that you installed PyTorch with a plain pip install torchvision. If you want to use CUDA 11.x, you can try the command from the official PyTorch website:

pip3 install torch==1.10.0+cu113 torchvision==0.11.1+cu113 torchaudio===0.10.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html

In addition, how well CUDA 11.x works depends on the graphics card model you are using. Some graphics cards work very well with CUDA 11.x.
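
The +cu113 suffix is the build tag that records which CUDA version a PyTorch wheel was built for, so it is worth checking which build actually ended up in the environment. A minimal sketch, assuming the version string is copied from torch.__version__ or from pip list; cuda_build_tag is a hypothetical helper name, not a PyTorch function.

```python
def cuda_build_tag(torch_version):
    """Return the build tag of a torch version string, or None if it has none.

    "1.10.0+cu113" -> "cu113"; "1.10.0" -> None (the string carries no tag).
    """
    _, sep, tag = torch_version.partition("+")
    return tag if sep else None


if __name__ == "__main__":
    for version in ("1.10.0+cu113", "1.9.0+cu102", "1.10.0"):
        print(version, "->", cuda_build_tag(version))
```

A missing tag only means the string does not say which build it is; it does not prove the build is CPU-only or GPU-enabled.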

ozett commented 2 years ago

glad that i found that here, because i trained 100 epochs and was wondering about no predictions...

on ubuntu 20.04 with GeForce 1660 and nvidia-run-driver (which gave me CUDA).

looks like a bug in 11.4 and i downgraded to 11.3... still looking for a way to go further down...

ozett commented 2 years ago

Hopefully 11.1 has no nan-bug, downgrade to 10.2 on ubuntu 20.04 seems difficult..

not to forget adjusting torch-install afterwards, i guess... 👻

ozett commented 2 years ago

cuda 11.3 seems to produce nan-nan

python train.py --img 412 --batch 2 --epochs 2
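
Instead of watching the console, the NaN losses can also be spotted from the logged results. This is only a sketch: it assumes the run directory contains a results.csv with a header row and one row per epoch, and nan_columns is a made-up helper name, not part of YOLOv5.

```python
import csv
import io
import math


def nan_columns(csv_text):
    """Return the sorted names of columns that contain at least one NaN value."""
    reader = csv.reader(io.StringIO(csv_text))
    # Header names may be padded with spaces, so strip them.
    header = [name.strip() for name in next(reader)]
    bad = set()
    for row in reader:
        for name, cell in zip(header, row):
            try:
                value = float(cell)
            except ValueError:
                continue  # skip cells that are not numbers
            if math.isnan(value):
                bad.add(name)
    return sorted(bad)


if __name__ == "__main__":
    sample = "epoch, train/box_loss, train/obj_loss\n0, nan, nan\n1, 0.05, nan\n"
    print(nan_columns(sample))
```

To use it on a real run, read the results.csv of that run into a string and pass it in; the exact file location depends on the YOLOv5 version and the --name option.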

ozett commented 2 years ago

related, unsolved: https://docs.ultralytics.com/yolov5/tutorials/hyperparameter_evolution1
related, downgrade to 10.2 solved it: https://github.com/ultralytics/yolov5/issues/4084
related, downgrade to 10.2 solved it: https://github.com/ultralytics/yolov5/issues/4839

https://www.codestudyblog.com/cs2112pyc/1230044131.html says that cuDNN has problems with GeForce 16xx... will try to mess with cuDNN to see if that fixes this...

https://docs.nvidia.com/deeplearning/cudnn/release-notes/rel_8.html#rel-822

ozett commented 2 years ago

https://stackoverflow.com/questions/31326015/how-to-verify-cudnn-installation

looks like cuDNN is missing on my system. maybe that's the whole problem?
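
As far as I know, the PyTorch pip wheels ship their own CUDA runtime and cuDNN libraries, so a system-wide cuDNN check does not necessarily say what PyTorch itself uses. A small sketch that asks the installed PyTorch directly; torch_cuda_report is a hypothetical helper name, and it falls back to a short answer if torch cannot be imported.

```python
def torch_cuda_report():
    """Collect what the installed PyTorch build reports about CUDA and cuDNN."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}

    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "built_for_cuda": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
        "cudnn_version": torch.backends.cudnn.version(),  # None if cuDNN is unavailable
    }


if __name__ == "__main__":
    for key, value in torch_cuda_report().items():
        print(f"{key}: {value}")
```

If cudnn_version prints a number, PyTorch has found a cuDNN it can use, whatever the system-level check says.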

ozett commented 2 years ago

wow, now some hours of driver re-install, but pytorch 1.6 is the solution?

https://github.com/ultralytics/yolov5/issues/1749

--- long way to go....

ozett commented 2 years ago

solved. install a specific pytorch version that meets the min-requirements for yolov5 and is built for cuda10.x (install pytorch for cuda10.2 even if v11.x is on your system. that looks like the solution)

https://pytorch.org/get-started/previous-versions/

pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html

my system is now filled up with unwanted packages from all nvidia-experiments. i will have to set it up from scratch again.

maybe it only works because of the mismatch of cuda-packages 10 & 11?

i have to cross-check this with a fresh install of ubuntu, a minimal nvidia driver and cuda 11, and then a pytorch-version explicitly for cuda10

glenn-jocher commented 2 years ago

@ozett thanks for the feedback! Good to know CUDA 11.6 with driver 510.39.01 and torch==1.7.1+cu101 work well with consumer cards.

ozett commented 2 years ago

thanks for the encouragement. i am also thankful that you have an eye on almost all processes and issues here. really great. even when i rehash old stuff here. great. but you should also sleep once in a while ... :-)

this case must be special with GeForce 16xx cards.

i have to cross-check in the next days on a fresh ubuntu 20.x system whether the newest nvidia-driver and newest cuda 11.6 are sufficient for older pytorch/cuda combinations and thus fix this "nan-nan" error on the GeForce 16xx-card.

edit: i also want the trained model to run on another install. that one has a different combination of pytorch installed, without GPU (YOLOv5 🚀 v5.0-455-g59aae85 torch 1.9.1+cu102 CPU), and that causes runtime-errors. i have to sort out which versions are compatible to transfer the trained model to another system. maybe some combinations will work...

that will take some time ... and i will report here the results briefly..

ozett commented 2 years ago

TESTED with fresh install: regardless of the cuda-version actually installed on the ubuntu-os, you must download and install the cuda10.2 build of pytorch.

that worked and fixed the nan-nan error.

detailed testrun:

#FIRST: install ubuntu 20.04.3 server.iso
#SECOND: disable the nouveau-driver, otherwise the install stops:

sudo apt-get install dkms build-essential linux-headers-generic
sudo bash -c "echo blacklist nouveau > /etc/modprobe.d/blacklist-nvidia-nouveau.conf"
sudo bash -c "echo options nouveau modeset=0 >> /etc/modprobe.d/blacklist-nvidia-nouveau.conf"
cat /etc/modprobe.d/blacklist-nvidia-nouveau.conf
sudo update-initramfs -u

# nvidia driver from driver-website is older,
#install driver 470 from nvidia.download
# CUDA 11.6 includes newer driver

#Install CUDA from NVIDIA (with newer driver)
#installs cuda 11.6 with driver 510
# https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=18.04&target_type=runfile_local

# rather not install with deb(network) (way too much stuff)

# install with runfile...
wget https://developer.download.nvidia.com/compute/cuda/11.6.0/local_installers/cuda_11.6.0_510.39.01_linux.run
sudo sh cuda_11.6.0_510.39.01_linux.run

#check without reboot with nvidia-smi

#Install pip
curl -sSL https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python get-pip.py

# Install torch 1.9 (cu111 build) to use with cuda 11.6
pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html

# install yolo for training
git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

#TEST torch for CUDA 11.6:
python train.py --img 412 --batch 2 --epochs 2

# -> ERROR nan-nan

pip uninstall torch
pip uninstall torch # run this command twice

# Install torch for Cuda 10.2
pip install torch==1.9.0+cu102 torchvision==0.10.0+cu102 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html

#test
python train.py --img 412 --batch 2 --epochs 2
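
The pip commands used in this thread all follow the same pattern: a torch version, a CUDA build tag, and matching torchvision/torchaudio versions. A rough sketch that rebuilds the command from those parts; the version pairings are taken only from the commands posted above, and pip_install_command is a hypothetical helper. It does not check whether a wheel actually exists for a given combination.

```python
# torch -> (torchvision, torchaudio) pairings taken from the commands in this thread
COMPANIONS = {
    "1.7.1": ("0.8.2", "0.7.2"),
    "1.9.0": ("0.10.0", "0.9.0"),
}

WHEEL_INDEX = "https://download.pytorch.org/whl/torch_stable.html"


def pip_install_command(torch_version, cuda_tag):
    """Build the pip command for a torch version and a CUDA build tag such as "cu102"."""
    vision, audio = COMPANIONS[torch_version]
    return (
        f"pip install torch=={torch_version}+{cuda_tag} "
        f"torchvision=={vision}+{cuda_tag} torchaudio=={audio} "
        f"-f {WHEEL_INDEX}"
    )


if __name__ == "__main__":
    # Reproduces the CUDA 10.2 install command from the test run above.
    print(pip_install_command("1.9.0", "cu102"))
```

Switching between the cu111 and cu102 builds is then a one-argument change, which makes it easier to repeat the comparison on another machine.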

weeix commented 1 year ago

@ozett thank you. I'm using YOLOv8 and had the same problem. Your comments saved me from excessive head scratching.

Environment: GeForce GTX 1650, Windows 11 64-bit, driver 528.02, python 3.9

Version that works for me: torch==1.9.0+cu102

Some other versions that I tried:

torch==1.9.1+cu102 -> dependency conflict
torch==1.10.2+cu102 -> 0% GPU utilization + Could not find module '...\torchvision\image.pyd' (or one of its dependencies)
torch==1.13.1+cu116 -> NaN + [WinError 1455] The paging file is too small for this operation to complete
torch==1.13.1+cu117 -> NaN

Also, because of how Python works in Windows, I had to reduce the number of workers to 1 in order to maximize GPU utilization.

Computer vision is tough.
