ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

When using DistributedDataParallel mode, the GIoU loss and obj loss become nan #736

Closed bxhandhxb closed 4 years ago

bxhandhxb commented 4 years ago

❔Question

Hi, I've run into something weird. When I train with DistributedDataParallel on multiple GPUs, the GIoU loss and obj loss decrease at first and then suddenly become nan, but when I train the model on a single GPU the loss decreases the whole time. The batch sizes are the same. What could be the reason?

Additional context

glenn-jocher commented 4 years ago

Hello, thank you for your interest in our work! This issue seems to lack the minimum requirements for a proper response, or is insufficiently detailed for us to help you. Please note that most technical problems are due to:

If none of these apply to you, we suggest you close this issue and raise a new one using the Bug Report template, providing screenshots and minimum viable code to reproduce your issue. Thank you!

Requirements

Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.6. To install run:

$ pip install -r requirements.txt

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are passing. These tests evaluate proper operation of basic YOLOv5 functionality, including training (train.py), testing (test.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu.

bxhandhxb commented 4 years ago

I will try. Thanks for your response.

bxhandhxb commented 4 years ago

Hi,
I found that the nan value comes from the computation of the CIoU loss.

https://github.com/ultralytics/yolov5/blob/master/utils/general.py#L380

v = (4 / math.pi ** 2) * torch.pow(torch.atan(w2 / h2) - torch.atan(w1 / h1), 2)

During my training there are cases where h1 equals 0, but I'm not sure why this happens.

glenn-jocher commented 4 years ago

@bxhandhxb h1 is the target box height (from your labels). Zero-height labels are filtered out first during label caching and again during training after the images and labels are augmented.

In any case, I just tested a torch.atan() with divide by zero and the output is pi/2, so it is not responsible for your nan:

>>> import torch
>>> torch.atan(torch.tensor(1.) / torch.tensor([0.]))
tensor([1.5708])
bxhandhxb commented 4 years ago

Sorry for the late reply. I read your code carefully. I think w1 and h1 are the predicted box width and height, and when w1 and h1 are both zero, the nan occurs.

(screenshot: 2020-08-20, 4:22 PM)
glenn-jocher commented 4 years ago

@bxhandhxb ah, I see. In torch, 0 / 0 = nan and 1 / 0 = inf. w1 and h1 come from box1, which is the predicted box. So we want to add 1E-16 to all box1 widths and heights for protected division.
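
A rough sketch of what this means (the tensor values and the eps placement here are illustrative, not the repository's exact code):

import math
import torch

# Hypothetical predicted box with zero width and height, and a normal target box
w1, h1 = torch.tensor(0.), torch.tensor(0.)
w2, h2 = torch.tensor(3.), torch.tensor(4.)

# Unprotected: 0 / 0 = nan, atan(nan) = nan, and the nan propagates through the loss
v = (4 / math.pi ** 2) * torch.pow(torch.atan(w2 / h2) - torch.atan(w1 / h1), 2)
print(v)  # tensor(nan)

# Protected division: a tiny eps on the predicted width/height keeps the ratio finite
eps = 1e-16
v = (4 / math.pi ** 2) * torch.pow(torch.atan(w2 / h2) - torch.atan((w1 + eps) / (h1 + eps)), 2)
print(v)  # finite value, no nan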

glenn-jocher commented 4 years ago

@bxhandhxb I've added an eps term to the IoU function in https://github.com/ultralytics/yolov5/commit/5e0b90de8f7782b3803fa2886bb824c2336358d0, which adds 1e-12 to each box x2, y2. This should ensure that neither box1 nor box2 ever have zero widths or heights.

I believe this should make the function much more robust. Please git pull or clone a new copy and try again.

bxhandhxb commented 4 years ago

@glenn-jocher thank you ~

glenn-jocher commented 4 years ago

@bxhandhxb you're welcome! Try your same training with a new git clone and see if the error is resolved.

bxhandhxb commented 4 years ago

@glenn-jocher unfortunately, something weird occurs.

(screenshot: 2020-08-21, 3:42 PM)

Maybe it's because of the fp16 training?

bxhandhxb commented 4 years ago

I use the following training script

python -m torch.distributed.launch --nproc_per_node 6 train.py --img-size 1920 --batch-size 48 --data ./data/mydata.yaml --cfg ./models/yolov5s.yaml --weights '' --device 1,2,3,5,6,7

bxhandhxb commented 4 years ago

Maybe I should pull your Docker image and retry...

glenn-jocher commented 4 years ago

@bxhandhxb is your loss computation done in fp16 or fp32?

It's possible eps may need to be set larger, or perhaps the eps values should be moved directly into the division denominators so they don't lose precision when added to larger numbers.

glenn-jocher commented 4 years ago

It seems like even adding a 0.1 eps value to 1000 will have no effect in fp16.

>>> x = torch.zeros(1) + 1000.
>>> (x.half() + 0.1) - x.half()
tensor([0.], dtype=torch.float16)
bxhandhxb commented 4 years ago

I just ran your code and didn't modify anything. I don't know how torch.cuda.amp works... By the way, my environment is Python 3.7.7 + torch 1.6.0, because there is no Python 3.8 Docker image at https://hub.docker.com/r/pytorch/pytorch/tags.

bxhandhxb commented 4 years ago

Oh, fp32 only guarantees about 6 decimal digits of precision, and fp16 only about 3. The calculation above is correct. I set eps to 1e-6 and it works.
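
A quick sanity check of those figures (a sketch using torch.finfo, which reports each dtype's machine epsilon):

import torch

# Machine epsilon: the smallest step that still changes 1.0 in each dtype
print(torch.finfo(torch.float16).eps)  # ~9.8e-04 -> about 3 decimal digits
print(torch.finfo(torch.float32).eps)  # ~1.2e-07 -> about 6-7 decimal digits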

bxhandhxb commented 4 years ago

@bxhandhxb is your loss computation done in fp16 or fp32?

It's possible eps may need to be set larger, or perhaps the eps values should be moved directly into the division denominators so they don't lose precision when added to larger numbers.

I got it. 😸 thanks

glenn-jocher commented 4 years ago

@bxhandhxb oh, great, it works!

Still, I think I should move the eps into the fraction denominators, because there, if the denominator is 0, eps takes over and we don't have to worry about it losing precision. The way I have it set up now adds eps to the x2, y2 of each box, but if those values are already large, say 10 or 100, then eps will 'disappear' and have no effect, especially in fp16 ops. Does this make sense?
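
A small sketch of that difference in fp16 (the numbers and eps value are only illustrative):

import torch

eps = 1e-3  # illustrative eps, large enough to be representable in fp16

big = torch.tensor([100.], dtype=torch.float16)
print((big + eps) - big)   # tensor([0.]) -- eps vanishes next to an already-large coordinate

num = torch.tensor([1.], dtype=torch.float16)
zero = torch.tensor([0.], dtype=torch.float16)
print(num / (zero + eps))  # finite (~1000) instead of inf
print(num / (big + eps))   # ~0.01 -- in the normal case eps changes almost nothing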

glenn-jocher commented 4 years ago

TODO: Move eps into fraction denominators for IoU calculations.

glenn-jocher commented 4 years ago

@bxhandhxb pushed https://github.com/ultralytics/yolov5/commit/5a7d79fbe667c3162d7eacf3f65ab5ff7ef9576f to resolve the remaining nan issue in training. Please git pull and try again, and let me know if you see any more nans appear during training.

Removing TODO, assuming resolved.

bxhandhxb commented 4 years ago

@glenn-jocher hi
I trained from scratch 3 times. Each time I trained for about 400 iterations, and the total loss decreased from roughly 0.18 to 0.145. No nan loss. The following is my training command. I think this problem has been solved.

python -m torch.distributed.launch --nproc_per_node 4 train.py --img-size 1920 --batch-size 32 --data ./data/mydata.yaml --cfg ./models/yolov5s.yaml --weights '' --device 0,4,5,7

But I find the training speed is very slow. I will open a new issue to describe it in detail. 😂 Thanks for your help.

glenn-jocher commented 4 years ago

@bxhandhxb oh great, nan's have been successfully banished :)

Yu-Hang commented 3 years ago

(screenshot) I'm getting nan during training sometimes, but if I just run it again with the same parameters, the nan disappears... @glenn-jocher

glenn-jocher commented 3 years ago

@Yu-Hang your training shows increasing losses due to instabilities in your training settings. I'll post our general training guidelines below.

👋 Hello! Thanks for asking about improving training results. Most of the time good results can be obtained with no changes to the models or training settings, provided your dataset is sufficiently large and well labelled. If at first you don't get good results, there are steps you might be able to take to improve, but we always recommend users first train with all default settings before considering any changes. This helps establish a performance baseline and spot areas for improvement.

If you have questions about your training results we recommend you provide the maximum amount of information possible if you expect a helpful response, including results plots (train losses, val losses, P, R, mAP), PR curve, confusion matrix, training mosaics, test results and dataset statistics images such as labels.png. All of these are located in your project/name directory, typically yolov5/runs/train/exp.

We've put together a full guide for users looking to get the best results on their YOLOv5 trainings below.

Dataset

COCO Analysis

Model Selection

Larger models like YOLOv5x and YOLOv5x6 will produce better results in nearly all cases, but have more parameters, require more CUDA memory to train, and are slower to run. For mobile deployments we recommend YOLOv5s/m, for cloud deployments we recommend YOLOv5l/x. See our README table for a full comparison of all models.

YOLOv5 Models

Training Settings

Before modifying anything, first train with default settings to establish a performance baseline. A full list of train.py settings can be found in the train.py argparser.
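
For example, a default-settings baseline run might look like the following, where only the dataset yaml and starting weights are specified and everything else is left at its default value (adapt the paths to your project):

$ python train.py --data ./data/mydata.yaml --weights yolov5s.pt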

Further Reading

If you'd like to know more a good place to start is Karpathy's 'Recipe for Training Neural Networks', which has great ideas for training that apply broadly across all ML domains: http://karpathy.github.io/2019/04/25/recipe/

Yu-Hang commented 3 years ago

Training setting:

img-size: 1504, batch-size: 24, epochs: 350, --weights yolov5s.pt --cache --rect. I'm using the default hyperparameters. (labels.png attached)

Do you see anything that could cause this instability? I have read the docs above, but I'm not sure how to tackle this problem. Maybe hyperparameter evolution?

glenn-jocher commented 3 years ago

@Yu-Hang labels look fine, but for uniformly sized objects like you're showing you might want to disable autoanchor:

python train.py --noautoanchor
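
With the settings mentioned above, that might look something like the following (your_data.yaml is a placeholder for your dataset yaml):

$ python train.py --img-size 1504 --batch-size 24 --epochs 350 --data your_data.yaml --weights yolov5s.pt --cache --rect --noautoanchor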