ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

When using DistributedDataParallel mode, the GIoU loss and obj loss become nan #736

Closed bxhandhxb closed 4 years ago

bxhandhxb commented 4 years ago

❔Question

Hi, I've run into something weird. When I train with DistributedDataParallel on multiple GPUs, the GIoU loss and obj loss decrease at first and then suddenly become nan, but when I train the model on a single GPU the loss decreases the whole time. The batch sizes are the same. What could be the reason?

Additional context

glenn-jocher commented 4 years ago

Hello, thank you for your interest in our work! This issue seems to lack the minimum requirements for a proper response, or is insufficiently detailed for us to help you. Please note that most technical problems are due to:

If none of these apply to you, we suggest you close this issue and raise a new one using the Bug Report template, providing screenshots and minimum viable code to reproduce your issue. Thank you!

Requirements

Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.6. To install run:

$ pip install -r requirements.txt

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are passing. These tests evaluate proper operation of basic YOLOv5 functionality, including training (train.py), testing (test.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu.

bxhandhxb commented 4 years ago

I will try. Thanks for your response.

bxhandhxb commented 4 years ago

Hi,
I found that the nan value comes from the computation of the CIoU loss.

https://github.com/ultralytics/yolov5/blob/master/utils/general.py#L380

v = (4 / math.pi ** 2) * torch.pow(torch.atan(w2 / h2) - torch.atan(w1 / h1), 2)

During my training there are cases where h1 equals 0, but I'm not sure why this happens.

glenn-jocher commented 4 years ago

@bxhandhxb h1 is the target box height (from your labels). Zero-height labels are filtered out first during label caching and again during training after the images and labels are augmented.

In any case, I just tested a torch.atan() with divide by zero and the output is pi/2, so it is not responsible for your nan:

>>> import torch
>>> torch.atan(torch.tensor(1.) / torch.tensor([0.]))
tensor([1.5708])
bxhandhxb commented 4 years ago

Sorry for the late reply. I read your code carefully. I think w1 and h1 are the predicted box width and height, and when w1 and h1 are both zero, the nan occurs.

(screenshot: 2020-08-20, 4:22 PM)
glenn-jocher commented 4 years ago

@bxhandhxb ah, I see. In torch, 0 / 0 = nan and 1 / 0 = inf. w1 and h1 come from box1, which is the predicted box. So we want to add 1E-16 to all box1 widths and heights for protected division.
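
A rough sketch of what this means (the tensor values and the eps placement here are illustrative, not the repository's exact code):

import math
import torch

# Hypothetical predicted box with zero width and height, and a normal target box
w1, h1 = torch.tensor(0.), torch.tensor(0.)
w2, h2 = torch.tensor(3.), torch.tensor(4.)

# Unprotected: 0 / 0 = nan, atan(nan) = nan, and the nan propagates through the loss
v = (4 / math.pi ** 2) * torch.pow(torch.atan(w2 / h2) - torch.atan(w1 / h1), 2)
print(v)  # tensor(nan)

# Protected division: a tiny eps on the predicted width/height keeps the ratio finite
eps = 1e-16
v = (4 / math.pi ** 2) * torch.pow(torch.atan(w2 / h2) - torch.atan((w1 + eps) / (h1 + eps)), 2)
print(v)  # finite value, no nan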

glenn-jocher commented 4 years ago

@bxhandhxb I've added an eps term to the IoU function in https://github.com/ultralytics/yolov5/commit/5e0b90de8f7782b3803fa2886bb824c2336358d0, which adds 1e-12 to each box x2, y2. This should ensure that neither box1 nor box2 ever have zero widths or heights.

I believe this should make the function much more robust. Please git pull or clone a new copy and try again.

bxhandhxb commented 4 years ago

@glenn-jocher thank you ~

glenn-jocher commented 4 years ago

@bxhandhxb you're welcome! Try your same training with a new git clone and see if the error is resolved.

bxhandhxb commented 4 years ago

@glenn-jocher unfortunately, something weird occurs.

(screenshot: 2020-08-21, 3:42 PM)

Maybe it's because of the fp16 training?

bxhandhxb commented 4 years ago

I use the following training script

python -m torch.distributed.launch --nproc_per_node 6 train.py --img-size 1920 --batch-size 48 --data ./data/mydata.yaml --cfg ./models/yolov5s.yaml --weights '' --device 1,2,3,5,6,7

bxhandhxb commented 4 years ago

Maybe I should pull your Docker image and retry...

glenn-jocher commented 4 years ago

@bxhandhxb is your loss computation done in fp16 or fp32?

It's possible eps may need to be set larger, or perhaps the eps values should be moved directly into the division denominators so they don't lose precision when added to larger numbers.

glenn-jocher commented 4 years ago

It seems like even adding a 0.1 eps value to 1000 will have no effect in fp16.

>>> x = torch.zeros(1) + 1000.
>>> (x.half() + 0.1) - x.half()
tensor([0.], dtype=torch.float16)
bxhandhxb commented 4 years ago

I just ran your code and didn't modify anything. I don't know how torch.cuda.amp works... By the way, my environment is Python 3.7.7 + torch 1.6.0, because there is no Python 3.8 Docker image at https://hub.docker.com/r/pytorch/pytorch/tags.

bxhandhxb commented 4 years ago

Oh, fp32 only guarantees about 6 decimal digits of precision, and fp16 only about 3. The calculation above is correct. I set eps to 1e-6 and it works.
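
A quick sanity check of those figures (a sketch using torch.finfo, which reports each dtype's machine epsilon):

import torch

# Machine epsilon: the smallest step that still changes 1.0 in each dtype
print(torch.finfo(torch.float16).eps)  # ~9.8e-04 -> about 3 decimal digits
print(torch.finfo(torch.float32).eps)  # ~1.2e-07 -> about 6-7 decimal digits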

bxhandhxb commented 4 years ago

@bxhandhxb is your loss computation done in fp16 or fp32?

It's possible eps may need to be set larger, or perhaps the eps values should be moved directly into the division denominators so they don't lose precision when added to larger numbers.

I got it. 😸 thanks

glenn-jocher commented 4 years ago

@bxhandhxb oh, great, it works!

Still, I think I should move the eps into the fraction denominators, because there, if the denominator is 0, eps takes over and we don't have to worry about it losing precision. The way I have it set up now adds eps to the x2, y2 of each box, but if those values are already large, say 10 or 100, then eps will 'disappear' and have no effect, especially in fp16 ops. Does this make sense?
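
A small sketch of that difference in fp16 (the numbers and eps value are only illustrative):

import torch

eps = 1e-3  # illustrative eps, large enough to be representable in fp16

big = torch.tensor([100.], dtype=torch.float16)
print((big + eps) - big)   # tensor([0.]) -- eps vanishes next to an already-large coordinate

num = torch.tensor([1.], dtype=torch.float16)
zero = torch.tensor([0.], dtype=torch.float16)
print(num / (zero + eps))  # finite (~1000) instead of inf
print(num / (big + eps))   # ~0.01 -- in the normal case eps changes almost nothing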

glenn-jocher commented 4 years ago

TODO: Move eps into fraction denominators for IoU calculations.

glenn-jocher commented 4 years ago

@bxhandhxb pushed https://github.com/ultralytics/yolov5/commit/5a7d79fbe667c3162d7eacf3f65ab5ff7ef9576f to resolve the remaining nan issue in training. Please git pull and try again, and let me know if you see any more nans appear during training.

Removing TODO, assuming resolved.

bxhandhxb commented 4 years ago

@glenn-jocher hi
I trained from scratch 3 times. Each time I trained for about 400 iterations, and the total loss decreased from roughly 0.18 to 0.145. No nan loss. The following is my training command. I think this problem has been solved.

python -m torch.distributed.launch --nproc_per_node 4 train.py --img-size 1920 --batch-size 32 --data ./data/mydata.yaml --cfg ./models/yolov5s.yaml --weights '' --device 0,4,5,7

But I find the training speed is very slow. I will open a new issue to describe it in detail. 😂 Thanks for your help.

glenn-jocher commented 4 years ago

@bxhandhxb oh great, nan's have been successfully banished :)

Yu-Hang commented 3 years ago

(screenshot) I'm getting nan during training sometimes, but if I just run it again with the same parameters, the nan disappears... @glenn-jocher

glenn-jocher commented 3 years ago

@Yu-Hang your training shows increasing losses due to instabilities in your training settings. I'll post our general training guidelines below.

👋 Hello! Thanks for asking about improving training results. Most of the time good results can be obtained with no changes to the models or training settings, provided your dataset is sufficiently large and well labelled. If at first you don't get good results, there are steps you might be able to take to improve, but we always recommend users first train with all default settings before considering any changes. This helps establish a performance baseline and spot areas for improvement.

If you have questions about your training results we recommend you provide the maximum amount of information possible if you expect a helpful response, including results plots (train losses, val losses, P, R, mAP), PR curve, confusion matrix, training mosaics, test results and dataset statistics images such as labels.png. All of these are located in your project/name directory, typically yolov5/runs/train/exp.

We've put together a full guide for users looking to get the best results on their YOLOv5 trainings below.

Dataset

COCO Analysis

Model Selection

Larger models like YOLOv5x and YOLOv5x6 will produce better results in nearly all cases, but have more parameters, require more CUDA memory to train, and are slower to run. For mobile deployments we recommend YOLOv5s/m, for cloud deployments we recommend YOLOv5l/x. See our README table for a full comparison of all models.

YOLOv5 Models

Training Settings

Before modifying anything, first train with default settings to establish a performance baseline. A full list of train.py settings can be found in the train.py argparser.
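
For example, a default-settings baseline run might look like the following, where only the dataset yaml and starting weights are specified and everything else is left at its default value (adapt the paths to your project):

$ python train.py --data ./data/mydata.yaml --weights yolov5s.pt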

Further Reading

If you'd like to know more a good place to start is Karpathy's 'Recipe for Training Neural Networks', which has great ideas for training that apply broadly across all ML domains: http://karpathy.github.io/2019/04/25/recipe/

Yu-Hang commented 3 years ago

Training setting:

img-size: 1504, batch-size: 24, epochs: 350, --weights yolov5s.pt --cache --rect. I'm using the default hyperparameters. (labels.png attached)

Do you see anything that could cause this instability? I have read the docs above, but I'm not sure how to tackle this problem. Maybe hyperparameter evolution?

glenn-jocher commented 3 years ago

@Yu-Hang labels look fine, but for uniformly sized objects like you're showing you might want to disable autoanchor:

python train.py --noautoanchor
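
With the settings mentioned above, that might look something like the following (your_data.yaml is a placeholder for your dataset yaml):

$ python train.py --img-size 1504 --batch-size 24 --epochs 350 --data your_data.yaml --weights yolov5s.pt --cache --rect --noautoanchor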