Closed bxhandhxb closed 4 years ago
Hello, thank you for your interest in our work! This issue seems to lack the minimum requirements for a proper response, or is insufficiently detailed for us to help you. Please note that most technical problems are due to:
Your changes to the default repository. If your issue is not reproducible in a new git clone version of this repository we can not debug it. Before going further run this code and ensure your issue persists:
sudo rm -rf yolov5 # remove existing
git clone https://github.com/ultralytics/yolov5 && cd yolov5 # clone latest
python detect.py # verify detection
# CODE TO REPRODUCE YOUR ISSUE HERE
Your custom data. If your issue is not reproducible with COCO or COCO128 data we can not debug it. Visit our Custom Training Tutorial for guidelines on training your custom data. Examine train_batch0.jpg and test_batch0.jpg for a sanity check of training and testing data.
Your environment. If your issue is not reproducible in one of the verified environments below we can not debug it. If you are running YOLOv5 locally, ensure your environment meets all of the requirements.txt dependencies specified below.
If none of these apply to you, we suggest you close this issue and raise a new one using the Bug Report template, providing screenshots and minimum viable code to reproduce your issue. Thank you!
Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.6. To install run:
$ pip install -r requirements.txt
YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):
If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are passing. These tests evaluate proper operation of basic YOLOv5 functionality, including training (train.py), testing (test.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu.
I will try. Thanks for your response.
hi
I found that the nan value comes from the computation of the CIoU loss.
https://github.com/ultralytics/yolov5/blob/master/utils/general.py#L380
v = (4 / math.pi ** 2) * torch.pow(torch.atan(w2 / h2) - torch.atan(w1 / h1), 2)
During my training, h1 is sometimes equal to 0, but I'm not sure why this happens.
@bxhandhxb h1 is the target box height (from your labels). Zero-height labels are filtered out first during label caching and then again during training, after the images and labels are augmented.
In any case, I just tested a torch.atan() with divide by zero and the output is pi/2, so it is not responsible for your nan:
import torch
torch.atan(torch.tensor(1.) / torch.tensor([0.]))
tensor([1.5708])
Sorry for the late reply. I read your code carefully. I think w1 and h1 are the predicted box width and height, and the nan occurs when h1 and w1 are both zero.
@bxhandhxb ah I see. In torch, 0 / 0 = nan and 1 / 0 = inf. w1 and h1 come from box1, which is the predicted box. So we want to add 1E-16 to all box1 widths and heights for protected division.
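The failure mode described above can be reproduced without a GPU. The following is a minimal sketch in plain Python, not the repository's code: `ieee_div` and `aspect_term` are hypothetical helpers, `ieee_div` mimics the IEEE-754 division that torch tensors perform (plain Python raises ZeroDivisionError instead), and the eps placement shown is only one possible form of protected division.

```python
import math


def ieee_div(a, b):
    """Divide like IEEE-754 floats do: x/0 -> +-inf, 0/0 -> nan (hypothetical helper)."""
    if b != 0.0:
        return a / b
    if a == 0.0 or math.isnan(a):
        return math.nan
    return math.copysign(math.inf, a) * math.copysign(1.0, b)


def aspect_term(w1, h1, w2, h2, eps=0.0):
    """CIoU aspect-ratio term v, with an optional eps added to both heights."""
    ratio1 = ieee_div(w1, h1 + eps)  # predicted box (box1)
    ratio2 = ieee_div(w2, h2 + eps)  # target box (box2)
    return (4 / math.pi ** 2) * (math.atan(ratio2) - math.atan(ratio1)) ** 2


print(aspect_term(1.0, 0.0, 2.0, 1.0))            # 1/0 = inf, atan(inf) = pi/2: finite
print(aspect_term(0.0, 0.0, 2.0, 1.0))            # 0/0 = nan: the whole term is nan
print(aspect_term(0.0, 0.0, 2.0, 1.0, eps=1e-9))  # protected division: finite
```

Only the case where the predicted width and height are both zero produces nan. A zero height alone gives inf, which atan maps to pi/2.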
@bxhandhxb I've added an eps term to the IoU function in https://github.com/ultralytics/yolov5/commit/5e0b90de8f7782b3803fa2886bb824c2336358d0, which adds 1e-12 to each box x2, y2. This should ensure that neither box1 nor box2 ever has zero width or height.
I believe this should make the function much more robust. Please git pull or clone a new copy and try again.
@glenn-jocher thank you ~
@bxhandhxb you're welcome! Try your same training with a new git clone and see if the error is resolved.
@glenn-jocher unfortunately, something weird occurs.
Maybe because of the fp16 training?
I use the following training script:
python -m torch.distributed.launch --nproc_per_node 6 train.py --img-size 1920 --batch-size 48 --data ./data/mydata.yaml --cfg ./models/yolov5s.yaml --weights '' --device 1,2,3,5,6,7
maybe I should pull your docker image and retry......
@bxhandhxb is your loss computation done in fp16 or fp32?
It's possible eps may need to be set larger, or perhaps the eps values should be moved directly into the division denominators so they don't lose precision when added to larger numbers.
It seems like even adding a 0.1 eps value to 1000 will have no effect in fp16.
x = torch.zeros(1) + 1000.
(x.half() + 0.1) - x.half()
Out[29]: tensor([0.], dtype=torch.float16)
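The absorption shown above follows from the spacing between representable values. As a rough illustration that needs only the standard library, the sketch below rounds Python floats to fp16 or fp32 with `struct`. `round_to` and `add_in` are hypothetical helper names, and this emulates the rounding rather than running torch.

```python
import struct


def round_to(fmt, x):
    """Round x to the nearest value representable in a struct float format.

    fmt is "e" for IEEE half precision (fp16) or "f" for single precision (fp32).
    """
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


def add_in(fmt, a, b):
    """Emulate a + b carried out in the given precision (hypothetical helper)."""
    return round_to(fmt, round_to(fmt, a) + round_to(fmt, b))


for eps in (0.1, 1e-12):
    lost16 = add_in("e", 1000.0, eps) == 1000.0
    lost32 = add_in("f", 1000.0, eps) == 1000.0
    print(f"eps={eps}: absorbed in fp16={lost16}, absorbed in fp32={lost32}")
```

Near 1000, adjacent fp16 values are 0.5 apart, so an eps of 0.1 is rounded away in fp16 but survives in fp32. An eps of 1e-12 added to 1000 is lost in both formats.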
I just ran your code and didn't modify anything. I don't know how torch.cuda.amp works... By the way, my environment is Python 3.7.7 + torch 1.6.0 because there is no Python 3.8 docker image in https://hub.docker.com/r/pytorch/pytorch/tags.
Oh, fp32 only guarantees about 6 decimal digits of precision, and fp16 only about 3. The above calculation is correct. I set eps to 1e-6 and it works.
I got it. 😸 thanks
@bxhandhxb oh, great, it works!
Still, I think I should move the eps into the fraction denominator, because there, if the denominator is 0, we don't have to worry about eps losing precision. The way I have it set up now adds eps to the x2 and y2 of each box, but if these values are already large, say 10 or 100, then eps will 'disappear' and have no effect, especially for fp16 ops. Does this make sense?
TODO: Move eps into fraction denominators for IoU calculations.
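To make the two placements concrete, here is a small sketch that emulates fp16 rounding with the standard-library `struct` module instead of torch. `fp16`, `height_eps_on_coord` and `denom_eps_in_fraction` are hypothetical names, and the degenerate box with y1 == y2 == 100 is an invented example rather than data from this issue.

```python
import struct


def fp16(x):
    """Round x to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def height_eps_on_coord(y1, y2, eps):
    """Box height when eps is added to the y2 coordinate before subtracting."""
    return fp16(fp16(fp16(y2) + fp16(eps)) - fp16(y1))


def denom_eps_in_fraction(y1, y2, eps):
    """Denominator when eps is added to the height itself (h + eps)."""
    h = fp16(fp16(y2) - fp16(y1))
    return fp16(h + fp16(eps))


eps = 1e-6
print(height_eps_on_coord(100.0, 100.0, eps))    # eps is absorbed by 100, height stays 0
print(denom_eps_in_fraction(100.0, 100.0, eps))  # height 0 plus eps stays nonzero
```

In this emulation an eps of 1e-12 would underflow to zero in fp16 under either placement, so the size of eps still matters if the division itself runs in half precision.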
@bxhandhxb pushed https://github.com/ultralytics/yolov5/commit/5a7d79fbe667c3162d7eacf3f65ab5ff7ef9576f to resolve the remaining nan issue in training. Please git pull and try again, and let me know if you see any more nans appear in training.
Removing TODO, assuming resolved.
@glenn-jocher hi
I trained from scratch 3 times. Each time I trained for about 400 iterations, and the total loss decreased from approximately 0.18 to 0.145. No nan loss. The following is my training command. I think this problem has been solved.
python -m torch.distributed.launch --nproc_per_node 4 train.py --img-size 1920 --batch-size 32 --data ./data/mydata.yaml --cfg ./models/yolov5s.yaml --weights '' --device 0,4,5,7
But I find the training speed is very slow. I will open a new issue to describe it in detail. 😂 Thanks for your help.
@bxhandhxb oh great, nan's have been successfully banished :)
I sometimes get nan during training, but if I just run it again with the same parameters, the nan disappears... @glenn-jocher
@Yu-Hang your training shows increasing losses due to instabilities in your training settings. I'll post our general training guidelines below.
👋 Hello! Thanks for asking about improving training results. Most of the time good results can be obtained with no changes to the models or training settings, provided your dataset is sufficiently large and well labelled. If at first you don't get good results, there are steps you might be able to take to improve, but we always recommend users first train with all default settings before considering any changes. This helps establish a performance baseline and spot areas for improvement.
If you have questions about your training results we recommend you provide the maximum amount of information possible if you expect a helpful response, including results plots (train losses, val losses, P, R, mAP), PR curve, confusion matrix, training mosaics, test results and dataset statistics images such as labels.png. All of these are located in your project/name directory, typically yolov5/runs/train/exp.
We've put together a full guide for users looking to get the best results on their YOLOv5 trainings below.
Larger models like YOLOv5x and YOLOv5x6 will produce better results in nearly all cases, but have more parameters, require more CUDA memory to train, and are slower to run. For mobile deployments we recommend YOLOv5s/m, for cloud deployments we recommend YOLOv5l/x. See our README table for a full comparison of all models.
--weights argument. Models download automatically from the latest YOLOv5 release.
python train.py --data custom.yaml --weights yolov5s.pt  # or yolov5m.pt, yolov5l.pt, yolov5x.pt
--weights '' argument:
python train.py --data custom.yaml --weights '' --cfg yolov5s.yaml  # or yolov5m.yaml, yolov5l.yaml, yolov5x.yaml
Before modifying anything, first train with default settings to establish a performance baseline. A full list of train.py settings can be found in the train.py argparser.
--img 640, though due to the high amount of small objects in the dataset it can benefit from training at higher resolutions such as --img 1280. If there are many small objects then custom datasets will benefit from training at native or higher resolution. Best inference results are obtained at the same --img as the training was run at, i.e. if you train at --img 1280 you should also test and detect at --img 1280.
--batch-size that your hardware allows for. Small batch sizes produce poor batchnorm statistics and should be avoided.
hyp['obj'] will help reduce overfitting in those specific loss components. For an automated method of optimizing these hyperparameters, see our Hyperparameter Evolution Tutorial.
If you'd like to know more a good place to start is Karpathy's 'Recipe for Training Neural Networks', which has great ideas for training that apply broadly across all ML domains: http://karpathy.github.io/2019/04/25/recipe/
Training setting:
img_size: 1504, batch size:24, epoch: 350, --weights yolov5s.pt --cache --rect. I'm using default hyperparameters.
Do you see anything that can cause this instability? I have read the docs above, but not sure how to tackle this problem. Maybe hyperparameter evolution?
@Yu-Hang labels look fine, but for uniformly sized objects like you're showing you might want to disable autoanchor:
python train.py --noautoanchor
❔Question
Hi, I ran into something weird. When I train in DistributedDataParallel mode on multiple GPUs, the GIoU loss and obj loss decrease at first and then suddenly become nan, but when I train the model on a single GPU, the loss decreases the whole time. The batch sizes are the same. What might be the reason?
Additional context