ultralytics / yolov5

YOLOv5 πŸš€ in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0
50.47k stars · 16.28k forks

Possible Bug Training on Empty Batch? #609

Closed · ZeKunZhang1998 closed this issue 4 years ago

ZeKunZhang1998 commented 4 years ago

❔ Question

Traceback (most recent call last):
  File "train.py", line 463, in <module>
    train(hyp, tb_writer, opt, device)
  File "train.py", line 286, in train
    loss, loss_items = compute_loss(pred, targets.to(device), model)  # scaled by batch_size
  File "/content/drive/My Drive/yolov5/utils/utils.py", line 443, in compute_loss
    tcls, tbox, indices, anchors = build_targets(p, targets, model)  # targets
  File "/content/drive/My Drive/yolov5/utils/utils.py", line 542, in build_targets
    b, c = t[:, :2].long().T  # image, class
ValueError: too many values to unpack (expected 2)

Additional context

glenn-jocher commented 4 years ago

Hello, thank you for your interest in our work! This issue seems to lack the minimum requirements for a proper response, or is insufficiently detailed for us to help you. Please note that most technical problems are due to:

If none of these apply to you, we suggest you close this issue and raise a new one using the Bug Report template, providing screenshots and minimum viable code to reproduce your issue. Thank you!

Requirements

Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.6. To install run:

$ pip install -U -r requirements.txt

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Current Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are passing. These tests evaluate proper operation of basic YOLOv5 functionality, including training (train.py), testing (test.py), inference (detect.py) and export (export.py) on macOS, Windows, and Ubuntu.

acai66 commented 4 years ago

Changing the batch size will solve this issue; it occurs when a batch of images contains no objects.

glenn-jocher commented 4 years ago

@acai66 ah I see. I remember a similar issue about testing with no targets, but I believed this was resolved. Does this occur when training or testing? Can you supply code to reproduce?

MiiaBestLamia commented 4 years ago

While changing the batch size helped prolong the learning process, this issue still occurs for me. By printing the paths of the images in each batch, I can check whether they contain an object, and there is definitely at least one object on each occasion of the crash (my dataset doesn't have images with empty label files).

glenn-jocher commented 4 years ago

@MiiaBestLamia can you supply exact steps and code to reproduce this issue, following the steps outlined before (current repo, valid environment, common dataset)?

MiiaBestLamia commented 4 years ago

@glenn-jocher I'm using the most recent repository. All of the requirements are satisfied except the Python version (I'm using 3.6.9), which might be the reason for the issue, though it seems strange that the learning process works for a bit and then crashes. I have altered only one line of code in the repository, the one that prints the paths of the files used in the batch. I'm using a dataset that I may not be allowed to share, so I cannot provide you with the files. I'm using the .txt option of providing the train and val sets, I have only one class, and I'm using image size 800 with batch size 4, providing yolov5l.pt weights (downloaded from the Google Drive). It would be nice to see what @ZeKunZhang1998 is working with; maybe that can narrow down the problem.

glenn-jocher commented 4 years ago

@MiiaBestLamia I would verify your issue is reproducible in one of the environments above. That's what they're there for.

ZeKunZhang1998 commented 4 years ago

> Changing the batch size will solve this issue; it occurs when a batch of images contains no objects.

When I change to batch size = 1, the error occurs on another image instead.

ZeKunZhang1998 commented 4 years ago

> @glenn-jocher I'm using the most recent repository. All of the requirements are satisfied except the Python version (I'm using 3.6.9), which might be the reason for the issue, though it seems strange that the learning process works for a bit and then crashes. I have altered only one line of code in the repository, the one that prints the paths of the files used in the batch. I'm using a dataset that I may not be allowed to share, so I cannot provide you with the files. I'm using the .txt option of providing the train and val sets, I have only one class, and I'm using image size 800 with batch size 4, providing yolov5l.pt weights (downloaded from the Google Drive). It would be nice to see what @ZeKunZhang1998 is working with; maybe that can narrow down the problem.

I use a private dataset. When I skip the problematic batch, the problem is solved. The problematic batch does contain objects; it is not an empty image.

acai66 commented 4 years ago

Maybe I am wrong about this issue, but it disappeared when I changed to another batch size. I added print(targets.shape) in build_targets, and I got torch.Size([0, 6]) when ValueError: too many values to unpack (expected 2) occurred.
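
The torch.Size([0, 6]) observation is consistent with the unpack error. Here is a minimal sketch of the shape arithmetic (using NumPy rather than torch, and assuming, as a reconstruction rather than the exact repository code, that the pre-fix build_targets tiled targets per anchor to (na, nt, 6) and only flattened to 2-D inside an `if nt:` branch):

```python
import numpy as np

# Hypothetical reconstruction, NOT the actual yolov5 code: targets of shape
# (nt, 6) get tiled per anchor to (na, nt, 6) and are only flattened to 2-D
# when nt > 0. With nt == 0 that branch is skipped and t stays 3-D.
na, nt = 3, 0
t = np.zeros((na, nt, 6))

sliced = t[:, :2]        # shape (3, 0, 6): slicing the empty axis keeps it empty
transposed = sliced.T    # shape (6, 0, 3): .T reverses ALL axes of a 3-D array

try:
    b, c = transposed    # unpacking iterates the leading axis, which has size 6
except ValueError as e:
    print(e)             # too many values to unpack (expected 2)
```

So the "(expected 2)" failure does not require mislabeled data; an empty target tensor reaching this line in 3-D form is enough to produce it.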

glenn-jocher commented 4 years ago

@acai66 the only thing we can act on is a reproducible example in one of the verified environments.

acai66 commented 4 years ago

> @acai66 the only thing we can act on is a reproducible example in one of the verified environments.

I will reinstall PyTorch from source and fetch the latest yolov5; I will upload my dataset if this issue occurs again.

ZJU-lishuang commented 4 years ago

I meet the same problem too. I think the reason is that after the box_candidates function there are no targets left.
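
A simplified sketch of a box_candidates-style filter illustrates how this can happen (the thresholds and checks here are illustrative assumptions, not necessarily the repository's exact code): after aggressive scale/crop augmentation, every box in a batch can fail the filter, leaving zero targets.

```python
import numpy as np

def box_candidates(box1, box2, wh_thr=2, ar_thr=20, area_thr=0.1, eps=1e-16):
    # box1 = boxes before augmentation, box2 = after, both as (4, n) xyxy arrays.
    # Keep a box only if it is still wide/tall enough, kept enough of its area,
    # and is not too elongated.
    w1, h1 = box1[2] - box1[0], box1[3] - box1[1]
    w2, h2 = box2[2] - box2[0], box2[3] - box2[1]
    ar = np.maximum(w2 / (h2 + eps), h2 / (w2 + eps))  # aspect ratio
    return (w2 > wh_thr) & (h2 > wh_thr) & \
           (w2 * h2 / (w1 * h1 + eps) > area_thr) & (ar < ar_thr)

# A small object shrunk to almost nothing by augmentation fails every check:
before = np.array([[0.], [0.], [30.], [30.]])   # one 30x30 box
after  = np.array([[0.], [0.], [1.5], [1.5]])   # scaled/cropped down to 1.5x1.5
keep = box_candidates(before, after)
print(keep)  # [False] -> all boxes dropped, the image contributes no targets
```

With small batches and small objects, the odds that an entire batch ends up with no surviving targets are non-trivial, which matches the reports that larger batch sizes hide the problem.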

MiiaBestLamia commented 4 years ago

After changing some of the hyperparameters in train.py (lr0:0.001, scale:0.2), moving the project to a computer with a beefier GPU (went from 2060S to 2080Ti, using Python 3.6.9) and increasing the batch size to 8, training is functioning properly for 4 epochs now. I can still reproduce the issue with my data on the 2080Ti by launching train.py with a batch size of 4, so I suppose this issue is caused by peculiarities in data, not problems with the network/code.

acai66 commented 4 years ago

> @acai66 the only thing we can act on is a reproducible example in one of the verified environments.

Here is my dataset: https://drive.google.com/file/d/12epqSyYELm7c4mXXIIJos33KDcj1l67f/view?usp=sharing

data yaml: 2020.yaml.txt

models yaml: yolov5x_2020.yaml.txt

train command: python train.py --cfg models/yolov5x_2020.yaml --data data/2020.yaml --epochs 300 --batch-size 8 --img-size 512 512 --cache-images --weights '' --name "yolov5x_2020_default" --single-cls

This issue disappeared when I changed the batch size to 12.

ZeKunZhang1998 commented 4 years ago

When I use batch size = 4, it works.

ZeKunZhang1998 commented 4 years ago

> @acai66 the only thing we can act on is a reproducible example in one of the verified environments.

Hi, I think some data augmentation makes the boxes disappear, right? If you use a bigger batch size, this issue will disappear.

mrk230 commented 4 years ago

On a private dataset I have also had this issue with batch size = 16. I haven't fully tested further, but the dataset does include a fair amount of images without objects, for what that's worth.

buimanhlinh96 commented 4 years ago

Does that mean the whole dataset must contain objects, i.e. it goes wrong if an image isn't labelled?

glenn-jocher commented 4 years ago

@buimanhlinh96 that is not correct. COCO has over a thousand images without labels.

buimanhlinh96 commented 4 years ago

@glenn-jocher So what happened with this issue? Might it be that each batch must contain at least one labelled image?

glenn-jocher commented 4 years ago

> @acai66 the only thing we can act on is a reproducible example in one of the verified environments.
>
> Here is my dataset: https://drive.google.com/file/d/12epqSyYELm7c4mXXIIJos33KDcj1l67f/view?usp=sharing
>
> data yaml: 2020.yaml.txt
>
> models yaml: yolov5x_2020.yaml.txt
>
> train command: python train.py --cfg models/yolov5x_2020.yaml --data data/2020.yaml --epochs 300 --batch-size 8 --img-size 512 512 --cache-images --weights '' --name "yolov5x_2020_default" --single-cls
>
> This issue disappeared when I changed the batch size to 12.

@acai66 thank you! I think I can work with this. I can only debug official models though, so I will use yolov5x.yaml in place of yours. Do you yourself see the error when running on the default models?

glenn-jocher commented 4 years ago

@buimanhlinh96 I don't know, I have not tried to reproduce yet. I know test.py operates correctly on datasets without labels, I don't know about train.py. Can you provide minimum viable code to reproduce your specific issue?

glenn-jocher commented 4 years ago

@acai66 also are you able to reproduce in one of the verified environments?

buimanhlinh96 commented 4 years ago

@glenn-jocher I tried some experiments and concluded that the batch size should be greater than or equal to 8.

glenn-jocher commented 4 years ago

@buimanhlinh96 there is no constraint on batch size, so you should be able to use batch size 1 to batch size x, whatever your hardware can handle. If this is not the case then there is a bug.

acai66 commented 4 years ago

> @acai66 also are you able to reproduce in one of the verified environments?

You can try the default yolov5x.yaml. I actually just changed nc to 1 in yolov5x_2020_default.yaml.

buimanhlinh96 commented 4 years ago

@glenn-jocher Yes. Hopefully we can fix it ASAP. Love yolov5

glenn-jocher commented 4 years ago

@acai66 ah I see, of course. We actually updated train.py a few weeks back to inherit nc from the data.yaml in case of a mismatch with the model yaml nc, so you should be able to use your command with the default 80 class yolov5x.yaml as well, and it will still operate correctly.

Ok, I will try to reproduce this in a colab notebook today if I have time.

Jacobsolawetz commented 4 years ago

@glenn-jocher I'm in the same boat.

For me the bug hits right after the first epoch (which successfully completes), when moving to the second epoch.

It seems fixed by moving the batch size from 4 to 12 as suggested above (Colab runs out of memory on this dataset at 16).

glenn-jocher commented 4 years ago

@Jacobsolawetz hmm ok. Do you have a pretty sparse dataset, do you think it's possible a whole batch of 4 images might have no labels? Does the bug happen during training or testing?

glenn-jocher commented 4 years ago

@acai66 I'm able to reproduce this in a colab notebook: https://colab.research.google.com/drive/1bCFd_1fyFG8pkXkQ8MubvRSgFsb9ZPhu#scrollTo=-AVqcyhjO89V

I see this midway through the first epoch:

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
     0/299     4.78G   0.07214   0.01437         0   0.08651         3       512:  70% 105/150 [00:58<00:22,  2.02it/s]
Traceback (most recent call last):
  File "train.py", line 477, in <module>
    train(hyp, opt, device, tb_writer)
  File "train.py", line 300, in train
    loss, loss_items = compute_loss(pred, targets.to(device), model)  # scaled by batch_size
  File "/content/yolov5/utils/general.py", line 446, in compute_loss
    tcls, tbox, indices, anchors = build_targets(p, targets, model)  # targets
  File "/content/yolov5/utils/general.py", line 545, in build_targets
    b, c = t[:, :2].long().T  # image, class
ValueError: too many values to unpack (expected 2)
     0/299     4.78G   0.07214   0.01437         0   0.08651         3       512:  70% 105/150 [00:58<00:25,  1.79it/s]

glenn-jocher commented 4 years ago

@ZeKunZhang1998 @mrk230 @Jacobsolawetz @acai66 @buimanhlinh96 this issue should be resolved now in https://github.com/ultralytics/yolov5/commit/7eaf225d558c6495190e0c79a56553633a065c49. Please git pull to receive the latest updates and try again.

Let us know if you run into anymore problems, and good luck!
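
The defensive pattern behind such a fix can be sketched as follows (a minimal illustration under my own naming, not the actual commit): ensure the target tensor is 2-D before unpacking, so zero targets yield empty index arrays instead of a ValueError.

```python
import numpy as np

def unpack_image_class(t):
    # Hypothetical helper (not from the repository): flatten the per-anchor
    # target tensor to (n, 6) regardless of how many targets survived, so the
    # transpose is always (2, n) and unpacking works even for n == 0.
    t = t.reshape(-1, 6)
    b, c = t[:, :2].astype(int).T  # image index, class index
    return b, c

b, c = unpack_image_class(np.zeros((3, 0, 6)))  # empty batch -> empty indices
print(b.shape, c.shape)  # (0,) (0,)
```

With this shape guard, a batch whose augmented images contain no boxes simply contributes zero loss terms instead of crashing mid-epoch.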

glenn-jocher commented 4 years ago

@acai66 for your dataset I would recommend several changes:

  1. You have very small objects, you need to train at the highest viable resolution, even if it means using a smaller model.
  2. Start from pretrained weights for best results, but also try training from scratch to compare.
  3. Use the largest batch size that will fit into RAM.
  4. Your dataset is different enough from COCO that it may benefit from substantially different hyperparameters. See hyperparameter evolution tutorial: https://docs.ultralytics.com/yolov5

acai66 commented 4 years ago

> @acai66 for your dataset I would recommend several changes:
>
>   1. You have very small objects, you need to train at the highest viable resolution, even if it means using a smaller model.
>   2. Start from pretrained weights for best results, but also try training from scratch to compare.
>   3. Use the largest batch size that will fit into RAM.
>   4. Your dataset is different enough from COCO that it may benefit from substantially different hyperparameters. See hyperparameter evolution tutorial: https://docs.ultralytics.com/yolov5

Thank you very much for your recommendations; I will try them. This issue was solved after pulling the latest commits.

Jacobsolawetz commented 4 years ago

@glenn-jocher yes... after introspection, there are maybe 6 or so images in the dataset of 500 that do not have annotations. A random grouping of those may have caused the crash.

Thanks for fixing this bug so quickly!

buimanhlinh96 commented 4 years ago

@glenn-jocher Thank you very much!!!!!!!

anhnktp commented 4 years ago

@glenn-jocher I am also facing the same issue (cloned latest code). Maybe the bug still remains; it's quite strange because it can train to the final epoch before the error happens. I trained with yolov5-s.yaml and batch-size=100 (maybe it is too large?) on 2 RTX 2080Ti GPUs. Every image contains at least one object.

[Screenshot attached: Screen Shot 2020-08-06 at 09 37 37]

glenn-jocher commented 4 years ago

@anhnktp no, you are incorrect, you are not using the latest code. L545 no longer contains the same code, so the error message you see is not possible to produce in origin/master.

anhnktp commented 4 years ago

@glenn-jocher Oh, I see. It is the yolov5 version from 2 days ago; you have since added some code. I'll recheck. Thank you.

Kachasukintim commented 3 years ago

Hello yolov5 developers, I would like you to apply the same fix to yolov4 PyTorch in Google Colab. I tried it and yolov4 has the same problem. Please help me; thank you in advance.

glenn-jocher commented 3 years ago

@Kachasukintim πŸ‘‹ hi, thanks for letting us know about this problem with YOLOv5 πŸš€. We've created a few short guidelines below to help users provide what we need in order to get started investigating a possible problem.

How to create a Minimal, Reproducible Example

When asking a question, people will be better able to provide help if you provide code that they can easily understand and use to reproduce the problem. This is referred to by community members as creating a minimum reproducible example. Your code that reproduces the problem should be:

In addition to the above requirements, for Ultralytics to provide assistance your code should be:

If you believe your problem meets all of the above criteria, please close this issue and raise a new one using the πŸ› Bug Report template and providing a minimum reproducible example to help us better understand and diagnose your problem.

Thank you! πŸ˜ƒ