Hello @junglezhao, thank you for your interest in our work! Please visit our Custom Training Tutorial to get started, and see our Google Colab Notebook, Docker Image, and GCP Quickstart Guide for example environments.
If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we cannot help you.
use --augment
@junglezhao I would make sure your code is up to date using git pull, and if the issue persists please provide minimum reproducible example code.
@qtw1998 @junglezhao yes, an augment boolean can be passed to the model() forward method to run augmented inference for higher recall and better mAP, but it is not a required argument, as a default value of False is supplied. You can also run augmented inference from the command line with the --augment argparser argument:
python3 test.py --augment
python3 detect.py --augment
https://github.com/ultralytics/yolov3/blob/4c4f4f4dd465ea11d53306239ff59284420cb207/models.py#L232
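For anyone wondering what this looks like in code, here is a minimal sketch of calling the model with augmented inference enabled (weight loading is omitted, and the cfg path and tensor shape are only illustrative; the model is used in eval mode as test.py does):

```python
import torch
from models import Darknet  # model definition from this repo's models.py

model = Darknet('cfg/yolov3.cfg', img_size=416)  # weight loading omitted for brevity
model.eval()

imgs = torch.zeros((1, 3, 416, 416))  # dummy batch: 1 image, 3 x 416 x 416

with torch.no_grad():
    # augment defaults to False; passing True runs augmented inference
    # (extra flipped/scaled passes merged), trading speed for recall/mAP.
    inf_out, train_out = model(imgs, augment=True)
```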
use
--augment
OK, thanks. I chose to re-download the repo and reset the config to solve this problem.
I am also getting a similar error. I followed the instructions for training yolov3 on a custom dataset and prepared my custom dataset in the required format. When I start training, I get the following error:
Traceback (most recent call last):
File "train.py", line 422, in <module>
train() # train normally
File "train.py", line 317, in train
dataloader=testloader)
File "/home/rajat/Desktop/Radspot/Object_detection/yolov3/test.py", line 94, in test
inf_out, train_out = model(imgs, augment=augment) # inference and training outputs
File "/home/rajat/Desktop/Radspot/Object_detection/yolov3/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/rajat/Desktop/Radspot/Object_detection/yolov3/venv/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 449, in forward
outputs = self.parallel_apply(self._module_copies[:len(inputs)], inputs, kwargs)
File "/home/rajat/Desktop/Radspot/Object_detection/yolov3/venv/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 474, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/rajat/Desktop/Radspot/Object_detection/yolov3/venv/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
output.reraise()
File "/home/rajat/Desktop/Radspot/Object_detection/yolov3/venv/lib/python3.6/site-packages/torch/_utils.py", line 394, in reraise
raise self.exc_type(msg)
TypeError: Caught TypeError in replica 3 on device 3.
Original Traceback (most recent call last):
File "/home/rajat/Desktop/Radspot/Object_detection/yolov3/venv/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
output = module(*input, **kwargs)
File "/home/rajat/Desktop/Radspot/Object_detection/yolov3/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
TypeError: forward() missing 1 required positional argument: 'x'
I think the error occurs during testing, but I have no idea why it is occurring. What can be the reason for this error?
I encountered the same bug when testing (on 8 GPUs), in the last minibatch to be precise. A workaround is to skip the last test iteration: not really the definitive solution, but it works.
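For anyone who wants to try that workaround while a proper fix is pending, here is a rough sketch of skipping the final test batch (the argument names mirror test.py, but this is only an illustration, not a change that exists in the repo):

```python
import torch

def evaluate_skipping_last_batch(model, dataloader, device, augment=False):
    """Workaround only: run inference on every batch except the final one, which
    may hold fewer images than there are GPUs under nn.DataParallel. The skipped
    images are consequently excluded from the mAP computation."""
    num_batches = len(dataloader)
    for batch_i, (imgs, targets, paths, shapes) in enumerate(dataloader):
        if batch_i == num_batches - 1:  # drop the last (possibly ragged) batch
            break
        with torch.no_grad():
            inf_out, train_out = model(imgs.to(device), augment=augment)
        # ... accumulate detection statistics here, as in test.py ...
```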
@leoll2 @Rajat-Mehta your code may be out of date, I would advise a git pull
or to reclone the current repo.
@glenn-jocher I already tried to pull the latest code. That did not solve my problem.
This error is encountered when training and testing on multiple GPUs; I tried training on a single GPU and that resolved my error.
@Rajat-Mehta ok thank you. Are you able to reproduce the error on an open dataset like coco64.data? If so please send us exact code to reproduce and we can get started debugging it.
I updated PyTorch from 1.4 to 1.5, and now training does not work on multiple GPUs even on the COCO dataset. Training works fine when I use a single GPU.
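If it helps anyone isolate the problem, a simple way to pin a run to one GPU for comparison (standard CUDA environment handling, not something specific to this repo) is:

```python
import os

# Expose only GPU 0 to this process; this must be set before torch initializes CUDA.
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

import torch
print(torch.cuda.device_count())  # should now report 1
```

Equivalently, you can prefix the training command with CUDA_VISIBLE_DEVICES=0 in the shell.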
To access an up-to-date working environment (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled), consider one of the environments linked above: the Google Colab Notebook, Docker Image, or GCP Quickstart Guide.
@glenn-jocher, @junglezhao, @leoll2 I can confirm that this bug still exists. We are using an up-to-date repo and we get exactly the same error using 4 GPUs, at exactly the same point: testing the last minibatch. The problem does not occur when using one or two GPUs.
Here is the full trace:
Class Images Targets P R mAP@0.5 F1: 100% 1548/1549 [22:49<00:01, 1.01s/it]
Traceback (most recent call last):
File "/root/.trains/venvs-builds/3.6/task_repository/yolov3_training.git/test.py", line 98, in test
inf_out, train_out = model(imgs, augment=augment) # inference and training outputs
File "/root/.trains/venvs-builds/3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/root/.trains/venvs-builds/3.6/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 449, in forward
outputs = self.parallel_apply(self._module_copies[:len(inputs)], inputs, kwargs)
File "/root/.trains/venvs-builds/3.6/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 474, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/root/.trains/venvs-builds/3.6/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
output.reraise()
File "/root/.trains/venvs-builds/3.6/lib/python3.6/site-packages/torch/_utils.py", line 394, in reraise
raise self.exc_type(msg)
TypeError: Caught TypeError in replica 3 on device 3.
Original Traceback (most recent call last):
File "/root/.trains/venvs-builds/3.6/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
output = module(*input, **kwargs)
File "/root/.trains/venvs-builds/3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
TypeError: forward() missing 1 required positional argument: 'x'
Any suggestion other than @leoll2's of skipping the last iteration?
@berkerlogoglu thanks. Can you reproduce this error in a common environment (i.e. the docker image or a gcp vm) on an open dataset like coco64.data?
Without this we cannot debug.
Hi @glenn-jocher,
I am working with @berkerlogoglu. I have tried your Docker image. Here are my observations:
The comment by @berkerlogoglu, https://github.com/ultralytics/yolov3/issues/1074#issuecomment-623629249, used a custom validation set of approximately 99k images in a different Docker container. Following your suggestion, I tried it with your Docker image and the error occurred again.
After that, I tried coco64.data and no error occurred. I thought the error only appears with very large datasets, so I tried a custom COCO validation set of approximately 2k images.
First, I used a batch size of 16, which gives 125 batches to process; no error occurred. Then I used a batch size of 2, which gives 1000 batches, and the same error occurred.
The error log is:
Traceback (most recent call last):
File "train.py", line 475, in <module>
train() # train normally
File "train.py", line 349, in train
dataloader=testloader)
File "/root/.trains/venvs-builds/3.6/task_repository/yolov3_training.git/test.py", line 101, in test
inf_out, train_out = model(imgs, augment=augment) # inference and training outputs
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 449, in forward
outputs = self.parallel_apply(self._module_copies[:len(inputs)], inputs, kwargs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 474, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
output.reraise()
File "/opt/conda/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
raise self.exc_type(msg)
TypeError: Caught TypeError in replica 2 on device 2.
Original Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
output = module(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
TypeError: forward() missing 1 required positional argument: 'x'
I hope we can find a way to solve this problem. Thanks.
@kaanakan I don't think the dataset size plays a big role. I got the error on a relatively small dataset (~2500 images, 400 validation).
@kaanakan @leoll2 ok, thanks. I need to be able to reproduce the issue with a common dataset, otherwise we cannot debug it.
From what I gather above, the error only appears during 4-GPU testing of specific datasets. It is not reproducible on coco64.data. Is it reproducible with coco2017.data or coco2014.data?
@joel5638 as I've mentioned to the others, if you can reproduce this error in a reproducible environment with a reproducible dataset then we can debug. i.e. send us a google colab notebook producing the error on coco if you can.
So the fix is just to add a check on the batch size: read batch_sz = imgs.size()[0] and only run the forward pass (under torch.no_grad()) when batch_size == batch_sz.
I tried to push but was not able to. For some reason the size of the last batch is not equal to the original batch_size, which is why the error occurs.
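Spelled out, the proposed guard would look roughly like this (argument names mirror test.py; the net effect is the same as skipping the last iteration, since the ragged batch and its images are dropped from the mAP computation):

```python
import torch

def evaluate_full_batches_only(model, dataloader, device, batch_size, augment=False):
    """Sketch of the batch-size check proposed above: only batches whose actual
    size equals the nominal batch_size are forwarded through the model."""
    for imgs, targets, paths, shapes in dataloader:
        batch_sz = imgs.size(0)      # actual number of images in this batch
        if batch_sz != batch_size:   # ragged final batch -> skip it
            continue
        with torch.no_grad():
            inf_out, train_out = model(imgs.to(device), augment=augment)
        # ... accumulate detection statistics here, as in test.py ...
```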
@joel5638 can you please make the changes in the code? Thanks.
@Hidayat722 it's normal for batch sizes to vary; that should not cause a bug. We cannot implement your proposed fix, as it would omit mAP computations on the last batch. If you can reproduce this error, please reproduce it in a Colab notebook on COCO so that we can run it ourselves and debug.
Hello, I also encountered a similar problem. This error occurs when using multiple GPUs for training and testing. Is it caused by using different kinds of GPUs? These are the details of my devices:
Using CUDA
device0 _CudaDeviceProperties(name='TITAN V', total_memory=12058MB)
device1 _CudaDeviceProperties(name='GeForce GTX 1080 Ti', total_memory=11172MB)
device2 _CudaDeviceProperties(name='GeForce GTX 1080 Ti', total_memory=11172MB)
device3 _CudaDeviceProperties(name='TITAN V', total_memory=12058MB)
@leiyuncong1202 it is not recommended to use different types of GPUs together. In your case you might want to use --device 0,3 for example, or --device 1,2.
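To see which index corresponds to which card before picking --device values, a quick check with standard PyTorch calls:

```python
import torch

# Print each CUDA device index with its name, e.g. 0: TITAN V, 1: GeForce GTX 1080 Ti, ...
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```

Then pass matching indices to the training command, e.g. python3 train.py --device 0,3 as suggested above.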
According to your suggestion, my problem has been solved. Thank you~
I also came across this error today when testing using 3 GPUs.
TypeError: forward() missing 1 required positional argument: 'x'
Edit: I want to note that the issue seems to be related to the batch size. A batch size of 18 works, but a batch size of 21 does not. Here is a similar issue from another repo: https://github.com/Eromera/erfnet_pytorch/issues/2
@linzzzzzz best practice is to use an even number of GPUs whenever you use more than one.
@glenn-jocher Thanks for the suggestion :)
I don't know if this is still relevant. I am currently working on a project that needs the archive branch. I ran into this problem and found a workaround: just reconstruct your train.txt and test.txt files (with a different ratio, or simply randomize them again).
Hypothesis: I haven't run a controlled experiment with the COCO dataset yet, but from reading the comments and some experiments with my own data, it might have to do with the number of images in the last batch of the test dataset. This comment might not be relevant if this has already been solved in the master branch.
If the hypothesis is true, the workaround could be as simple as deleting one or two image lines from test.txt or train.txt.
Edit: After some more experiments, I think the source of the bug is that the last test batch does not have enough inputs to fill all the GPUs. @glenn-jocher This bug can be reproduced when (number of test samples % batch size) < number of GPUs. For example: number of test samples = 25, batch size = 24 (3x8), number of GPUs = 3. Since the last batch only has 1 image, the forward call will be missing its input argument on the other two GPUs. Fix: if the GPU count is low, simply add or delete a few samples so the last batch fits the GPU count. If the GPU count is high, well...
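For convenience, the reproduction condition described above can be written as a small check (the lower bound of zero is added here because a remainder of 0 means the last batch is full; the rest follows the comment's hypothesis, which is not a confirmed rule):

```python
def last_batch_starves_gpus(num_test_samples, batch_size, num_gpus):
    """True if the final test batch holds fewer images than there are GPUs,
    so some nn.DataParallel replicas would receive no input at all."""
    remainder = num_test_samples % batch_size
    return 0 < remainder < num_gpus

# Example from the comment above: 25 samples, batch size 24 (3 GPUs x 8), 3 GPUs.
print(last_batch_starves_gpus(25, 24, 3))  # True -> the error is expected
```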
@tommyma0402 thanks for sharing your findings! Your investigation and insights are valuable for the community. This indeed seems like a valid hypothesis and a practical workaround for this issue. Your thorough experiment and proposed fix can help others who encounter the same problem. Keep up the great work!
🐛 Bug
Hi guys, when test.py runs at 99%, it hits an error like the following:
(I didn't change the file...)