Hello @junglezhao, thank you for your interest in our work! Please visit our Custom Training Tutorial to get started, and see our Google Colab Notebook, Docker Image, and GCP Quickstart Guide for example environments.
If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we cannot help you.
use --augment
@junglezhao I would make sure your code is up to date using git pull, and if the issue persists please provide minimum reproducible example code.
@qtw1998 @junglezhao yes, an augment boolean can be passed to the model() forward method to run augmented inference for higher recall and better mAP, but it is not a required argument, as a default value of False is supplied. You can also run augmented inference from the command line with the --augment argparser argument:
python3 test.py --augment
python3 detect.py --augment
https://github.com/ultralytics/yolov3/blob/4c4f4f4dd465ea11d53306239ff59284420cb207/models.py#L232
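For anyone wondering what this looks like in code, here is a minimal sketch of calling the model with augmented inference enabled (weight loading is omitted, and the cfg path and tensor shape are only illustrative; the model is used in eval mode as test.py does):

```python
import torch
from models import Darknet  # model definition from this repo's models.py

model = Darknet('cfg/yolov3.cfg', img_size=416)  # weight loading omitted for brevity
model.eval()

imgs = torch.zeros((1, 3, 416, 416))  # dummy batch: 1 image, 3 x 416 x 416

with torch.no_grad():
    # augment defaults to False; passing True runs augmented inference
    # (extra flipped/scaled passes merged), trading speed for recall/mAP.
    inf_out, train_out = model(imgs, augment=True)
```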
use
--augment
OK, thanks. I chose to re-download the repo and reset the config to solve this problem.
I am also getting a similar error. I followed the instructions for training yolov3 on a custom dataset and prepared my custom dataset in the required format. When I start training, I get the following error:
Traceback (most recent call last):
File "train.py", line 422, in <module>
train() # train normally
File "train.py", line 317, in train
dataloader=testloader)
File "/home/rajat/Desktop/Radspot/Object_detection/yolov3/test.py", line 94, in test
inf_out, train_out = model(imgs, augment=augment) # inference and training outputs
File "/home/rajat/Desktop/Radspot/Object_detection/yolov3/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/rajat/Desktop/Radspot/Object_detection/yolov3/venv/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 449, in forward
outputs = self.parallel_apply(self._module_copies[:len(inputs)], inputs, kwargs)
File "/home/rajat/Desktop/Radspot/Object_detection/yolov3/venv/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 474, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/rajat/Desktop/Radspot/Object_detection/yolov3/venv/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
output.reraise()
File "/home/rajat/Desktop/Radspot/Object_detection/yolov3/venv/lib/python3.6/site-packages/torch/_utils.py", line 394, in reraise
raise self.exc_type(msg)
TypeError: Caught TypeError in replica 3 on device 3.
Original Traceback (most recent call last):
File "/home/rajat/Desktop/Radspot/Object_detection/yolov3/venv/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
output = module(*input, **kwargs)
File "/home/rajat/Desktop/Radspot/Object_detection/yolov3/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
TypeError: forward() missing 1 required positional argument: 'x'
I think the error occurs during testing, but I have no idea why it is occurring. What can be the reason for this error?
I encountered the same bug when testing (on 8 GPUs), in the last minibatch to be precise. A workaround is to skip the last test iteration: not really the definitive solution, but it works.
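For anyone who wants to try that workaround while a proper fix is pending, here is a rough sketch of skipping the final test batch (the argument names mirror test.py, but this is only an illustration, not a change that exists in the repo):

```python
import torch

def evaluate_skipping_last_batch(model, dataloader, device, augment=False):
    """Workaround only: run inference on every batch except the final one, which
    may hold fewer images than there are GPUs under nn.DataParallel. The skipped
    images are consequently excluded from the mAP computation."""
    num_batches = len(dataloader)
    for batch_i, (imgs, targets, paths, shapes) in enumerate(dataloader):
        if batch_i == num_batches - 1:  # drop the last (possibly ragged) batch
            break
        with torch.no_grad():
            inf_out, train_out = model(imgs.to(device), augment=augment)
        # ... accumulate detection statistics here, as in test.py ...
```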
@leoll2 @Rajat-Mehta your code may be out of date, I would advise a git pull
or to reclone the current repo.
@glenn-jocher I already tried to pull the latest code. That did not solve my problem.
This error is encountered when training and testing on multiple GPUs; I tried training on a single GPU and that resolved my error.
@Rajat-Mehta ok thank you. Are you able to reproduce the error on an open dataset like coco64.data? If so please send us exact code to reproduce and we can get started debugging it.
I updated PyTorch from 1.4 to 1.5, and now training does not work on multiple GPUs even on the COCO dataset. Training works fine when I use a single GPU.
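If it helps anyone isolate the problem, a simple way to pin a run to one GPU for comparison (standard CUDA environment handling, not something specific to this repo) is:

```python
import os

# Expose only GPU 0 to this process; this must be set before torch initializes CUDA.
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

import torch
print(torch.cuda.device_count())  # should now report 1
```

Equivalently, you can prefix the training command with CUDA_VISIBLE_DEVICES=0 in the shell.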
To access an up-to-date working environment (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled), consider one of the environments linked above: the Google Colab Notebook, Docker Image, or GCP Quickstart Guide.
@glenn-jocher, @junglezhao, @leoll2 I can confirm that this bug still exists. We are using an up-to-date repo and we get exactly the same error using 4 GPUs, at exactly the same point: testing the last minibatch. The problem does not occur when using one or two GPUs.
Here is the full trace:
Class Images Targets P R mAP@0.5 F1: 100% 1548/1549 [22:49<00:01, 1.01s/it]
Traceback (most recent call last):
File "/root/.trains/venvs-builds/3.6/task_repository/yolov3_training.git/test.py", line 98, in test
inf_out, train_out = model(imgs, augment=augment) # inference and training outputs
File "/root/.trains/venvs-builds/3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/root/.trains/venvs-builds/3.6/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 449, in forward
outputs = self.parallel_apply(self._module_copies[:len(inputs)], inputs, kwargs)
File "/root/.trains/venvs-builds/3.6/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 474, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/root/.trains/venvs-builds/3.6/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
output.reraise()
File "/root/.trains/venvs-builds/3.6/lib/python3.6/site-packages/torch/_utils.py", line 394, in reraise
raise self.exc_type(msg)
TypeError: Caught TypeError in replica 3 on device 3.
Original Traceback (most recent call last):
File "/root/.trains/venvs-builds/3.6/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
output = module(*input, **kwargs)
File "/root/.trains/venvs-builds/3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
TypeError: forward() missing 1 required positional argument: 'x'
Any suggestion other than @leoll2's of skipping the last iteration?
@berkerlogoglu thanks. Can you reproduce this error in a common environment (i.e. the docker image or a gcp vm) on an open dataset like coco64.data?
Without this we cannot debug.
Hi @glenn-jocher,
I am working with @berkerlogoglu. I have tried your Docker image. Here are my observations:
The comment by @berkerlogoglu, https://github.com/ultralytics/yolov3/issues/1074#issuecomment-623629249, used a custom validation set of approximately 99k images in a different Docker container. Following your suggestion, I tried it with your Docker image and the error occurred again.
After that, I tried coco64.data and no error occurred. I thought the error only appears with very large datasets, so I tried a custom COCO validation set of approximately 2k images.
First, I used a batch size of 16, which gives 125 batches to process; no error occurred. Then I used a batch size of 2, which gives 1000 batches, and the same error occurred.
The error log is:
Traceback (most recent call last):
File "train.py", line 475, in <module>
train() # train normally
File "train.py", line 349, in train
dataloader=testloader)
File "/root/.trains/venvs-builds/3.6/task_repository/yolov3_training.git/test.py", line 101, in test
inf_out, train_out = model(imgs, augment=augment) # inference and training outputs
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 449, in forward
outputs = self.parallel_apply(self._module_copies[:len(inputs)], inputs, kwargs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 474, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
output.reraise()
File "/opt/conda/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
raise self.exc_type(msg)
TypeError: Caught TypeError in replica 2 on device 2.
Original Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
output = module(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
TypeError: forward() missing 1 required positional argument: 'x'
I hope we can find a way to solve this problem. Thanks.
@kaanakan I don't think the dataset size plays a big role. I got the error on a relatively small dataset (~2500 images, 400 validation).
@kaanakan @leoll2 ok, thanks. I need to be able to reproduce the issue with a common dataset, otherwise we cannot debug it.
From what I gather above, the error only appears during 4-GPU testing of specific datasets. It is not reproducible on coco64.data. Is it reproducible with coco2017.data or coco2014.data?
@joel5638 as I've mentioned to the others, if you can reproduce this error in a reproducible environment with a reproducible dataset then we can debug. i.e. send us a google colab notebook producing the error on coco if you can.
So the fix is just to add a check on the batch size: read batch_sz = imgs.size()[0] and only run the forward pass (under torch.no_grad()) when batch_size == batch_sz.
I tried to push but was not able to. For some reason the size of the last batch is not equal to the original batch_size, which is why the error occurs.
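Spelled out, the proposed guard would look roughly like this (argument names mirror test.py; the net effect is the same as skipping the last iteration, since the ragged batch and its images are dropped from the mAP computation):

```python
import torch

def evaluate_full_batches_only(model, dataloader, device, batch_size, augment=False):
    """Sketch of the batch-size check proposed above: only batches whose actual
    size equals the nominal batch_size are forwarded through the model."""
    for imgs, targets, paths, shapes in dataloader:
        batch_sz = imgs.size(0)      # actual number of images in this batch
        if batch_sz != batch_size:   # ragged final batch -> skip it
            continue
        with torch.no_grad():
            inf_out, train_out = model(imgs.to(device), augment=augment)
        # ... accumulate detection statistics here, as in test.py ...
```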
@joel5638 can you please make the changes in the code? Thanks.
@Hidayat722 it's normal for batch sizes to vary; that should not cause a bug. We cannot implement your proposed fix, as it would omit mAP computations on the last batch. If you can reproduce this error, please reproduce it in a Colab notebook on COCO so that we can run it ourselves and debug.
Hello, I also encountered a similar problem. This error occurs when using multiple GPUs for training and testing. Is it caused by using different kinds of GPUs? These are the details of my devices:
Using CUDA
device0 _CudaDeviceProperties(name='TITAN V', total_memory=12058MB)
device1 _CudaDeviceProperties(name='GeForce GTX 1080 Ti', total_memory=11172MB)
device2 _CudaDeviceProperties(name='GeForce GTX 1080 Ti', total_memory=11172MB)
device3 _CudaDeviceProperties(name='TITAN V', total_memory=12058MB)
@leiyuncong1202 it is not recommended to use different types of GPUs together. In your case you might want to use --device 0,3 for example, or --device 1,2.
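To see which index corresponds to which card before picking --device values, a quick check with standard PyTorch calls:

```python
import torch

# Print each CUDA device index with its name, e.g. 0: TITAN V, 1: GeForce GTX 1080 Ti, ...
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```

Then pass matching indices to the training command, e.g. python3 train.py --device 0,3 as suggested above.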
According to your suggestion, my problem has been solved. Thank you~
I also came across this error today when testing using 3 GPUs.
TypeError: forward() missing 1 required positional argument: 'x'
Edit: I want to note that the issue seems to be related to the batch size. A batch size of 18 works, but a batch size of 21 does not. Here is a similar issue from another repo: https://github.com/Eromera/erfnet_pytorch/issues/2
@linzzzzzz best practice is to use an even number of GPUs whenever you use more than one.
@glenn-jocher Thanks for the suggestion :)
I don't know if this is still relevant. I am currently working on a project that needs the archive branch. I ran into this problem and found a workaround: just reconstruct your train.txt and test.txt files (with a different ratio, or simply randomize them again).
Hypothesis: I haven't run a controlled experiment with the COCO dataset yet, but from reading the comments and some experiments with my own data, it might have to do with the number of images in the last batch of the test dataset. This comment might not be relevant if this has already been solved in the master branch.
If the hypothesis is true, the workaround could be as simple as deleting one or two image lines from test.txt or train.txt.
Edit: After some more experiments, I think the source of the bug is that the last test batch does not have enough inputs to fill all the GPUs. @glenn-jocher This bug can be reproduced when (number of test samples % batch size) < number of GPUs. For example: number of test samples = 25, batch size = 24 (3x8), number of GPUs = 3. Since the last batch only has 1 image, the forward call will be missing its input argument on the other two GPUs. Fix: if the GPU count is low, simply add or delete a few samples so the last batch fits the GPU count. If the GPU count is high, well...
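For convenience, the reproduction condition described above can be written as a small check (the lower bound of zero is added here because a remainder of 0 means the last batch is full; the rest follows the comment's hypothesis, which is not a confirmed rule):

```python
def last_batch_starves_gpus(num_test_samples, batch_size, num_gpus):
    """True if the final test batch holds fewer images than there are GPUs,
    so some nn.DataParallel replicas would receive no input at all."""
    remainder = num_test_samples % batch_size
    return 0 < remainder < num_gpus

# Example from the comment above: 25 samples, batch size 24 (3 GPUs x 8), 3 GPUs.
print(last_batch_starves_gpus(25, 24, 3))  # True -> the error is expected
```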
@tommyma0402 thanks for sharing your findings! Your investigation and insights are valuable for the community. This indeed seems like a valid hypothesis and a practical workaround for this issue. Your thorough experiment and proposed fix can help others who encounter the same problem. Keep up the great work!
🐛 Bug
Hi guys, when test.py runs at 99%, it hits an error like the following:
(I didn't change the file...)