Closed ouening closed 4 years ago
Hello @ouening, thank you for your interest in our work! Please visit our Custom Training Tutorial to get started, and see our Jupyter Notebook, Docker Image, and Google Cloud Quickstart Guide for example environments.
If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.
If this is a custom model or data training question, please note that Ultralytics does not provide free personal support. As a leader in vision ML and AI, we do offer professional consulting, from simple expert advice up to delivery of fully customized, end-to-end production solutions for our clients, such as:
For more information please visit https://www.ultralytics.com.
@ouening if this occurs on your data and not coco128.yaml, then the cause will be your data. This guide might help: https://blog.roboflow.ai/how-to-train-yolov5-on-a-custom-dataset/
Thank you! The same dataset can be trained using yolov3 successfully. I'll try later
I get this error when using multi GPUs. The temporary fix is to use only a single GPU with --device 0
@JustinMBrown thanks for the feedback. Does this error also appear in the docker container? If so please give us instructions to reproduce this error, ideally on a dataset we can access like coco128.yaml. Thanks!
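For example, a minimal multi-GPU reproduction command along these lines would help (the epoch count and batch size here are just illustrative):
python train.py --img 640 --batch 16 --epochs 3 --data data/coco128.yaml --cfg models/yolov5s.yaml --weights yolov5s.pt --device 0,1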
I also experienced a similar issue to Justin's when training on my custom dataset. However, I got quite peculiar results.
When training on 4 GPUs, I got the forward error from the first post. This happened multiple times, so I tried to reduce the GPU count.
When training on 2 GPUs, I got significantly worse mAP: 80-90 (single GPU) vs 47 (2 GPUs). I will try to reproduce this error, as I've switched to using a single GPU after experiencing it.
Code to run for single: CUDA_VISIBLE_DEVICES=0 python train.py --img 640 --batch 16 --epochs 1000 --data ./data/obj.yaml --cfg ./models/yolov5x.yaml --weights yolov5x.pt --nosave --cache
I changed the cfg to match my class count. For multiple GPUs, I just changed the DEVICES variable.
This code was cloned on the 24th. I am now testing the newest pull on single GPU first, then multiple later.
@NanoCode012 thanks. The reduced multi-GPU performance may be due to batch norm layers not synchronizing across GPUs. I think there's a SyncBatchNorm layer that might be used to fix this. We do all of our training on single GPUs, so it's not something we've explored in depth.
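For reference, a minimal sketch of what that conversion looks like (placeholder model; note that SyncBatchNorm only takes effect when the model is later wrapped in DistributedDataParallel, not DataParallel):

```python
import torch.nn as nn

# Placeholder model with BatchNorm layers, standing in for the YOLOv5 model.
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())

# Convert every BatchNorm layer to SyncBatchNorm so running statistics are
# synchronized across processes during DistributedDataParallel training.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
```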
To use a specific cuda device:
python train.py --device 2
To use specific multiple devices:
python train.py --device 0,1,2,3
@glenn-jocher,
I just tried to run with 4 GPUs (and 2 GPUs) via:
CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py --img 640 --batch 16 --epochs 1000 --data ./data/obj.yaml --cfg ./models/yolov5x.yaml --weights yolov5x.pt --nosave --cache --device 0,1,2,3
python train.py --img 640 --batch 16 --epochs 1000 --data ./data/obj.yaml --cfg ./models/yolov5x.yaml --weights yolov5x.pt --nosave --cache --device 0,1,2,3
CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py --img 640 --batch 16 --epochs 1000 --data ./data/obj.yaml --cfg ./models/yolov5x.yaml --weights yolov5x.pt --nosave --cache
That is, a mix of with and without the --device argument.
They all still gave the same forward() missing 1 required positional argument error when calculating the mAP of the first epoch.
I also noted that the following warning occurs whether I specify CUDA_VISIBLE_DEVICES or just --device:
UserWarning: Single-Process Multi-GPU is not the recommended mode for DDP. In this mode, each DDP instance operates on multiple devices and creates multiple module replicas within one process. The overhead of scatter/gather and GIL contention in every forward pass can slow down training. Please consider using one DDP instance per device or per module replica by explicitly setting device_ids or CUDA_VISIBLE_DEVICES. NB: There is a known issue in nn.parallel.replicate that prevents a single DDP instance to operate on multiple model replicas.
As of now, I can't run it with more than 1 GPU on the latest code base.
Hi, @glenn-jocher!
I also got TypeError: forward() missing 1 required positional argument: 'x'.
Custom single-class dataset, 4x Tesla V100, batch sizes 2, 4, 8, etc., input size 1024. Used 2, 3, and 4 GPUs. Now trying with a single GPU.
@NanoCode012 @xevolesi @JustinMBrown thanks everyone. To be honest, the people who are best positioned to debug these multi-GPU bugs are you guys. We do most of our training on single GPUs. I've got 8 VMs running single-GPU experiments at the moment, as this is the most efficient cost structure in terms of FLOPS/$. This means I don't have much multi-GPU experience. The basic validation process for the repo is to run the following unit tests on a 2-T4 VM running our latest docker container, which tests everything on a mix of devices, and currently the unit tests are passing.
# Unit tests
rm -rf yolov5 && git clone https://github.com/ultralytics/yolov5 && cd yolov5
python3 -c "from utils.google_utils import *; gdrive_download('1n_oKgR81BJtqk75b00eAjdv03qVCQn2f', 'coco128.zip')" && mv ./coco128 ../
for d in 0 0,1 cpu # devices
do
for x in yolov5s #yolov5m yolov5l yolov5x # models
do
python detect.py --weights $x.pt --device $d
python test.py --weights $x.pt --device $d
python train.py --weights $x.pt --cfg $x.yaml --epochs 4 --img 320 --device $d
python detect.py --weights weights/last.pt --device 0
python test.py --weights weights/last.pt --device 0
python test.py --weights weights/last.pt --device $d
done
done
If you guys make any headway into these multi-gpu issues, please submit PRs for the benefit of everyone. Thank you!
Thank you! I trained yolov5 successfully using a single GPU.
@glenn-jocher, unfortunately I am unable to modify the code directly, as I don't have enough knowledge to do it yet.
One thing I noted was that multiple GPUs (8) work with detect.py and test.py, but not with train.py, which is quite weird because the failure happens when train.py calls test.test to calculate mAP.
Another point is that I'm running it on a server which blocks other ports, so I'm not sure if it's related to:
dist.init_process_group(backend='nccl', # distributed backend
init_method='tcp://127.0.0.1:9999', #<-- other process may not be able to communicate
world_size=1, # number of nodes
rank=0) # node rank
I do not have a local machine with multiple GPUs to test this on.
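As a side note, a quick local check of whether anything is already bound to the port from that init_method (this only tests local binding on the loopback address, not a server firewall; the port number simply mirrors the snippet above):

```python
import socket

def port_is_free(host="127.0.0.1", port=9999):
    # Try to bind the port; success means nothing else is listening on it locally.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False

print(port_is_free())
```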
Also, as I was reading about DistributedDataParallel in PyTorch, I saw that it is necessary to spawn multiple processes, one for each GPU, but I do not see that code in train.py:
import torch.multiprocessing as mp
mp.spawn(train....)
Also, setting the rank of the node per GPU isn't done dynamically, since you set it to rank=0 up above, so maybe every process sees itself as rank=0?
I apologize if I made any mistakes in coming to these conclusions, as I'm still learning.
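For reference, a minimal sketch of the one-process-per-GPU pattern from the PyTorch DDP tutorial that the comments above refer to (the worker function name, placeholder model, and port are illustrative, not code from this repo):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn

def train_worker(rank, world_size):
    # One process per GPU: each spawned worker gets its own rank and device.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "9999")  # illustrative port
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = nn.Linear(10, 2).cuda(rank)  # placeholder model
    ddp_model = nn.parallel.DistributedDataParallel(model, device_ids=[rank])

    # ... training loop using ddp_model would go here ...

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(train_worker, args=(world_size,), nprocs=world_size, join=True)
```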
Hi everybody, I am a newbie, but while experimenting with my own dataset I faced this issue. I simply did this to fix it: in train.py line 155 I commented out the line model = torch.nn.parallel.DistributedDataParallel(model) and added model = torch.nn.parallel.DataParallel(model), like this:
model = torch.nn.parallel.DataParallel(model)
I suppose this change makes less efficient use of my GPUs on PyTorch 1.5, but it got rid of the problem. I am still learning as well.
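For context, a minimal sketch of what the single-process DataParallel fallback above does (placeholder model, not the repo's code):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # placeholder model

# DataParallel stays in a single process and splits each input batch across
# all visible GPUs, so it avoids the DDP process-group setup entirely, at the
# cost of per-batch scatter/gather overhead.
if torch.cuda.is_available():
    model = model.cuda()
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)
```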
@NanoCode012 yes, I think the current best practices for multi-gpu in pytorch involves mp.spawn. There is a pytorch demo that shows how to do this here: https://pytorch.org/tutorials/intermediate/ddp_tutorial.html
One user benchmarked this implementation and saw much faster multi-gpu performance than the current implementation.
@diego0718, when reading around, a lot of people suggested to stop using DataParallel and to use DistributedDataParallel instead. Can you tell me if it ran faster (and by how much) or had better performance?
@glenn-jocher, thanks. I've been reading a few demos to get a grasp. I will try and see if I can get it working.
Hi everybody, I have found the cause of this bug: the number of images in the validation set must be divisible by 2 or 4 when using multiple GPUs (2 or 4 GPUs). I got rid of this problem by using an even number of validation images.
Hi, @AIpakchoi, and thank you! Do you mean the number of images in the validation set must be divisible by 2 or 4, i.e. by the number of GPUs?
Hi, @xevolesi, you are right. In train.py line 155 I commented out the line model = torch.nn.parallel.DistributedDataParallel(model) and added model = torch.nn.parallel.DistributedDataParallel(model, find_unused_parameters=True), like this:
model = torch.nn.parallel.DistributedDataParallel(model, find_unused_parameters=True)
I have tried several experiments and found that the number of images in the validation set must be divisible by 2 or 4, matching the number of GPUs. This might be a problem in test.py.
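A rough way to sanity-check the divisibility hypothesis above: if the final validation batch ends up smaller than the number of GPUs, single-process multi-GPU replication can leave some replicas with no input at all, which would explain the missing-'x' error. All numbers below are illustrative:

```python
num_gpus = 4          # illustrative GPU count
num_val_images = 18   # illustrative validation-set size
batch_size = 4        # illustrative test batch size

# Size of the last batch produced by the dataloader.
last_batch = num_val_images % batch_size or batch_size
status = "potentially problematic" if last_batch < num_gpus else "ok"
print(f"last validation batch has {last_batch} image(s) for {num_gpus} GPUs: {status}")
```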
@AIpakchoi, thanks! I quickly tested with 16 pictures for validation and it worked. However, I didn't need to set model = torch.nn.parallel.DistributedDataParallel(model, find_unused_parameters=True).
Interestingly, I tried with an odd number of training images, and there were no problems.
Lastly, can you tell me if there was a performance drop? I experienced a large drop when using 2 GPUs compared to 1.
Hi @NanoCode012, I did not train this code on the COCO dataset; I trained yolov5 on the KITTI dataset for autonomous driving, so I don't know the performance on COCO when using 2 GPUs. But I can share some tips with you. You can evolve hyperparameters by running:
python train.py --epochs 100 --evolve
I have also found that the prediction confidence is affected by batch size; I got satisfactory performance when setting batch_size=8.
I tried to evolve for 300 epochs on one custom dataset I had. It didn't make a big difference compared to training it normally.
It did slightly increase performance (for precision) at the early stage compared to not evolving, but it evened out in the middle (at my peak/plateau). I trained it for 1000 epochs on a small dataset, though, so maybe that's why.
I will try your advice on batch size.
@NanoCode012 you might want to see https://github.com/ultralytics/yolov3/issues/392
@glenn-jocher, thank you for that. I had read through it before. I got the hyperparameters below (the best run; the others weren't as high in fitness) and replaced them in the train.py hyp variable, but it didn't change much. I also tried to increase the number of times it evolves, 10 -> 20 -> 30, until I got a memory error.
Ah great! Your results show that momentum evolved to a much higher value than the default.
I think you are right; the test dataset should be divisible by the batch size, otherwise it will run into TypeError: forward() missing 1 required positional argument: 'x'.
Interestingly, it doesn't happen anymore. Maybe a commit fixed it. @zoezhu, are you on the latest version?
I git cloned it three days ago, but I am running it with my own dataset, so I don't know whether that causes this error.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Before submitting a bug report, please be aware that your issue must be reproducible with all of the following, otherwise it is non-actionable, and we can not help you: run git fetch && git status -uno to check and git pull to update the repo. If this is a custom dataset/training question you must include your train*.jpg, test*.jpg and results.png figures, or we can not help you. You can generate these with utils.plot_results().
🐛 Bug
Hi, I'm training yolov5 using a custom dataset with 8 classes. The dataset is in another directory (does it need to be next to the /yolov5 directory?). The structure of the dataset is:
The content of the yml file is:
To Reproduce (REQUIRED)
Input:
Output:
Expected behavior
It should train normally, but it failed after the first epoch's test run; I don't know why. Is it because the dataset is not next to /yolov5?
Environment