Closed ouening closed 4 years ago
Hello @ouening, thank you for your interest in our work! Please visit our Custom Training Tutorial to get started, and see our Jupyter Notebook, Docker Image, and Google Cloud Quickstart Guide for example environments.
If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.
If this is a custom model or data training question, please note that Ultralytics does not provide free personal support. As a leader in vision ML and AI, we do offer professional consulting, from simple expert advice up to delivery of fully customized, end-to-end production solutions for our clients, such as:
For more information please visit https://www.ultralytics.com.
@ouening if this occurs on your data and not coco128.yaml, then the cause will be your data. This guide might help: https://blog.roboflow.ai/how-to-train-yolov5-on-a-custom-dataset/
Thank you! The same dataset can be trained using yolov3 successfully. I'll try later
I get this error when using multi GPUs. The temporary fix is to use only a single GPU with --device 0
@JustinMBrown thanks for the feedback. Does this error also appear in the docker container? If so please give us instructions to reproduce this error, ideally on a dataset we can access like coco128.yaml. Thanks!
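For example, a minimal multi-GPU reproduction command along these lines would help (the epoch count and batch size here are just illustrative):
python train.py --img 640 --batch 16 --epochs 3 --data data/coco128.yaml --cfg models/yolov5s.yaml --weights yolov5s.pt --device 0,1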
I also experienced a similar issue to Justin's when training on my custom dataset. However, I got quite peculiar results.
When training on 4 GPUs, I got the forward error from the first post. This happened multiple times, so I tried to reduce the GPU count.
When training on 2 GPUs, I got significantly worse mAP: 80-90 (single GPU) vs 47 (2 GPUs). I will try to reproduce this error, as I've switched to using a single GPU after experiencing it.
Code to run for single: CUDA_VISIBLE_DEVICES=0 python train.py --img 640 --batch 16 --epochs 1000 --data ./data/obj.yaml --cfg ./models/yolov5x.yaml --weights yolov5x.pt --nosave --cache
I changed the cfg to match my class count. For multiple GPUs, I just changed the DEVICES variable.
This code was cloned on the 24th. I am now testing the newest pull on single GPU first, then multiple later.
@NanoCode012 thanks. The reduced multi-GPU performance may be due to batch norm layers not synchronizing across GPUs. I think there's a SyncBatchNorm layer that might be used to fix this. We do all of our training on single GPUs, so it's not something we've explored in depth.
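For reference, a minimal sketch of what that conversion looks like (placeholder model; note that SyncBatchNorm only takes effect when the model is later wrapped in DistributedDataParallel, not DataParallel):

```python
import torch.nn as nn

# Placeholder model with BatchNorm layers, standing in for the YOLOv5 model.
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())

# Convert every BatchNorm layer to SyncBatchNorm so running statistics are
# synchronized across processes during DistributedDataParallel training.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
```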
To use a specific cuda device:
python train.py --device 2
To use specific multiple devices:
python train.py --device 0,1,2,3
@glenn-jocher,
I just tried to run with 4 GPUs (and 2 GPUs) via:
CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py --img 640 --batch 16 --epochs 1000 --data ./data/obj.yaml --cfg ./models/yolov5x.yaml --weights yolov5x.pt --nosave --cache --device 0,1,2,3
python train.py --img 640 --batch 16 --epochs 1000 --data ./data/obj.yaml --cfg ./models/yolov5x.yaml --weights yolov5x.pt --nosave --cache --device 0,1,2,3
CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py --img 640 --batch 16 --epochs 1000 --data ./data/obj.yaml --cfg ./models/yolov5x.yaml --weights yolov5x.pt --nosave --cache
That is, a mix of with and without the --device argument.
They all still gave the same forward() missing 1 required positional argument error when calculating the mAP of the first epoch.
I also noted that the following warning occurs whether I specify CUDA_VISIBLE_DEVICES or just --device:
UserWarning: Single-Process Multi-GPU is not the recommended mode for DDP. In this mode, each DDP instance operates on multiple devices and creates multiple module replicas within one process. The overhead of scatter/gather and GIL contention in every forward pass can slow down training. Please consider using one DDP instance per device or per module replica by explicitly setting device_ids or CUDA_VISIBLE_DEVICES. NB: There is a known issue in nn.parallel.replicate that prevents a single DDP instance to operate on multiple model replicas.
As of now, I can't run it with more than 1 GPU on the latest code base.
Hi, @glenn-jocher!
I also got TypeError: forward() missing 1 required positional argument: 'x'.
Custom single-class dataset, 4x Tesla V100, batch sizes 2, 4, 8, etc., input size 1024. Used 2, 3, and 4 GPUs. Now trying with a single GPU.
@NanoCode012 @xevolesi @JustinMBrown thanks everyone. To be honest, the people who are best positioned to debug these multi-GPU bugs are you guys. We do most of our training on single GPUs. I've got 8 VMs running single-GPU experiments at the moment, as this is the most efficient cost structure in terms of FLOPS/$. This means I don't have much multi-GPU experience. The basic validation process for the repo is to run the following unit tests on a 2-T4 VM running our latest docker container, which tests everything on a mix of devices, and currently the unit tests are passing.
# Unit tests
rm -rf yolov5 && git clone https://github.com/ultralytics/yolov5 && cd yolov5
python3 -c "from utils.google_utils import *; gdrive_download('1n_oKgR81BJtqk75b00eAjdv03qVCQn2f', 'coco128.zip')" && mv ./coco128 ../
for d in 0 0,1 cpu # devices
do
for x in yolov5s #yolov5m yolov5l yolov5x # models
do
python detect.py --weights $x.pt --device $d
python test.py --weights $x.pt --device $d
python train.py --weights $x.pt --cfg $x.yaml --epochs 4 --img 320 --device $d
python detect.py --weights weights/last.pt --device 0
python test.py --weights weights/last.pt --device 0
python test.py --weights weights/last.pt --device $d
done
done
If you guys make any headway into these multi-gpu issues, please submit PRs for the benefit of everyone. Thank you!
Thank you! I trained yolov5 successfully using a single GPU.
@glenn-jocher, unfortunately I am unable to modify the code directly, as I don't have enough knowledge to do it yet.
One thing I noted was that multiple GPUs (8) work with detect.py and test.py, but not with train.py, which is quite weird because the failure happens when train.py calls test.test to calculate mAP.
Another point is that I'm running it on a server which blocks other ports, so I'm not sure if it's related to:
dist.init_process_group(backend='nccl', # distributed backend
init_method='tcp://127.0.0.1:9999', #<-- other process may not be able to communicate
world_size=1, # number of nodes
rank=0) # node rank
I do not have a local machine with multiple GPUs to test this on.
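As a side note, a quick local check of whether anything is already bound to the port from that init_method (this only tests local binding on the loopback address, not a server firewall; the port number simply mirrors the snippet above):

```python
import socket

def port_is_free(host="127.0.0.1", port=9999):
    # Try to bind the port; success means nothing else is listening on it locally.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False

print(port_is_free())
```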
Also, as I was reading about DistributedDataParallel in PyTorch, I saw that it is necessary to spawn multiple processes, one for each GPU, but I do not see that code in train.py:
import torch.multiprocessing as mp
mp.spawn(train....)
Also, setting the rank of the node per GPU isn't done dynamically, since you set it to rank=0 up above, so maybe every process sees itself as rank=0?
I apologize if I made any mistakes in coming to these conclusions, as I'm still learning.
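For reference, a minimal sketch of the one-process-per-GPU pattern from the PyTorch DDP tutorial that the comments above refer to (the worker function name, placeholder model, and port are illustrative, not code from this repo):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn

def train_worker(rank, world_size):
    # One process per GPU: each spawned worker gets its own rank and device.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "9999")  # illustrative port
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = nn.Linear(10, 2).cuda(rank)  # placeholder model
    ddp_model = nn.parallel.DistributedDataParallel(model, device_ids=[rank])

    # ... training loop using ddp_model would go here ...

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(train_worker, args=(world_size,), nprocs=world_size, join=True)
```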
Hi everybody, I am a newbie, but while experimenting with my own dataset I faced this issue. I simply did this to fix it: in train.py line 155 I commented out the line model = torch.nn.parallel.DistributedDataParallel(model) and added model = torch.nn.parallel.DataParallel(model), like this:
model = torch.nn.parallel.DataParallel(model)
I suppose this change makes less efficient use of my GPUs on PyTorch 1.5, but it got rid of the problem. I am still learning as well.
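For context, a minimal sketch of what the single-process DataParallel fallback above does (placeholder model, not the repo's code):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # placeholder model

# DataParallel stays in a single process and splits each input batch across
# all visible GPUs, so it avoids the DDP process-group setup entirely, at the
# cost of per-batch scatter/gather overhead.
if torch.cuda.is_available():
    model = model.cuda()
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)
```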
@NanoCode012 yes, I think the current best practices for multi-gpu in pytorch involves mp.spawn. There is a pytorch demo that shows how to do this here: https://pytorch.org/tutorials/intermediate/ddp_tutorial.html
One user benchmarked this implementation and saw much faster multi-gpu performance than the current implementation.
@diego0718, when reading around, a lot of people suggested to stop using DataParallel and to use DistributedDataParallel instead. Can you tell me if it ran faster (and by how much) or had better performance?
@glenn-jocher, thanks. I've been reading a few demos to get a grasp. I will try and see if I can get it working.
Hi everybody, I have found the cause of this bug: the number of images in the validation set must be divisible by 2 or 4 when using multiple GPUs (2 or 4 GPUs). I got rid of this problem by using an even number of validation images.
Hi, @AIpakchoi, and thank you! Do you mean the number of images in the validation set must be divisible by 2 or 4, i.e. by the number of GPUs?
Hi, @xevolesi, you are right. In train.py line 155 I commented out the line model = torch.nn.parallel.DistributedDataParallel(model) and added model = torch.nn.parallel.DistributedDataParallel(model, find_unused_parameters=True), like this:
model = torch.nn.parallel.DistributedDataParallel(model, find_unused_parameters=True)
I have tried several experiments and found that the number of images in the validation set must be divisible by 2 or 4, matching the number of GPUs. This might be a problem in test.py.
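A rough way to sanity-check the divisibility hypothesis above: if the final validation batch ends up smaller than the number of GPUs, single-process multi-GPU replication can leave some replicas with no input at all, which would explain the missing-'x' error. All numbers below are illustrative:

```python
num_gpus = 4          # illustrative GPU count
num_val_images = 18   # illustrative validation-set size
batch_size = 4        # illustrative test batch size

# Size of the last batch produced by the dataloader.
last_batch = num_val_images % batch_size or batch_size
status = "potentially problematic" if last_batch < num_gpus else "ok"
print(f"last validation batch has {last_batch} image(s) for {num_gpus} GPUs: {status}")
```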
@AIpakchoi, thanks! I quickly tested with 16 pictures for validation and it worked. However, I didn't need to set model = torch.nn.parallel.DistributedDataParallel(model, find_unused_parameters=True).
Interestingly, I tried with an odd number of training images, and there were no problems.
Lastly, can you tell me if there was a performance drop? I experienced a large drop when using 2 GPUs compared to 1.
Hi @NanoCode012, I did not train this code on the COCO dataset; I trained yolov5 on the KITTI dataset for autonomous driving, so I don't know the performance on COCO when using 2 GPUs. But I can share some tips with you. You can evolve hyperparameters by running:
python train.py --epochs 100 --evolve
I have also found that the prediction confidence is affected by batch size; I got satisfactory performance when setting batch_size=8.
I tried to evolve for 300 epochs on one custom dataset I had. It didn't make a big difference compared to training it normally.
It did slightly increase performance (for precision) at the early stage compared to not evolving, but it evened out in the middle (at my peak/plateau). I trained it for 1000 epochs on a small dataset, though, so maybe that's why.
I will try your advice on batch size.
@NanoCode012 you might want to see https://github.com/ultralytics/yolov3/issues/392
@glenn-jocher, thank you for that. I had read through it before. I got the hyperparameters below (the best run; the others weren't as high in fitness) and replaced them in the train.py hyp variable, but it didn't change much. I also tried to increase the number of times it evolves, 10 -> 20 -> 30, until I got a memory error.
Ah great! Your results show that momentum evolved to a much higher value than the default.
I think you are right; the test dataset should be divisible by the batch size, otherwise it will run into TypeError: forward() missing 1 required positional argument: 'x'.
Interestingly, it doesn't happen anymore. Maybe a commit fixed it. @zoezhu, are you on the latest version?
I git cloned it three days ago, but I am running it with my own dataset, so I don't know whether that causes this error.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Before submitting a bug report, please be aware that your issue must be reproducible with all of the following, otherwise it is non-actionable, and we can not help you: run git fetch && git status -uno to check and git pull to update the repo. If this is a custom dataset/training question you must include your train*.jpg, test*.jpg and results.png figures, or we can not help you. You can generate these with utils.plot_results().
🐛 Bug
Hi, I'm training yolov5 using a custom dataset with 8 classes. The dataset is in another directory (does it need to be next to the /yolov5 directory?). The structure of the dataset is:
The content of the yml file is:
To Reproduce (REQUIRED)
Input:
Output:
Expected behavior
It should train normally, but it failed after the first epoch's test run; I don't know why. Is it because the dataset is not next to /yolov5?
Environment