Closed buaazsj closed 3 years ago
@buaazsj there are a few issues open regarding --resume that talk about this, you might want to search the issues a bit.
Is there any update on this issue? I've been facing the same problem: low mAP when training in a 2- or 4-GPU setting. Training on 1 GPU works perfectly fine.
Edit: when training in a multi-GPU setting, the training losses (GIoU, cls and obj) are the same as in the 1-GPU setting. It's only the validation loss and mAP that are reduced.
Edit 2: OK, so this is weird. Evaluation via test.test during multi-GPU training gives low mAP. BUT, if you separately evaluate the checkpoint that was saved to disk for the same epoch, by running python test.py ...., you get the correct mAP! By correct, I mean the mAP is similar to training in the 1-GPU setting. I'm testing this on yolov3-tiny.cfg. Will report more when training finishes in 1-2 days.
@akshaychawla hello, thank you for your interest in our work! This issue seems to lack the minimum requirements for a proper response, or is insufficiently detailed for us to help you. Please note that most technical problems are due to:
Your changes to the default repository. If your issue is not reproducible in a fresh git clone of this repository we cannot debug it. Before going further, run this code and ensure your issue persists:
sudo rm -rf yolov5 # remove existing
git clone https://github.com/ultralytics/yolov5 && cd yolov5 # clone latest
python detect.py # verify detection
# CODE TO REPRODUCE YOUR ISSUE HERE
Your custom data. If your issue is not reproducible with COCO or COCO128 data we cannot debug it. Visit our Custom Training Tutorial for guidelines on training your custom data. Examine train_batch0.jpg and test_batch0.jpg for a sanity check of training and testing data.
Your environment. If your issue is not reproducible in one of the verified environments below we can not debug it. If you are running YOLOv5 locally, ensure your environment meets all of the requirements.txt dependencies specified below.
If none of these apply to you, we suggest you close this issue and raise a new one using the Bug Report template, providing screenshots and minimum viable code to reproduce your issue. Thank you!
Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.6. To install run:
$ pip install -r requirements.txt
YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):
If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are passing. These tests evaluate proper operation of basic YOLOv5 functionality, including training (train.py), testing (test.py), inference (detect.py) and export (export.py) on macOS, Windows and Ubuntu.
I'm running into the same issue. Have you solved it?
Ultralytics has open-sourced YOLOv5 at https://github.com/ultralytics/yolov5, featuring faster, lighter and more accurate object detection. YOLOv5 is recommended for all new projects.
GPU Speed measures end-to-end time per image averaged over 5000 COCO val2017 images using a V100 GPU with batch size 32, and includes image preprocessing, PyTorch FP16 inference, postprocessing and NMS. EfficientDet data from google/automl at batch size 8.
Model | APval | APtest | AP50 | SpeedGPU | FPSGPU | params | FLOPS
---|---|---|---|---|---|---|---
YOLOv5s | 37.0 | 37.0 | 56.2 | 2.4ms | 416 | 7.5M | 13.2B
YOLOv5m | 44.3 | 44.3 | 63.2 | 3.4ms | 294 | 21.8M | 39.4B
YOLOv5l | 47.7 | 47.7 | 66.5 | 4.4ms | 227 | 47.8M | 88.1B
YOLOv5x | 49.2 | 49.2 | 67.7 | 6.9ms | 145 | 89.0M | 166.4B
YOLOv5x + TTA | 50.8 | 50.8 | 68.9 | 25.5ms | 39 | 89.0M | 354.3B
YOLOv3-SPP | 45.6 | 45.5 | 65.2 | 4.5ms | 222 | 63.0M | 118.0B
APtest denotes COCO test-dev2017 server results, all other AP results in the table denote val2017 accuracy.
All AP numbers are for single-model single-scale without ensemble or test-time augmentation. Reproduce by python test.py --data coco.yaml --img 640 --conf 0.001
SpeedGPU measures end-to-end time per image averaged over 5000 COCO val2017 images using a GCP n1-standard-16 instance with one V100 GPU, and includes image preprocessing, PyTorch FP16 image inference at --batch-size 32 --img-size 640, postprocessing and NMS. Average NMS time included in this chart is 1-2ms/img. Reproduce by python test.py --data coco.yaml --img 640 --conf 0.1
All checkpoints are trained to 300 epochs with default settings and hyperparameters (no autoaugmentation).
Test Time Augmentation (TTA) runs at 3 image sizes. Reproduce by python test.py --data coco.yaml --img 832 --augment
For more information and to get started with YOLOv5 please visit https://github.com/ultralytics/yolov5. Thank you!
Here are the results:
Issue: when training with multiple GPUs under DDP, validation mAP is lower than expected.
Observation: if we test a serialized version of the DDP model (i.e. save it to disk with torch.save and load it back with torch.load), the validation mAP is fine.
Fix: during training, instead of running test.test on the model currently being trained under DDP, write a temporary checkpoint to disk and have test build a new model, load that checkpoint, and run validation with it.
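The workaround can be sketched roughly like this. This is a minimal sketch, not the repo's actual code: build_model and run_test are hypothetical stand-ins for the repository's model constructor and test.test.

```python
import torch
import torch.nn as nn

def evaluate_via_checkpoint(model, build_model, run_test, device="cpu"):
    # Unwrap DDP (if wrapped) so the checkpoint holds plain module weights.
    module = model.module if isinstance(
        model, nn.parallel.DistributedDataParallel) else model
    tmp = "tmp_eval.pt"
    torch.save({"model": module.state_dict()}, tmp)

    # Rebuild a fresh model, load the checkpoint, and validate with that
    # instead of the live DDP-wrapped training model.
    fresh = build_model().to(device)
    fresh.load_state_dict(torch.load(tmp, map_location=device)["model"])
    fresh.eval()  # every registered submodule switches to eval mode
    with torch.no_grad():
        return run_test(fresh)
```

The key property is that the freshly built model has no DDP wrapper and no stale attribute references, so eval() reliably reaches all of its layers.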
Why does it work? No idea; just dumb luck, I guess. If I had to guess, when we move model.module.yolo_layers to model.yolo_layers after DDP init and then switch to eval with model.eval(), eval mode may not be getting switched on for the yolo_layers, since they live outside model.module. Or it might have something to do with the test dataloader.
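That eval-mode hypothesis is easy to demonstrate in isolation. This toy reproduction is not the repo's actual code, but it shows the suspected mechanism: nn.Module.eval() only recurses into registered child modules, so submodules stashed in a plain Python attribute are silently skipped.

```python
import torch.nn as nn

class Wrapper(nn.Module):
    def __init__(self):
        super().__init__()
        # Registered child: eval()/train() will recurse into it.
        self.inner = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8))

w = Wrapper()
# A plain list attribute is NOT registered with nn.Module bookkeeping
# (unlike nn.ModuleList), so eval() never reaches these layers.
w.detached_layers = [nn.BatchNorm2d(8)]
w.eval()
print(w.inner[1].training)            # False: registered child switched
print(w.detached_layers[0].training)  # True: stayed in training mode
```

If yolo_layers ended up referenced outside the registered module tree after DDP init, they would likewise keep their training-mode behavior during validation.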
The minor changes required for train.py & test.py can be viewed here: https://github.com/akshaychawla/yolov3/commit/4e5eaab4907e037579f6690296fe3d556c621d4e
Another observation: when I increased the batch size to 128 and trained the model, I expected it to perform worse, since a higher batch size means fewer gradient updates and therefore usually lower performance. BUT, this repository supports gradient accumulation via loss *= batch_size / 64, which I think was originally designed for cases where batch_size < 64. For batch_size > 64 it has the effect of scaling up the loss, which is the same as scaling up the learning rate hyp['lr0'] to accommodate the higher batch size. More importantly, it sped up training:
#gpu | Time to 300 epochs (hrs) |
---|---|
1 | 42.843 |
2 | 36.740 |
4 | 26.455 |
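The accumulation/scaling behavior described above can be sketched like this (illustrative names and a toy model, not the repo's exact training loop):

```python
import torch
import torch.nn as nn

nominal = 64      # the nominal batch size the loss is normalized to
batch_size = 128
# For batch_size < 64, gradients are accumulated over ~64/batch_size
# batches before an optimizer step, so the effective batch is ~64.
# For batch_size >= 64 no accumulation occurs (accumulate == 1).
accumulate = max(round(nominal / batch_size), 1)

model = nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

for i in range(4):
    x, y = torch.randn(batch_size, 10), torch.randn(batch_size, 1)
    loss = nn.functional.mse_loss(model(x), y)
    loss *= batch_size / nominal   # > 1 here, so gradients (and hence the
                                   # effective learning rate) scale up
    loss.backward()
    if (i + 1) % accumulate == 0:  # optimizer step every `accumulate` batches
        opt.step()
        opt.zero_grad()
```

With batch_size = 128 the factor is 2, so each update behaves roughly like doubling hyp['lr0'] while halving the number of updates per epoch.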
Edit: logs, results and checkpoints are available at https://drive.google.com/drive/folders/11CN50wq-e0COr9RnWPtBHgj8z-AfFBq6?usp=sharing
@akshaychawla thanks so much for the detailed analysis! It looks like you've put in a lot of work and arrived at a very useful insight.
I would highly recommend you try YOLOv5; it has a multitude of feature additions, improvements and bug fixes above and beyond this YOLOv3 repo. DDP is functioning correctly there: we use it to train the largest official YOLOv5x model with no problems. https://github.com/ultralytics/yolov5
@glenn-jocher Thank YOU for building, open-sourcing and then maintaining this and the yolov5 repository!
This is a very well written piece of software and I have learned so much from it. I would love to transition to YOLOv5, but reviewer 3 will ask me to compare to "existing state of the art" before a borderline reject, so my hands are tied to v3 for now 🤷♂️
@glenn-jocher I trained YOLOv3-SPP in the yolov3 repo and YOLOv5x in the yolov5 repo on my cls.cfg dataset, and got the following results: yolov3 is better than yolov5. Why? yolov3:
yolov5:
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@goldwater668 It's great to hear about the comparison! YOLOv3 in the YOLOv5 repository features an updated architecture that may outperform YOLOv5 in certain scenarios. YOLOv5, however, offers a significant number of improvements and optimizations over YOLOv3. I would recommend reviewing the recent updates and optimizations in YOLOv5 to ensure an apples-to-apples comparison. Thank you!
❔Question
When training with multiple GPUs, my validation mAP value is very low, but when training is interrupted and then resumed, it becomes normal. Why?
Additional context