Closed erdongchendou closed 4 years ago
Hello @erdongchendou, thank you for your interest in our work! Please visit our Custom Training Tutorial to get started, and see our Google Colab Notebook, Docker Image, and GCP Quickstart Guide for example environments.
If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue; otherwise we cannot help you.
@erdongchendou there have been many changes on the training side. To reproduce see https://github.com/ultralytics/yolov3#reproduce-our-results
@glenn-jocher Thank you for your prompt reply.
#!/usr/bin/env bash
log=logs/train_coco_yolov3_spp.py
nohup python3 -u train.py --weights '' --cfg yolov3-spp.cfg \
--epochs 273 --batch 16 --accum 4 --multi \
--data data/coco2014.data --device 8 >> $log 2>&1 &
tail -f $log
When I use the command you provided to train yolov3 from scratch on coco2014, training starts fine, but after 16 iterations it runs out of memory. My GPU is a GeForce RTX 2080 Ti, the same one mentioned in your README.md. Did I do something wrong? My driver version is 418.43 and my CUDA version is 10.1. The error log follows.
Namespace(accumulate=4, adam=False, batch_size=16, bucket='', cache_images=False, cfg='yolov3-spp.cfg', data='data/coco2014.data', device='3', epochs=273, evolve=False, img_size=[416], multi_scale=True, name='', nosave=False, notest=False, rect=False, resume=False, single_cls=False, var=None, weights='')
Using CUDA device0 _CudaDeviceProperties(name='GeForce RTX 2080 Ti', total_memory=10989MB)
Using multi-scale 288 - 640
Model Summary: 225 layers, 6.29987e+07 parameters, 6.29987e+07 gradients
Caching labels (117263 found, 0 missing, 0 empty, 0 duplicate, for 117263 images): 100%|██████████| 117263/117263 [00:17<00:00, 6743.68it/s]
Caching labels (4954 found, 46 missing, 0 empty, 0 duplicate, for 5000 images): 100%|██████████| 5000/5000 [00:00<00:00, 6084.72it/s]
Using 8 dataloader workers
Starting training for 273 epochs...
Epoch gpu_mem GIoU obj cls total targets img_size
0/272 10.4G 6.81 9.48 7.55 23.8 282 416: 0%| | 16/7329 [00:11<56:35, 2.15it/s]
Traceback (most recent call last):
File "train.py", line 433, in <module>
train() # train normally
File "train.py", line 272, in train
pred = model(imgs)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/xhzyssd/chenleilei/PycharmProjects/yolov3/models.py", line 275, in forward
x = module(x)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/batchnorm.py", line 83, in forward
exponential_average_factor, self.eps)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/functional.py", line 1697, in batch_norm
training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 10.73 GiB total capacity; 9.59 GiB already allocated; 21.56 MiB free; 296.53 MiB cached)
0/272 10.4G 6.81 9.48 7.55 23.8 282 416: 0%| | 16/7329 [00:12<1:33:37, 1.30it/s]
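The numbers in this log are consistent with each other, and a little arithmetic makes that visible. The following is a minimal sketch, assuming the progress-bar total counts batches per epoch, that --accum multiplies the batch size into an effective batch, and that activation memory grows roughly with the number of input pixels. The variable names are illustrative and are not taken from train.py.

```python
import math

# Values copied from the Namespace line, the label-caching line,
# and the "Using multi-scale 288 - 640" line in the log above.
train_images = 117263
batch_size = 16
accumulate = 4
base_img_size = 416
max_multi_scale = 640

# Batches per epoch; this matches the 7329 shown in the progress bar.
batches_per_epoch = math.ceil(train_images / batch_size)

# Effective batch size, assuming gradients are accumulated over 4 batches.
effective_batch = batch_size * accumulate

# Pixel-count ratio between the largest multi-scale image and a 416 image.
# Assumption: per-batch activation memory scales roughly with this ratio.
area_ratio = (max_multi_scale / base_img_size) ** 2

print(batches_per_epoch)     # 7329
print(effective_batch)       # 64
print(round(area_ratio, 2))  # 2.37
```

Under these assumptions, a batch at the top of the multi-scale range holds about 2.4 times as many pixels as a batch at 416, which may help explain why memory use can vary from one iteration to the next.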
@erdongchendou you don't have apex installed. You need to install NVIDIA Apex; training will then detect and use it automatically.
You can tell it is working correctly because it will print 'Using CUDA Apex' instead of 'Using CUDA'.
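One way to check this before launching a long run is to test whether the package can be imported. This is a minimal sketch, not the repository's own detection code. The helper name has_module is hypothetical, and it only checks that a top-level package named apex can be located, not that it was built with its CUDA extensions.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module called `name` can be located."""
    return importlib.util.find_spec(name) is not None


if has_module("apex"):
    print("apex found")
else:
    print("apex not found: install NVIDIA Apex before training")
```

The startup message ('Using CUDA Apex' versus 'Using CUDA') remains the authoritative check, since it reflects what the training script actually detected.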
Thank you very much. After I installed apex following the instructions in requirements.txt, my training is running well now.
Great!
After 169.115 hours of training on a GeForce RTX 2080 Ti, I got almost the same mAP as reported in the README.md. My mAP is as follows:
Speed: 9.8/2.1/11.9 ms inference/NMS/total per 608x608 image at batch-size 32
COCO mAP with pycocotools...
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.415
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.614
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.443
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.246
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.455
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.522
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.340
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.552
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.602
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.439
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.639
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.730
Thank you very much.
@erdongchendou great! We also updated NMS in the last few days to a new SOTA type called Merge NMS, which should bump the mAP of your trained model a bit as well. If you git pull and re-run you should see this higher mAP :)
After pulling the newest code and testing my trained model, the mAP is as follows; it did indeed increase a bit.
COCO mAP with pycocotools...
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.417
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.614
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.446
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.248
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.458
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.524
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.342
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.557
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.608
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.444
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.649
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.738
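To make the change between the two runs easier to see, the headline numbers can be compared directly. This is a small sketch using values copied from the two pycocotools summaries above; the short keys (AP, AP50, AP75, AR100) are labels chosen here for illustration.

```python
# Values copied from the two pycocotools summaries above
# (area=all; AP and AR100 use IoU=0.50:0.95 and maxDets=100).
before = {"AP": 0.415, "AP50": 0.614, "AP75": 0.443, "AR100": 0.602}
after = {"AP": 0.417, "AP50": 0.614, "AP75": 0.446, "AR100": 0.608}

# Difference per metric, rounded to the three decimals pycocotools prints.
deltas = {name: round(after[name] - before[name], 3) for name in before}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.3f}")
```

With these values the stricter metrics move (AP by +0.002, AP75 by +0.003, AR100 by +0.006) while AP at IoU=0.50 is unchanged.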
By the way, if I train on my own data, do you suggest training from scratch or starting from a pretrained model? If the latter, which pretrained model do you suggest: the one pretrained on ImageNet classification data or the one pretrained on COCO detection data?
'yolov3-spp.weights': '16lYS4bcIdM2HdmyJBVDOvt3Trx6N3W2R',
'yolov3.weights': '1uTlyDWlnaqXcsKOktP5aH_zRDbfcDp-y',
'yolov3-tiny.weights': '1CCF-iNIIkYesIDzaPvdwlcf7H9zSsKZQ',
'yolov3-spp.pt': '1f6Ovy3BSq2wYq4UfvFUpxJFNDFfrIDcR',
'yolov3.pt': '1SHNFyoe5Ni8DajDNEqgB2oVKBb_NoEad',
'yolov3-tiny.pt': '10m_3MlpQwRtZetQxtksm9jqHrPTHZ6vo',
'darknet53.conv.74': '1WUVBid-XuoUBmvzBVUCBl_ELrzqwA8dJ',
'yolov3-tiny.conv.15': '1Bw0kCpplxUqyRYAJr9RY9SGnOJbo9nEj',
'ultralytics49.pt': '158g62Vs14E3aj7oPVPuEnNZMKFNgGyNq',
'ultralytics68.pt': '1Jm8kqnMdMGUUxGo8zMFZMJ0eaPwLkxSG',
'yolov3-spp-ultralytics.pt': '1UcR-zVoMs7DH5dj3N1bswkiQTA4dmKF4'
@erdongchendou great! If you have a large dataset like COCO, train from scratch. If you have a smaller dataset, start from the pretrained yolov3-spp-ultralytics.pt; this is the best model.
Got it. Thank you very much.
@glenn-jocher Hi Glenn, which COCO data did you train your weights on? I am looking for a weights file that was trained on coco2014.
@joel5638 for this repo we are sticking with COCO2014, for easy comparisons with darknet-trained yolov3.
The default weights (yolov3-spp-ultralytics.pt) are trained on COCO2014. The weights download automatically (if they are not already present on your system) when running python3 test.py or python3 detect.py, for example.
@joel5638 if you are in doubt about whether you have the latest weights, you can simply delete all of your local *.pt files and then run python3 detect.py to download the latest.
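The cleanup step can be scripted if you prefer to review what will be deleted first. This is a minimal sketch; the function name remove_pt_files is hypothetical, and the weights directory used in the example call is an assumption about where your local *.pt files live.

```python
from pathlib import Path


def remove_pt_files(directory: str, dry_run: bool = True) -> list:
    """List the *.pt files directly inside `directory`.

    With dry_run=True (the default) nothing is deleted; with
    dry_run=False each listed file is removed. Returns the list of
    matching paths either way.
    """
    files = sorted(Path(directory).glob("*.pt"))
    if not dry_run:
        for path in files:
            path.unlink()
    return files


# Example: preview which files would be removed from an assumed "weights" folder.
for path in remove_pt_files("weights", dry_run=True):
    print("would remove", path)
```

Calling the function again with dry_run=False removes the listed files, after which running python3 detect.py should fetch the weights again as described above.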
@glenn-jocher oh okay sure Glenn. Thank you.
@joel5638 you're welcome! If you have any more questions, feel free to ask.
Thank you very much for this amazing repository!
The mAP@0.5...0.95 of your pretrained model YOLOv3-SPP-ultralytics is at least 5% higher than the official yolov3 at different model sizes (320, 416, 512, 608), which is a significant improvement! I have tested the pretrained model you provide, and I get almost the same mAP as you mentioned in the README.md.
My question is: what did you do to improve your mAP so significantly?
How can I reproduce your experiment? Which pretrained model should I use? Is it the official pretrained model darknet53.conv.74? Should I just use this command?
python3 train.py --weights '' --cfg yolov3-spp.cfg --epochs 273 --batch 16 --accum 4 --multi