ultralytics / yolov3

YOLOv3 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

How to reproduce your YOLOv3-SPP-ultralytics mAP? #941

Closed erdongchendou closed 4 years ago

erdongchendou commented 4 years ago

Thank you very much for this amazing repository!

The mAP@0.5:0.95 of your pretrained model YOLOv3-SPP-ultralytics is at least 5% higher than the official YOLOv3 at every image size (320, 416, 512, 608), which is a significant improvement! I have tested the pretrained model you provide, and I get almost the same mAP as reported in the README.md.

My question is: what did you do to improve your mAP so significantly?

How can I reproduce your experiment? Which pretrained model should I use? Is it the official pretrained model darknet53.conv.74? Should I just use this command?

python3 train.py --weights '' --cfg yolov3-spp.cfg --epochs 273 --batch 16 --accum 4 --multi
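For reference, `--batch 16 --accum 4` in the command above corresponds to gradient accumulation: four batches of 16 images contribute to each optimizer step, for an effective batch size of 64. The following is a minimal sketch in plain Python, assuming the gradients are averaged across the accumulated batches (the thread does not say whether train.py averages or sums them). The names `grad_mse` and `accumulated_grad` and the toy data are hypothetical and not part of the repository.

```python
import math


def grad_mse(w, xs, ys):
    """Gradient of the mean squared error of y ~ w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, batch_size, accumulate):
    """Average the gradients of `accumulate` consecutive micro-batches."""
    total = 0.0
    for i in range(accumulate):
        lo, hi = i * batch_size, (i + 1) * batch_size
        total += grad_mse(w, xs[lo:hi], ys[lo:hi]) / accumulate
    return total


batch_size, accumulate = 16, 4        # values taken from the command above
effective = batch_size * accumulate   # images per optimizer step

# Toy data, purely illustrative.
xs = [float(i) for i in range(effective)]
ys = [3.0 * x + 1.0 for x in xs]

full = grad_mse(0.5, xs, ys)
accum = accumulated_grad(0.5, xs, ys, batch_size, accumulate)
print(effective, math.isclose(full, accum, rel_tol=1e-9))
```

With equal-sized micro-batches, the averaged accumulated gradient matches the gradient of one batch of 64, while only 16 images have to be held in GPU memory at a time.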

github-actions[bot] commented 4 years ago

Hello @erdongchendou, thank you for your interest in our work! Please visit our Custom Training Tutorial to get started, and see our Google Colab Notebook, Docker Image, and GCP Quickstart Guide for example environments.

If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue; otherwise we cannot help you.

glenn-jocher commented 4 years ago

@erdongchendou there have been many changes on the training side. To reproduce see https://github.com/ultralytics/yolov3#reproduce-our-results

erdongchendou commented 4 years ago

@glenn-jocher Thank you for your prompt reply.

#!/usr/bin/env bash
log=logs/train_coco_yolov3_spp.py
nohup python3 -u train.py --weights '' --cfg yolov3-spp.cfg \
    --epochs 273 --batch 16 --accum 4 --multi \
    --data data/coco2014.data --device 8 >> $log 2>&1 &

tail -f $log

When I use the command you provide to train YOLOv3 from scratch on COCO2014, training starts fine, but after 16 iterations it runs out of memory. My GPU is a GeForce RTX 2080 Ti, the same card mentioned in your README.md. Is there anything I did wrong? My driver version is 418.43 and my CUDA version is 10.1. The error log follows.

Namespace(accumulate=4, adam=False, batch_size=16, bucket='', cache_images=False, cfg='yolov3-spp.cfg', data='data/coco2014.data', device='3', epochs=273, evolve=False, img_size=[416], multi_scale=True, name='', nosave=False, notest=False, rect=False, resume=False, single_cls=False, var=None, weights='')
Using CUDA device0 _CudaDeviceProperties(name='GeForce RTX 2080 Ti', total_memory=10989MB)

Using multi-scale 288 - 640
Model Summary: 225 layers, 6.29987e+07 parameters, 6.29987e+07 gradients
Caching labels (117263 found, 0 missing, 0 empty, 0 duplicate, for 117263 images): 100%|██████████| 117263/117263 [00:17<00:00, 6743.68it/s]
Caching labels (4954 found, 46 missing, 0 empty, 0 duplicate, for 5000 images): 100%|██████████| 5000/5000 [00:00<00:00, 6084.72it/s]
Using 8 dataloader workers
Starting training for 273 epochs...

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
     0/272     10.4G      6.81      9.48      7.55      23.8       282       416:   0%|          | 16/7329 [00:11<56:35,  2.15it/s]
Traceback (most recent call last):
  File "train.py", line 433, in <module>
    train()  # train normally
  File "train.py", line 272, in train
    pred = model(imgs)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/xhzyssd/chenleilei/PycharmProjects/yolov3/models.py", line 275, in forward
    x = module(x)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/batchnorm.py", line 83, in forward
    exponential_average_factor, self.eps)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/functional.py", line 1697, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 10.73 GiB total capacity; 9.59 GiB already allocated; 21.56 MiB free; 296.53 MiB cached)
     0/272     10.4G      6.81      9.48      7.55      23.8       282       416:   0%|          | 16/7329 [00:12<1:33:37,  1.30it/s]
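
Two numbers in the log above can be checked with simple arithmetic: the 7329 iterations per epoch and the multi-scale range 288 - 640. The sketch below is an assumption-laden reconstruction, not the repository's code: `multi_scale_range` is a hypothetical helper, and the stride of 32 and the factor of 1.5 were chosen because they happen to reproduce the logged range for a base size of 416.

```python
import math


def iterations_per_epoch(num_images, batch_size):
    """Number of batches needed to see every image once."""
    return math.ceil(num_images / batch_size)


def multi_scale_range(img_size, stride=32, factor=1.5):
    """Smallest and largest training sizes, as multiples of `stride`.

    Assumed formula: scale the grid size (img_size / stride) down and up
    by `factor`, round to the nearest integer, and convert back to pixels.
    """
    lo = round(img_size / stride / factor) * stride
    hi = round(img_size / stride * factor) * stride
    return lo, hi


print(iterations_per_epoch(117263, 16))   # 117263 training images, batch 16

lo, hi = multi_scale_range(416)
print(lo, hi)

# Pixel count of the largest scale relative to the 416 base size.
print(round((hi / 416) ** 2, 2))
```

Under these assumptions the largest scale has roughly 2.4 times as many pixels as a 416x416 image, so activation memory can vary considerably from batch to batch when `--multi` is enabled.
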
glenn-jocher commented 4 years ago

@erdongchendou you don't have apex installed. You need to install NVIDIA Apex; the training script will then detect and use it automatically.

You can tell whether it is operating correctly because the startup message will say 'Using CUDA Apex' instead of 'Using CUDA'.
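
A quick way to check whether the `apex` package is importable before starting a long run is sketched below. This only tests that the module can be found; how train.py itself detects Apex is not shown in this thread, and `has_module` is a hypothetical helper name. The two printed messages are the ones quoted in the comment above.

```python
import importlib.util


def has_module(name):
    """Return True if the top-level module `name` can be found for import."""
    return importlib.util.find_spec(name) is not None


# Assumption: mixed precision is used only when apex is importable.
mixed_precision = has_module("apex")
print("Using CUDA Apex" if mixed_precision else "Using CUDA")
```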

glenn-jocher commented 4 years ago

https://github.com/NVIDIA/apex

erdongchendou commented 4 years ago

Thank you very much. After installing apex as instructed in requirements.txt, my training is running well now.

glenn-jocher commented 4 years ago

Great!

erdongchendou commented 4 years ago

After 169.115 hours of training on a GeForce RTX 2080 Ti, I got almost the same mAP as reported in the README.md. My mAP is as follows:

Speed: 9.8/2.1/11.9 ms inference/NMS/total per 608x608 image at batch-size 32

COCO mAP with pycocotools...
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.415
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.614
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.443
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.246
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.455
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.522
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.340
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.552
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.602
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.439
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.639
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.730

Thank you very much.

glenn-jocher commented 4 years ago

@erdongchendou great! In the last few days we also updated NMS to a new SOTA variant called Merge NMS, which should bump the mAP of your trained model a bit as well. If you git pull and re-run the test you should see the higher mAP :)
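
The thread does not describe how Merge NMS works. One common interpretation, sketched here in plain Python as an assumption rather than the repository's implementation, is that each kept box is replaced by the confidence-weighted average of all boxes that overlap it above the IoU threshold. The function names, the `(x1, y1, x2, y2, score)` tuple layout, and the 0.5 threshold are all hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2, ...)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def merge_nms(dets, iou_thres=0.5):
    """Greedy NMS that merges overlapping boxes instead of discarding them.

    dets: list of (x1, y1, x2, y2, score) tuples for a single class.
    """
    remaining = sorted(dets, key=lambda d: d[4], reverse=True)
    kept = []
    while remaining:
        best, rest = remaining[0], remaining[1:]
        # The top-scoring box plus every box that overlaps it enough.
        cluster = [best] + [d for d in rest if iou(best, d) > iou_thres]
        total = sum(d[4] for d in cluster)
        if total > 0:
            coords = tuple(sum(d[i] * d[4] for d in cluster) / total for i in range(4))
        else:
            coords = tuple(best[:4])
        kept.append(coords + (best[4],))
        remaining = [d for d in rest if iou(best, d) <= iou_thres]
    return kept


boxes = [(0, 0, 10, 10, 0.9), (1, 1, 11, 11, 0.6), (50, 50, 60, 60, 0.8)]
for box in merge_nms(boxes):
    print(box)
```

In this sketch the first two boxes overlap and are merged into one box that keeps the higher score, while the distant third box passes through unchanged.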

erdongchendou commented 4 years ago

After pulling the newest code and testing my trained model again, the mAP is as follows; it did increase a bit.

COCO mAP with pycocotools...
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.417
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.614
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.446
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.248
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.458
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.524
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.342
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.557
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.608
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.444
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.649
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.738
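
To make the change easier to read, the two pycocotools summaries in this thread can be compared metric by metric. The values below are copied from the two result tables; the short labels are shorthand introduced only for this sketch.

```python
labels = [
    "AP@[.50:.95]", "AP@.50", "AP@.75", "AP small", "AP medium", "AP large",
    "AR maxDets=1", "AR maxDets=10", "AR maxDets=100",
    "AR small", "AR medium", "AR large",
]
# First run (before git pull) and second run (after git pull).
before = [0.415, 0.614, 0.443, 0.246, 0.455, 0.522,
          0.340, 0.552, 0.602, 0.439, 0.639, 0.730]
after = [0.417, 0.614, 0.446, 0.248, 0.458, 0.524,
         0.342, 0.557, 0.608, 0.444, 0.649, 0.738]

# Round to three decimals, the precision pycocotools reports.
deltas = [round(a - b, 3) for a, b in zip(after, before)]

for label, delta in zip(labels, deltas):
    print(f"{label:15s} {delta:+.3f}")
```

Every metric either stays the same or improves, with the largest gains in the recall figures.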

By the way, if I train on my own data, do you suggest training from scratch or starting from a pretrained model? If the latter, which pretrained model do you suggest: the one pretrained on ImageNet classification data or the one pretrained on COCO detection data?

'yolov3-spp.weights': '16lYS4bcIdM2HdmyJBVDOvt3Trx6N3W2R',
'yolov3.weights': '1uTlyDWlnaqXcsKOktP5aH_zRDbfcDp-y',
'yolov3-tiny.weights': '1CCF-iNIIkYesIDzaPvdwlcf7H9zSsKZQ',
'yolov3-spp.pt': '1f6Ovy3BSq2wYq4UfvFUpxJFNDFfrIDcR',
'yolov3.pt': '1SHNFyoe5Ni8DajDNEqgB2oVKBb_NoEad',
'yolov3-tiny.pt': '10m_3MlpQwRtZetQxtksm9jqHrPTHZ6vo',
'darknet53.conv.74': '1WUVBid-XuoUBmvzBVUCBl_ELrzqwA8dJ',
'yolov3-tiny.conv.15': '1Bw0kCpplxUqyRYAJr9RY9SGnOJbo9nEj',
'ultralytics49.pt': '158g62Vs14E3aj7oPVPuEnNZMKFNgGyNq',
'ultralytics68.pt': '1Jm8kqnMdMGUUxGo8zMFZMJ0eaPwLkxSG',
'yolov3-spp-ultralytics.pt': '1UcR-zVoMs7DH5dj3N1bswkiQTA4dmKF4'
glenn-jocher commented 4 years ago

@erdongchendou great! If you have a large dataset like COCO, train from scratch. If you have a smaller dataset, start from the pretrained yolov3-spp-ultralytics.pt; this is the best model.

erdongchendou commented 4 years ago

Got it. Thank you very much.

joel5638 commented 4 years ago

@glenn-jocher Hi Glenn, which COCO dataset did you train your weights on? I am looking for a weights file that was trained on COCO2014.

glenn-jocher commented 4 years ago

@joel5638 for this repo we are sticking with COCO2014, for easy comparisons with darknet-trained yolov3.

The default weights (yolov3-spp-ultralytics.pt) are trained on COCO2014. The weights download automatically (if they are not already present on your system) when running python3 test.py or python3 detect.py for example.

glenn-jocher commented 4 years ago

@joel5638 if you are unsure whether you have the latest weights, you can simply delete all of your local *.pt files and then run python3 detect.py to download the latest.
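
A cautious way to clear local *.pt files is sketched below. The helper name `remove_local_weights` and the `weights` directory are assumptions, not part of the repository. By default the function only lists what it would delete, since removing checkpoints you trained yourself is irreversible.

```python
from pathlib import Path


def remove_local_weights(directory, dry_run=True):
    """List the *.pt files directly inside `directory`; delete them unless dry_run."""
    matches = sorted(Path(directory).glob("*.pt"))
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches


if __name__ == "__main__":
    # "weights" is an assumed location; adjust it to where your *.pt files live.
    for path in remove_local_weights("weights"):
        print("would delete", path)
```

Calling `remove_local_weights("weights", dry_run=False)` performs the deletion, after which running python3 detect.py should fetch the default weights again as described above.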

joel5638 commented 4 years ago

@glenn-jocher oh okay sure Glenn. Thank you.

github-actions[bot] commented 4 years ago

This issue is stale because it has been open 30 days with no activity. Remove Stale label or comment or this will be closed in 5 days.

glenn-jocher commented 10 months ago

@joel5638 you're welcome! If you have any more questions, feel free to ask.