ultralytics / yolov3

YOLOv3 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

What metrics are used in the training? #330

Closed ardianumam closed 5 years ago

ardianumam commented 5 years ago

Hi, thanks for the awesome work! I've been training with the default settings and am now at epoch 241 with the result below (Fig. 1). I have some inquiries:

  1. Is the result you report here and the mAP result in Fig. 1 calculated with the same metric?
  2. Continuing (1), what is the metric for the mAP: (i) 0.5 or (ii) 0.5:0.05:0.95? I ask because for (ii), the original paper reports ~33%.
  3. Still continuing (1), what is the metric for R (recall)? Is it (i) general recall, i.e. recall over all object instances, or (ii) ARmax=1/10/100 as in COCO here?
  4. So far the recall is only ~54.7% (Fig. 1). Is it reasonable to expect a recall of 77.3% by the end of training (274 epochs)?

Many thanks.

Fig. 1:

glenn-jocher commented 5 years ago

> Is the result you report here and the mAP result in Fig. 1 calculated with the same metric?

Result reported in https://github.com/ultralytics/yolov3#map is from pycocotools mAP (the official mAP). You get this by running python3 test.py --save-json as in the examples shown in that section.

> Continuing (1), what is the metric for the mAP: (i) 0.5 or (ii) 0.5:0.05:0.95? I ask because for (ii), the original paper reports ~33%.

0.5
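To spell out the difference between the two metrics: mAP@0.5 uses the single IoU threshold 0.5, while COCO's primary metric averages AP over ten IoU thresholds from 0.5 to 0.95. A minimal sketch (not the repo's code):

```python
# COCO's primary metric (mAP@[.5:.95]) averages AP over 10 IoU thresholds;
# the mAP reported by this repo uses the single threshold 0.5.
iou_thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]  # 0.5 .. 0.95

def coco_map(ap_per_threshold):
    """mAP@[.5:.95] given one AP value per IoU threshold."""
    return sum(ap_per_threshold) / len(ap_per_threshold)
```

This is why the ~33% in the paper (metric ii) and the mAP@0.5 numbers here are not directly comparable.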

> Still continuing (1), what is the metric for R (recall)? Is it (i) general recall, i.e. recall over all object instances, or (ii) ARmax=1/10/100 as in COCO here?

Recall in this repo is simply TP / the number of ground-truth objects. Remember that pycocotools also reports its own separate statistics, which you can see using the --save-json flag with test.py.
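A minimal sketch of that recall definition (greedy IoU matching; `box_iou` and the 0.5 threshold are illustrative assumptions, not this repo's exact code):

```python
def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def simple_recall(pred_boxes, gt_boxes, iou_thres=0.5):
    """Recall = TP / number of ground-truth objects.
    Each ground-truth box may be matched by at most one prediction."""
    matched, tp = set(), 0
    for p in pred_boxes:
        best, best_iou = None, iou_thres
        for i, g in enumerate(gt_boxes):
            if i in matched:
                continue
            v = box_iou(p, g)
            if v >= best_iou:
                best, best_iou = i, v
        if best is not None:
            matched.add(best)
            tp += 1
    return tp / len(gt_boxes)
```

So this is metric (i) from the question, not COCO's ARmax variants.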

  1. So far, the recall is still only ~54.7% (Fig. 1), is it true to expect recall value of 77.3% in the end of epoch? (274 epochs)

Actually, we believe we misinterpreted the darknet training settings, so you should now only need 68 epochs to reach full training (1/4 of the previous number). The latest update implements this change. We are also working on GIoU tests, which we may commit in the next few days.
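For reference, the GIoU metric mentioned here can be sketched in plain Python (a minimal sketch of the formula from the GIoU paper, not this repo's implementation):

```python
def giou(a, b):
    """Generalized IoU of two boxes in (x1, y1, x2, y2) format.
    GIoU = IoU - (enclosing area - union) / enclosing area."""
    # intersection
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # union
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    iou = inter / union
    # smallest enclosing box
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    c_area = (cx2 - cx1) * (cy2 - cy1)
    return iou - (c_area - union) / c_area
```

Unlike plain IoU, GIoU is negative for disjoint boxes, so it still provides a gradient signal when predictions do not overlap the target.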


ardianumam commented 5 years ago
> Actually, we believe we misinterpreted the darknet training settings, so you should now only need 68 epochs to reach full training (1/4 of the previous number). The latest update implements this change. We are also working on GIoU tests, which we may commit in the next few days.

That's really good news. For the 68 epochs, does that use (i) a pre-trained model or (ii) from-scratch training? My screenshot result above is for (ii). Also, may I know what the misinterpreted training setting was? That info may help me in future similar projects.

For the GIoU, I'll be very happy to wait. If you don't mind, please note the update in the main README file so that we all know about it. Thanks a lot!

ardianumam commented 5 years ago

Hi @glenn-jocher ,

Here is my result of from-scratch training (no pre-training) up to the latest epoch (272) using this repo's code. It gets 40.9%, 68.6% and 4.17% for mAP, R and P, respectively, which is still quite far from the table reported here. Is this related to the misinterpretation you mentioned above, and should it be resolved if I train from scratch again using the latest commit?

Many thanks!

glenn-jocher commented 5 years ago

@ardianumam unfortunately we are also seeing subpar performance compared to darknet when trained from scratch on COCO. Our latest training ran to 68 epochs, using single-scale images, with --giou enabled. This produced 0.464 mAP@0.5 with pycocotools (using python3 test.py --save-json), so we are unfortunately still quite far from darknet performance when training from scratch. As mentioned before, we need some hyperparameter tuning; we also need to implement an ignore_thres like darknet, and to train with multi-scale. There may be imbalances between obj and no_obj as well. At this point we should try to duplicate darknet functionality as much as possible in order to approach their results.

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.248
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.464
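The multi-scale training mentioned above can be sketched as follows; darknet picks a new input resolution every few batches. The 320-608 range and stride of 32 are darknet defaults, assumed here for illustration, not necessarily this repo's eventual schedule:

```python
import random

def pick_img_size(min_size=320, max_size=608, stride=32):
    """Darknet-style multi-scale: periodically pick a new training
    resolution that is a multiple of the network stride."""
    n_choices = (max_size - min_size) // stride + 1
    return min_size + stride * random.randrange(n_choices)
```

Each candidate size must be a multiple of 32 because YOLOv3 downsamples the input by that factor before the final detection grid.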
ardianumam commented 5 years ago

@glenn-jocher: thanks for the quick reply! Yes, comparing the from-scratch training result to the original paper's accuracy, there may be these differences: (i) multi-scale training (as you said above), and (ii) the use of an ImageNet-pretrained model: 160 epochs at an initial 224x224 resolution followed by 10 fine-tuning epochs at 448x448, which is claimed to give a ~4% mAP increase.

So, for the table of results you report here, in the ultralytics/yolov3 column, how did you train to get those mAP values, including the yolov3-spp versions?

glenn-jocher commented 5 years ago

@ardianumam the results in https://github.com/ultralytics/yolov3#map are obtained using the official darknet weights. You can see this in the argparser printout, and you can reproduce the mAP yourself by running the same command. This means the inference portion of this repo is validated, and the testing portion produces a mAP calculation within about 1% of pycocotools; it is simply the training portion that is not yet up to par with darknet.

python3 test.py --save-json --img-size 416
Namespace(batch_size=32, cfg='cfg/yolov3-spp.cfg', conf_thres=0.001, data_cfg='data/coco.data', img_size=416, iou_thres=0.5, nms_thres=0.5, save_json=True, weights='weights/yolov3-spp.weights')
Using CUDA device0 _CudaDeviceProperties(name='Tesla V100-SXM2-16GB', total_memory=16130MB)
               Class    Images   Targets         P         R       mAP        F1
Calculating mAP: 100%|█████████████████████████████████████████| 157/157 [05:59<00:00,  1.71s/it]
                 all     5e+03  3.58e+04     0.109     0.773      0.57     0.186
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.335
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.565