@TommoAsh I think you may be using an old repo, as some of your hyperparameters are out of date (nms_thres, for example). This is what I see when I test on GCP (see https://docs.ultralytics.com/yolov5/environments/google_cloud_quickstart_tutorial/). The results look similar to yours, though our repo mAP is now well aligned with the pycocotools mAP. This must be an issue with -tiny inference; we'll look into it, thanks for the heads up!
python test.py --weights weights/yolov3-tiny.weights --cfg cfg/yolov3-tiny.cfg --save-json
Namespace(batch_size=32, cfg='cfg/yolov3-tiny.cfg', conf_thres=0.001, data_cfg='data/coco.data', img_size=416, iou_thres=0.5, nms_thres=0.5, save_json=True, weights='weights/yolov3-tiny.weights')
Using cuda _CudaDeviceProperties(name='Tesla V100-SXM2-16GB', major=7, minor=0, total_memory=16130MB, multi_processor_count=80)
Image Total P R mAP
Calculating mAP: 100%|█████████████████████████████████| 157/157 [11:00<00:00, 3.43s/it]
5000 5000 0.0291 0.437 0.178
...
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.091
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.178
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.086
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.001
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.040
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.257
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.114
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.183
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.201
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.015
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.154
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.424
Thanks for the response, Glenn.
It would certainly be appreciated if you could have a look at the tiny inference - it's a model size we're interested in for various reasons, not least because it's smaller, so you can iterate on it faster!
You're right, by the way - I had updated to a commit after your mAP changes on one server, but not on the one I ran this experiment on. It looks like our outputs pretty much agree anyway.
@TommoAsh it seems like the second yolov3-tiny output layer (there are only 2) may be underperforming: the pycocotools results show only 0.001 mAP@0.50:0.95 for small objects. Large objects produce 0.257, which would roughly correspond to the 0.33 mAP@0.5 you were looking for. If there is a problem in tiny testing then it should show up in detect as well, so I did comparisons of yolov3-tiny inference between this repository and darknet.
The anecdotal differences seem minimal in these 5 examples, and if anything show a slight improvement in this repo compared to darknet. I'm not sure what to say, really. The small objects seem well detected compared to darknet, e.g. the two people and the kite in kites.jpg.
One other possibility is that yolov3-tiny may need very different test hyperparameters than yolov3. conf_thres, nms_thres, etc. may need to be modified substantially to show improved mAP results (you can run a hyperparameter search yourself). mAP is well validated for yolov3.weights (0.58) and yolov3-spp.weights (0.61), as you can see in https://github.com/ultralytics/yolov3#map.
| ultralytics/yolov3 yolov3-tiny.weights | darknet yolov3-tiny.weights |
| --- | --- |
Good observation - the mAP definitely drops off faster for the smaller detections than it does for the full-sized model. It's a shame we don't have a full readout of the mAP the Darknet authors got at different scales to compare against.
I agree that on those 5 images, if anything, we're doing better than Darknet!
I've set off some parameter sweeps over nms, conf and iou and will report my results back here when they are in. Hopefully it's just a case of using different recommended values for tiny when measuring mAP, and the performance is actually fine.
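In case it's useful, here's a minimal sketch of how a sweep like that could be scripted against test.py. The --conf-thres and --nms-thres flag names are assumed from the Namespace printout above; check the argparse definitions in test.py before relying on them.

```python
# Hypothetical grid sweep over confidence / NMS thresholds for yolov3-tiny mAP.
# Flag names are assumed from the Namespace printout; verify them in test.py.
import itertools
import subprocess

conf_values = [0.001, 0.01, 0.1]
nms_values = [0.3, 0.4, 0.5, 0.6]

for conf, nms in itertools.product(conf_values, nms_values):
    print(f'Running test.py with conf_thres={conf}, nms_thres={nms}')
    subprocess.run([
        'python3', 'test.py',
        '--cfg', 'cfg/yolov3-tiny.cfg',
        '--weights', 'weights/yolov3-tiny.weights',
        '--conf-thres', str(conf),
        '--nms-thres', str(nms),
    ], check=True)
```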
@TommoAsh the irony is that mAP is a terrible metric for measuring real-world performance. Obtaining the published yolov3 mAP requires lowering conf_thres down to a ridiculously low 0.001. This reduces precision to about 0.1 (as you can see in the readme mAP section), which means that for every TP there are roughly 9 FPs - yes, about 9 false positives for every good match gives you the best mAP. It's beyond me why everyone fixates on this metric when it's optimized to produce garbage results in real-world use. I blame the original object detection challenge organizers for creating this mess, though now the blame lies with all of us who continue to use it as well.
detect.py uses a much more usable conf_thres = 0.50, which is what you see in the pictures above. Setting conf_thres = 0.001 to get the best mAP produces the actual results below (yes, these same images produce a higher mAP than the images in my previous post).
In any case, I don't see any significant deviation between the two implementations.
| ultralytics/yolov3 yolov3-tiny.weights | darknet yolov3-tiny.weights |
| --- | --- |
| --conf-thres 0.001 | --conf-thres 0.001 |
Absolutely agreed - I noticed you were setting the conf threshold really low to get those mAP scores, which, as you say, gives terrible results in real-world applications. Are you aware of any alternative metrics people are using?
I wonder if some kind of error rate might be more sensible - for example, for a given IoU threshold: (number of items not detected + number of false detections) / (actual number of items in the scene), so 0% would mean perfect detection and anything above that represents an increasing number of errors. We use something similar in the field of speech recognition (word error rate).
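For concreteness, a rough sketch of that proposed error rate in Python, assuming the matching between predictions and ground truth has already been done at the chosen IoU threshold (the function and variable names here are just illustrative):

```python
# Sketch of the proposed detection error rate, analogous to word error rate.
# Assumes predictions have already been matched to ground truth at a given IoU
# threshold, so we only need the counts.
def detection_error_rate(num_ground_truth, num_matched, num_predictions):
    misses = num_ground_truth - num_matched            # items not detected
    false_detections = num_predictions - num_matched   # spurious detections
    return (misses + false_detections) / num_ground_truth

# Example: 100 objects in the scene, 80 matched, 95 predictions in total
# -> (20 misses + 15 false detections) / 100 = 0.35, i.e. a 35% error rate
print(detection_error_rate(100, 80, 95))
```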
But anyway, I'll keep on with the parameter sweep to see if there is some magic setup that recreates the numbers quoted.
The F1 score is probably what you want. It's been used by other groups as a performance metric, for example in the xView competition, and does better at predicting real-world performance. It penalizes large differences between P and R, as occurs when we set conf_thres to 0.001; instead it is maximized when P and R are most similar. I've added F1 as an output in the latest commit, and updated the plotting results to display it now.
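As a quick illustration of why F1 rewards balanced P and R (the numbers below are only examples, not outputs of this repo):

```python
# F1 is the harmonic mean of precision and recall, so it is dragged down when
# the two are far apart (e.g. the very low P / very high R regime at conf_thres = 0.001).
def f1_score(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(f1_score(0.12, 0.81))  # ~0.21: high recall cannot compensate for very low precision
print(f1_score(0.45, 0.50))  # ~0.47: balanced P and R score much higher
```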
Here you can see the train and test losses in the last column of the new results.txt plots, along with an F1 metric. These results are from our new data/coco_100img.data example, which trains and tests on the first 100 images of the coco trainval dataset.
from utils.utils import *; plot_results()
I've been doing some parameter sweeps, and the best mAP I can generate with the tiny model is 0.180 - mildly better than the default settings but still some way off the YOLO author's claim. (The only thing that seemed to help was reducing the NMS threshold a little.)
I'll have a look at F1 - it's a sensible metric we've used in other domains and is probably closer to what people care about, as you say. And tuning for F1 is more likely than mAP to lead to parameters that are useful for real-world applications!
@TommoAsh sounds good. I'm as confused as you are about the tiny mAP; the only explanation I can think of is that the darknet authors reported tiny mAP for large objects only. The pycocotools mAP (a few messages back) is 0.243 mAP@0.50:0.95 for large objects, which I could see turning into roughly 0.35 mAP@0.50.
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.243
The latest test.py updates provide a more detailed output, including F1, for all categories:
python3 test.py --save-json --img-size 608 --batch-size 16
Namespace(batch_size=16, cfg='cfg/yolov3-spp.cfg', conf_thres=0.001, data_cfg='data/coco.data', img_size=608, iou_thres=0.5, nms_thres=0.5, save_json=True, weights='weights/yolov3-spp.weights')
Using CUDA device0 _CudaDeviceProperties(name='Tesla V100-SXM2-16GB', total_memory=16130MB)
Class Images Targets P R mAP F1
Computing mAP: 100%|█████████████████████████████████████| 313/313 [07:47<00:00, 1.24s/it]
all 5e+03 3.58e+04 0.12 0.81 0.611 0.203
person 5e+03 1.09e+04 0.165 0.9 0.767 0.278
bicycle 5e+03 316 0.0741 0.832 0.584 0.136
car 5e+03 1.67e+03 0.0814 0.897 0.699 0.149
motorcycle 5e+03 391 0.166 0.852 0.722 0.278
airplane 5e+03 131 0.192 0.931 0.88 0.319
bus 5e+03 261 0.208 0.87 0.823 0.336
train 5e+03 212 0.173 0.892 0.82 0.289
truck 5e+03 352 0.105 0.665 0.523 0.181
boat 5e+03 475 0.096 0.792 0.521 0.171
traffic light 5e+03 516 0.0521 0.8 0.557 0.0978
fire hydrant 5e+03 83 0.184 0.928 0.884 0.307
stop sign 5e+03 84 0.0931 0.893 0.826 0.169
parking meter 5e+03 59 0.0727 0.695 0.619 0.132
bench 5e+03 473 0.0365 0.702 0.391 0.0693
bird 5e+03 469 0.0875 0.689 0.527 0.155
cat 5e+03 195 0.301 0.872 0.78 0.448
dog 5e+03 223 0.256 0.879 0.826 0.397
horse 5e+03 305 0.175 0.931 0.86 0.294
sheep 5e+03 321 0.249 0.841 0.728 0.384
cow 5e+03 384 0.186 0.831 0.731 0.305
elephant 5e+03 284 0.253 0.972 0.922 0.401
bear 5e+03 53 0.4 0.906 0.861 0.555
zebra 5e+03 277 0.251 0.946 0.875 0.397
giraffe 5e+03 170 0.24 0.929 0.894 0.382
backpack 5e+03 384 0.0512 0.755 0.428 0.096
umbrella 5e+03 392 0.105 0.875 0.659 0.188
handbag 5e+03 483 0.0294 0.737 0.322 0.0565
tie 5e+03 297 0.0681 0.848 0.606 0.126
suitcase 5e+03 310 0.154 0.913 0.696 0.263
frisbee 5e+03 109 0.189 0.908 0.862 0.313
skis 5e+03 282 0.0667 0.762 0.451 0.123
snowboard 5e+03 92 0.104 0.804 0.555 0.185
sports ball 5e+03 236 0.0822 0.763 0.673 0.148
kite 5e+03 399 0.181 0.83 0.608 0.297
baseball bat 5e+03 125 0.083 0.736 0.559 0.149
baseball glove 5e+03 139 0.0754 0.806 0.649 0.138
skateboard 5e+03 218 0.118 0.867 0.785 0.208
surfboard 5e+03 266 0.0927 0.812 0.661 0.166
tennis racket 5e+03 183 0.141 0.869 0.753 0.243
bottle 5e+03 966 0.0767 0.823 0.534 0.14
wine glass 5e+03 366 0.113 0.779 0.585 0.198
cup 5e+03 897 0.0928 0.837 0.599 0.167
fork 5e+03 234 0.0659 0.731 0.5 0.121
knife 5e+03 291 0.0492 0.684 0.358 0.0919
spoon 5e+03 253 0.0426 0.755 0.324 0.0806
bowl 5e+03 620 0.1 0.894 0.573 0.181
banana 5e+03 371 0.0876 0.695 0.336 0.156
apple 5e+03 158 0.0521 0.734 0.238 0.0973
sandwich 5e+03 160 0.12 0.781 0.52 0.208
orange 5e+03 189 0.0601 0.667 0.286 0.11
broccoli 5e+03 332 0.1 0.783 0.387 0.178
carrot 5e+03 346 0.0633 0.673 0.298 0.116
hot dog 5e+03 164 0.145 0.598 0.458 0.234
pizza 5e+03 224 0.111 0.804 0.659 0.195
donut 5e+03 237 0.148 0.802 0.637 0.25
cake 5e+03 241 0.105 0.734 0.552 0.184
chair 5e+03 1.62e+03 0.0703 0.757 0.473 0.129
couch 5e+03 236 0.129 0.788 0.611 0.221
potted plant 5e+03 431 0.0571 0.824 0.49 0.107
bed 5e+03 195 0.157 0.836 0.717 0.265
dining table 5e+03 634 0.0659 0.828 0.511 0.122
toilet 5e+03 179 0.24 0.944 0.836 0.383
tv 5e+03 257 0.13 0.946 0.825 0.229
laptop 5e+03 237 0.19 0.886 0.774 0.313
mouse 5e+03 95 0.0893 0.895 0.742 0.162
remote 5e+03 241 0.0687 0.834 0.582 0.127
keyboard 5e+03 117 0.0879 0.906 0.755 0.16
cell phone 5e+03 291 0.0425 0.742 0.475 0.0803
microwave 5e+03 88 0.226 0.92 0.823 0.362
oven 5e+03 142 0.0816 0.845 0.561 0.149
toaster 5e+03 11 0.0899 0.727 0.412 0.16
sink 5e+03 211 0.0732 0.853 0.616 0.135
refrigerator 5e+03 107 0.0932 0.935 0.786 0.169
book 5e+03 1.08e+03 0.0593 0.654 0.2 0.109
clock 5e+03 292 0.0817 0.877 0.752 0.149
vase 5e+03 353 0.0988 0.841 0.589 0.177
scissors 5e+03 56 0.0552 0.732 0.438 0.103
teddy bear 5e+03 245 0.156 0.853 0.671 0.264
hair drier 5e+03 11 0.0488 0.182 0.152 0.0769
toothbrush 5e+03 77 0.047 0.727 0.334 0.0883
loading annotations into memory...
Done (t=5.42s)
creating index...
index created!
Loading and preparing results...
DONE (t=2.93s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=43.42s).
Accumulating evaluation results...
DONE (t=5.81s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.366
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.607
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.386
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.207
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.391
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.485
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.296
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.464
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.494
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.331
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.517
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.618
Ooh - that new verbose output including F1 is really nice - thank you for that! I'll definitely be grabbing and using it.
I would be surprised if they were reporting the values for just large detections - it's in a table alongside values that make sense on this page: https://pjreddie.com/darknet/yolo/. If I get time I'll reach out to him and ask if there's anything special there that we aren't considering.
The results in the paper are reported on the 2014 test-dev set, while we tested on the 5k validation set. Why don't we use test-dev?
@jzzai can you supply a *.txt file for the 2014 test-dev set of images?
@TommoAsh I think the mystery of the low yolov3-tiny mAP may have been solved. You may need to set the cfg anchors to 1, 2, 3 instead of 0, 1, 2 on the last layer due to a mistake during darknet training. See https://github.com/ultralytics/yolov3/issues/256
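For anyone wanting to try this, here's a minimal sketch of how the cfg change could be applied programmatically. It assumes the anchor indices of the last [yolo] layer live on a "mask = ..." line, as in the standard darknet cfg format; check issue #256 and cfg/yolov3-tiny.cfg before relying on this.

```python
# Hypothetical one-off patch: change the anchor indices of the last [yolo] layer
# in cfg/yolov3-tiny.cfg from 0,1,2 to 1,2,3 to match the pretrained darknet weights.
# Assumes the indices live on a "mask = ..." line; verify against issue #256.
from pathlib import Path

cfg = Path('cfg/yolov3-tiny.cfg')
lines = cfg.read_text().splitlines()

mask_idx = [i for i, line in enumerate(lines) if line.strip().startswith('mask')]
if mask_idx and lines[mask_idx[-1]].replace(' ', '').endswith('=0,1,2'):
    lines[mask_idx[-1]] = 'mask = 1,2,3'
    cfg.write_text('\n'.join(lines) + '\n')
```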
Ah - very nice. So from reading that ticket, it looks like models that are trained using this code should be unaffected, but if we try to use the pre-trained models we will need to change the cfg anchors, is that right? If that is the case, I think we should be good to close this ticket, finally!
@TommoAsh yes, exactly! Models trained here should be unaffected. Great, I will close!
Describe the bug I have tried using test.py to get mAP scores for the yolov3-tiny.weights model, and the score I get is 0.177 - much lower than the 0.331 claimed by the original author of YOLO for that model. Note I have not been able to try the full-sized (yolov3.weights) model, as I haven't yet been able to access a GPU big enough to cope with it, so I can't say whether there is the same issue with that model or not.
To Reproduce I used commit d526ce0d118fcbbfae3e73d9627183b95ceb2b26 (after the mAP updates were merged). I then followed the instructions in the README, ending with this:
Expected behavior I was expecting the final mAP to be 0.331 (or close to it). I would ideally like to know whether this is an error in the mAP calculation, an error on my part, or an issue with the tiny model.
Desktop (please complete the following information): Base Docker image nvidia/cuda:10.1-cudnn7-runtime-ubuntu16.04, with Python 3.7.1 installed.
Finally Thanks for all the good work - it is much appreciated!