@TommoAsh I think you may be using an old repo, as some of your hyperparameters are out of date (nms_thres, for example). This is what I see when I test on GCP (see https://docs.ultralytics.com/yolov5/environments/google_cloud_quickstart_tutorial/). The results look similar to yours, though our repo mAP is now well aligned with the pycocotools mAP. This must be an issue with -tiny inference; we'll look into it, thanks for the heads up!
python test.py --weights weights/yolov3-tiny.weights --cfg cfg/yolov3-tiny.cfg --save-json
Namespace(batch_size=32, cfg='cfg/yolov3-tiny.cfg', conf_thres=0.001, data_cfg='data/coco.data', img_size=416, iou_thres=0.5, nms_thres=0.5, save_json=True, weights='weights/yolov3-tiny.weights')
Using cuda _CudaDeviceProperties(name='Tesla V100-SXM2-16GB', major=7, minor=0, total_memory=16130MB, multi_processor_count=80)
Image Total P R mAP
Calculating mAP: 100%|█████████████████████████████████| 157/157 [11:00<00:00, 3.43s/it]
5000 5000 0.0291 0.437 0.178
...
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.091
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.178
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.086
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.001
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.040
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.257
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.114
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.183
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.201
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.015
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.154
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.424
Thanks for the response, Glenn.
It would certainly be appreciated if you could have a look at the tiny inference - it's a model size we're interested in for various reasons, not least because it's smaller, so you can iterate on it faster!
You're right, by the way - I had updated to a commit after your mAP changes on one server, but not on the one I ran this experiment on. It looks like our outputs pretty much agree anyway.
@TommoAsh it seems like the second yolov3-tiny output layer (there are only 2) may be underperforming: the pycocotools results show only 0.001 mAP@0.50:0.95 for small objects. Large objects produce 0.257, which would roughly correspond to the 0.33 mAP@0.5 you were looking for. If there is a problem in tiny testing then it should show up in detect as well, so I did comparisons of yolov3-tiny inference between this repository and darknet.
The anecdotal differences seem minimal in these 5 examples, and if anything show a slight improvement in this repo compared to darknet. I'm not sure what to say, really. The small objects seem well detected compared to darknet, e.g. the two people and the kite in kites.jpg.
One other possibility is that yolov3-tiny may need very different test hyperparameters than yolov3. conf_thres, nms_thres, etc. may need to be modified substantially to show improved mAP results (you can run a hyperparameter search yourself). mAP is well validated for yolov3.weights (0.58) and yolov3-spp.weights (0.61), as you can see in https://github.com/ultralytics/yolov3#map.
| ultralytics/yolov3 yolov3-tiny.weights | darknet yolov3-tiny.weights |
| --- | --- |
Good observation - the mAP definitely drops off faster for the smaller detections than it does for the full-sized model. It's a shame we don't have a full readout of the mAP the Darknet authors got at different scales to compare against.
I agree that on those 5 images, if anything, we're doing better than Darknet!
I've set off some parameter sweeps over nms, conf and iou and will report my results back here when they are in. Hopefully it's just a case of using different recommended values for tiny when measuring mAP, and the performance is actually fine.
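In case it's useful, here's a minimal sketch of how a sweep like that could be scripted against test.py. The --conf-thres and --nms-thres flag names are assumed from the Namespace printout above; check the argparse definitions in test.py before relying on them.

```python
# Hypothetical grid sweep over confidence / NMS thresholds for yolov3-tiny mAP.
# Flag names are assumed from the Namespace printout; verify them in test.py.
import itertools
import subprocess

conf_values = [0.001, 0.01, 0.1]
nms_values = [0.3, 0.4, 0.5, 0.6]

for conf, nms in itertools.product(conf_values, nms_values):
    print(f'Running test.py with conf_thres={conf}, nms_thres={nms}')
    subprocess.run([
        'python3', 'test.py',
        '--cfg', 'cfg/yolov3-tiny.cfg',
        '--weights', 'weights/yolov3-tiny.weights',
        '--conf-thres', str(conf),
        '--nms-thres', str(nms),
    ], check=True)
```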
@TommoAsh the irony is that mAP is a terrible metric for measuring real-world performance. Obtaining the published yolov3 mAP requires lowering conf_thres down to a ridiculously low 0.001. This reduces precision to about 0.1 (as you can see in the readme mAP section), which means that for every TP there are roughly 9 FPs - yes, about 9 false positives for every good match gives you the best mAP. It's beyond me why everyone fixates on this metric when it's optimized to produce garbage results in real-world use. I blame the original object detection challenge organizers for creating this mess, though now the blame lies with all of us who continue to use it as well.
detect.py uses a much more usable conf_thres = 0.50, which is what you see in the pictures above. Setting conf_thres = 0.001 to get the best mAP produces the actual results below (yes, these same images produce a higher mAP than the images in my previous post).
In any case, I don't see any significant deviation between the two implementations.
| ultralytics/yolov3 yolov3-tiny.weights | darknet yolov3-tiny.weights |
| --- | --- |
| --conf-thres 0.001 | --conf-thres 0.001 |
Absolutely agreed - I noticed you were setting the conf threshold really low to get those mAP scores, which, as you say, gives terrible results in real-world applications. Are you aware of any alternative metrics people are using?
I wonder if some kind of error rate might be more sensible - for example, for a given IoU threshold: (number of items not detected + number of false detections) / (actual number of items in the scene), so 0% would mean perfect detection and anything above that represents an increasing number of errors. We use something similar in the field of speech recognition (word error rate).
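For concreteness, a rough sketch of that proposed error rate in Python, assuming the matching between predictions and ground truth has already been done at the chosen IoU threshold (the function and variable names here are just illustrative):

```python
# Sketch of the proposed detection error rate, analogous to word error rate.
# Assumes predictions have already been matched to ground truth at a given IoU
# threshold, so we only need the counts.
def detection_error_rate(num_ground_truth, num_matched, num_predictions):
    misses = num_ground_truth - num_matched            # items not detected
    false_detections = num_predictions - num_matched   # spurious detections
    return (misses + false_detections) / num_ground_truth

# Example: 100 objects in the scene, 80 matched, 95 predictions in total
# -> (20 misses + 15 false detections) / 100 = 0.35, i.e. a 35% error rate
print(detection_error_rate(100, 80, 95))
```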
But anyway, I'll keep on with the parameter sweep to see if there is some magic setup that recreates the numbers quoted.
The F1 score is probably what you want. It's been used by other groups as a performance metric, for example in the xView competition, and does better at predicting real-world performance. It penalizes large differences between P and R, as occurs when we set conf_thres to 0.001; instead it is maximized when P and R are most similar. I've added F1 as an output in the latest commit, and updated the plotting results to display it now.
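As a quick illustration of why F1 rewards balanced P and R (the numbers below are only examples, not outputs of this repo):

```python
# F1 is the harmonic mean of precision and recall, so it is dragged down when
# the two are far apart (e.g. the very low P / very high R regime at conf_thres = 0.001).
def f1_score(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(f1_score(0.12, 0.81))  # ~0.21: high recall cannot compensate for very low precision
print(f1_score(0.45, 0.50))  # ~0.47: balanced P and R score much higher
```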
Here you can see the train and test losses in the last column of the new results.txt plots, along with an F1 metric. These results are from our new data/coco_100img.data example, which trains and tests on the first 100 images of the coco trainval dataset.
from utils.utils import *; plot_results()
I've been doing some parameter sweeps, and the best mAP I can generate with the tiny model is 0.180 - mildly better than the default settings but still some way off the YOLO author's claim. (The only thing that seemed to help was reducing the NMS threshold a little.)
I'll have a look at F1 - it's a sensible metric we've used in other domains and is probably closer to what people care about, as you say. And tuning for F1 is more likely than mAP to lead to parameters that are useful for real-world applications!
@TommoAsh sounds good. I'm as confused as you are about the tiny mAP; the only explanation I can think of is that the darknet authors reported tiny mAP for large objects only. The pycocotools mAP (a few messages back) is 0.243 mAP@0.50:0.95 for large objects, which I could see turning into roughly 0.35 mAP@0.50.
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.243
The latest test.py updates provide a more detailed output, including F1, for all categories:
python3 test.py --save-json --img-size 608 --batch-size 16
Namespace(batch_size=16, cfg='cfg/yolov3-spp.cfg', conf_thres=0.001, data_cfg='data/coco.data', img_size=608, iou_thres=0.5, nms_thres=0.5, save_json=True, weights='weights/yolov3-spp.weights')
Using CUDA device0 _CudaDeviceProperties(name='Tesla V100-SXM2-16GB', total_memory=16130MB)
Class Images Targets P R mAP F1
Computing mAP: 100%|█████████████████████████████████████| 313/313 [07:47<00:00, 1.24s/it]
all 5e+03 3.58e+04 0.12 0.81 0.611 0.203
person 5e+03 1.09e+04 0.165 0.9 0.767 0.278
bicycle 5e+03 316 0.0741 0.832 0.584 0.136
car 5e+03 1.67e+03 0.0814 0.897 0.699 0.149
motorcycle 5e+03 391 0.166 0.852 0.722 0.278
airplane 5e+03 131 0.192 0.931 0.88 0.319
bus 5e+03 261 0.208 0.87 0.823 0.336
train 5e+03 212 0.173 0.892 0.82 0.289
truck 5e+03 352 0.105 0.665 0.523 0.181
boat 5e+03 475 0.096 0.792 0.521 0.171
traffic light 5e+03 516 0.0521 0.8 0.557 0.0978
fire hydrant 5e+03 83 0.184 0.928 0.884 0.307
stop sign 5e+03 84 0.0931 0.893 0.826 0.169
parking meter 5e+03 59 0.0727 0.695 0.619 0.132
bench 5e+03 473 0.0365 0.702 0.391 0.0693
bird 5e+03 469 0.0875 0.689 0.527 0.155
cat 5e+03 195 0.301 0.872 0.78 0.448
dog 5e+03 223 0.256 0.879 0.826 0.397
horse 5e+03 305 0.175 0.931 0.86 0.294
sheep 5e+03 321 0.249 0.841 0.728 0.384
cow 5e+03 384 0.186 0.831 0.731 0.305
elephant 5e+03 284 0.253 0.972 0.922 0.401
bear 5e+03 53 0.4 0.906 0.861 0.555
zebra 5e+03 277 0.251 0.946 0.875 0.397
giraffe 5e+03 170 0.24 0.929 0.894 0.382
backpack 5e+03 384 0.0512 0.755 0.428 0.096
umbrella 5e+03 392 0.105 0.875 0.659 0.188
handbag 5e+03 483 0.0294 0.737 0.322 0.0565
tie 5e+03 297 0.0681 0.848 0.606 0.126
suitcase 5e+03 310 0.154 0.913 0.696 0.263
frisbee 5e+03 109 0.189 0.908 0.862 0.313
skis 5e+03 282 0.0667 0.762 0.451 0.123
snowboard 5e+03 92 0.104 0.804 0.555 0.185
sports ball 5e+03 236 0.0822 0.763 0.673 0.148
kite 5e+03 399 0.181 0.83 0.608 0.297
baseball bat 5e+03 125 0.083 0.736 0.559 0.149
baseball glove 5e+03 139 0.0754 0.806 0.649 0.138
skateboard 5e+03 218 0.118 0.867 0.785 0.208
surfboard 5e+03 266 0.0927 0.812 0.661 0.166
tennis racket 5e+03 183 0.141 0.869 0.753 0.243
bottle 5e+03 966 0.0767 0.823 0.534 0.14
wine glass 5e+03 366 0.113 0.779 0.585 0.198
cup 5e+03 897 0.0928 0.837 0.599 0.167
fork 5e+03 234 0.0659 0.731 0.5 0.121
knife 5e+03 291 0.0492 0.684 0.358 0.0919
spoon 5e+03 253 0.0426 0.755 0.324 0.0806
bowl 5e+03 620 0.1 0.894 0.573 0.181
banana 5e+03 371 0.0876 0.695 0.336 0.156
apple 5e+03 158 0.0521 0.734 0.238 0.0973
sandwich 5e+03 160 0.12 0.781 0.52 0.208
orange 5e+03 189 0.0601 0.667 0.286 0.11
broccoli 5e+03 332 0.1 0.783 0.387 0.178
carrot 5e+03 346 0.0633 0.673 0.298 0.116
hot dog 5e+03 164 0.145 0.598 0.458 0.234
pizza 5e+03 224 0.111 0.804 0.659 0.195
donut 5e+03 237 0.148 0.802 0.637 0.25
cake 5e+03 241 0.105 0.734 0.552 0.184
chair 5e+03 1.62e+03 0.0703 0.757 0.473 0.129
couch 5e+03 236 0.129 0.788 0.611 0.221
potted plant 5e+03 431 0.0571 0.824 0.49 0.107
bed 5e+03 195 0.157 0.836 0.717 0.265
dining table 5e+03 634 0.0659 0.828 0.511 0.122
toilet 5e+03 179 0.24 0.944 0.836 0.383
tv 5e+03 257 0.13 0.946 0.825 0.229
laptop 5e+03 237 0.19 0.886 0.774 0.313
mouse 5e+03 95 0.0893 0.895 0.742 0.162
remote 5e+03 241 0.0687 0.834 0.582 0.127
keyboard 5e+03 117 0.0879 0.906 0.755 0.16
cell phone 5e+03 291 0.0425 0.742 0.475 0.0803
microwave 5e+03 88 0.226 0.92 0.823 0.362
oven 5e+03 142 0.0816 0.845 0.561 0.149
toaster 5e+03 11 0.0899 0.727 0.412 0.16
sink 5e+03 211 0.0732 0.853 0.616 0.135
refrigerator 5e+03 107 0.0932 0.935 0.786 0.169
book 5e+03 1.08e+03 0.0593 0.654 0.2 0.109
clock 5e+03 292 0.0817 0.877 0.752 0.149
vase 5e+03 353 0.0988 0.841 0.589 0.177
scissors 5e+03 56 0.0552 0.732 0.438 0.103
teddy bear 5e+03 245 0.156 0.853 0.671 0.264
hair drier 5e+03 11 0.0488 0.182 0.152 0.0769
toothbrush 5e+03 77 0.047 0.727 0.334 0.0883
loading annotations into memory...
Done (t=5.42s)
creating index...
index created!
Loading and preparing results...
DONE (t=2.93s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=43.42s).
Accumulating evaluation results...
DONE (t=5.81s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.366
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.607
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.386
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.207
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.391
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.485
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.296
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.464
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.494
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.331
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.517
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.618
Ooh - that new verbose output including F1 is really nice - thank you for that! I'll definitely be grabbing and using it.
I would be surprised if they were reporting the values for just large detections - it's in a table alongside values that make sense on this page: https://pjreddie.com/darknet/yolo/. If I get time I'll reach out to him and ask if there's anything special there that we aren't considering.
The results in the paper are reported on the 2014 test-dev set, while we tested on the 5k validation set. Why don't we use test-dev?
@jzzai can you supply a *.txt file for the 2014 test-dev set of images?
@TommoAsh I think the mystery of the low yolov3-tiny mAP may have been solved. You may need to set the cfg anchors to 1, 2, 3 instead of 0, 1, 2 on the last layer due to a mistake during darknet training. See https://github.com/ultralytics/yolov3/issues/256
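For anyone wanting to try this, here's a minimal sketch of how the cfg change could be applied programmatically. It assumes the anchor indices of the last [yolo] layer live on a "mask = ..." line, as in the standard darknet cfg format; check issue #256 and cfg/yolov3-tiny.cfg before relying on this.

```python
# Hypothetical one-off patch: change the anchor indices of the last [yolo] layer
# in cfg/yolov3-tiny.cfg from 0,1,2 to 1,2,3 to match the pretrained darknet weights.
# Assumes the indices live on a "mask = ..." line; verify against issue #256.
from pathlib import Path

cfg = Path('cfg/yolov3-tiny.cfg')
lines = cfg.read_text().splitlines()

mask_idx = [i for i, line in enumerate(lines) if line.strip().startswith('mask')]
if mask_idx and lines[mask_idx[-1]].replace(' ', '').endswith('=0,1,2'):
    lines[mask_idx[-1]] = 'mask = 1,2,3'
    cfg.write_text('\n'.join(lines) + '\n')
```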
Ah - very nice. So from reading that ticket, it looks like models that are trained using this code should be unaffected, but if we try to use the pre-trained models we will need to change the cfg anchors, is that right? If that is the case, I think we should be good to close this ticket, finally!
@TommoAsh yes, exactly! Models trained here should be unaffected. Great, I will close!
Describe the bug I have tried using test.py to get mAP scores for the yolov3-tiny.weights model, and the score I get is 0.177 - much lower than the 0.331 claimed by the original author of YOLO for that model. Note I have not been able to try the full-sized (yolov3.weights) model, as I haven't yet been able to access a GPU big enough to cope with it, so I can't say whether there is the same issue with that model or not.
To Reproduce I used commit d526ce0d118fcbbfae3e73d9627183b95ceb2b26 (after the mAP updates were merged). I then followed the instructions in the README, ending with this:
Expected behavior I was expecting the final mAP to be 0.331 (or close to it). I would ideally like to know whether this is an error in the mAP calculation, an error on my part, or an issue with the tiny model.
Desktop (please complete the following information): Base Docker image nvidia/cuda:10.1-cudnn7-runtime-ubuntu16.04, with Python 3.7.1 installed.
Finally Thanks for all the good work - it is much appreciated!