qqwweee / keras-yolo3

A Keras implementation of YOLOv3 (Tensorflow backend)
MIT License

Lower mAP compared to Darknet/paper results #35

Open nirbenz opened 6 years ago

nirbenz commented 6 years ago

Hi guys!

I am currently getting 31/30/27 mAP (COCO-style) and 55/53/49.5 mAP-50 with this implementation, which is a bit lower than what the paper reports. I was wondering if anyone else has experienced this and might have some intuition as to what's causing the drop?

Thanks! Nir

wizholy commented 6 years ago

How do you measure the mAP?

nirbenz commented 6 years ago

Using the COCO API.
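
For context, here is a minimal sketch of how such a COCO-style evaluation is typically run with pycocotools; the file names are placeholders for your ground-truth annotations and your exported detections in the standard COCO results format:

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder paths: point these at your ground truth and your detections file.
ann_file = 'annotations/instances_val2014.json'
det_file = 'detections_keras_yolo3.json'

coco_gt = COCO(ann_file)             # ground truth
coco_dt = coco_gt.loadRes(det_file)  # detections: [{image_id, category_id, bbox, score}, ...]

coco_eval = COCOeval(coco_gt, coco_dt, 'bbox')
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()                # prints AP / AP-50 / AP-75 etc., as pasted later in this thread
```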

szxwpp commented 6 years ago

@nirbenz Could you please help me clear up some confusion?

  1. Do you evaluate on the COCO dataset?
  2. Are 31/30/27 the results of three evaluation runs, or something else?
  3. If so, is one of the results mAP = 31 with mAP-50 = 55? Am I right?

nirbenz commented 6 years ago

  1. Yes.
  2. These are the results at the different input resolutions, for mAP (COCO-style) and mAP-50, respectively.
  3. Not sure I understand?

szxwpp commented 6 years ago

@nirbenz here is the result from the paper.

[image: YOLOv3 results table from the paper]

You can see the AP is 33 and the AP-50 is 57.9 (at 608x608). If your results are 31 and 55 at the same resolution, that seems reasonable.

nirbenz commented 6 years ago

A 2% difference in mAP is rather large, and I was wondering whether this is an issue with the Keras implementation (vs. the original Darknet), i.e. whether Keras loses accuracy compared to an identical Darknet-based model. Since this is the original model from the paper, losing 2% is rather strange!

pentageonate commented 6 years ago

Are you running the default setup (416x416), or did you modify it to run at 608x608? And how many epochs did you let the model train for?

qqwweee commented 6 years ago

It is certain that the inference result of the pretrained model on a given image is the same as in Darknet.

nirbenz commented 6 years ago

For clarification, I am using pjreddie's pretrained model converted from Darknet to Keras and have yet to train a model myself. The results I wrote above are for the 608/416/320 resolutions, respectively.

@qqwweee - I am using your inference code as-is, so I find this surprising. Is it possible that even the original (Darknet) model doesn't achieve the same results as in the paper?

Thanks!

Sanster commented 6 years ago

@nirbenz Are you running on COCO test2017? My result for the 416 resolution on test2017 is:

overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.271
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.457
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.290
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.105
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.284
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.399
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.236
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.318
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.321
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.124
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.333
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.483

mAP-50 = 0.457 is also lower than the 55.3 mentioned in the paper. I am trying to run at the 608 resolution.
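
As a side note, one way to try a different input resolution with this repo is to override the default when constructing the detector; this sketch assumes the YOLO class in yolo.py accepts a model_image_size keyword override (check your version of yolo.py, the exact keyword and constructor may differ):

```python
from PIL import Image
from yolo import YOLO  # this repo's yolo.py

# Assumption: the constructor merges keyword overrides into its defaults.
detector = YOLO(model_image_size=(608, 608))

image = Image.open('example.jpg')      # placeholder image path
result = detector.detect_image(image)  # returns the image with boxes drawn
result.show()
```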

nirbenz commented 6 years ago

It makes no sense to test on the 2017 test set, since the original paper/model uses the circa-2014 train/val split (in which train2014 and val2014 are joined for training and a held-out 5k subset of val2014 is kept for evaluation). Using the 5k subset from the original paper, I get this for 416x416:

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.299
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.528
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.306
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.128
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.327
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.449
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.261
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.378
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.385
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.179
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.418
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.562
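
To make that split concrete, here is a rough sketch of materializing such a 5k "minival"-style subset of val2014 as its own annotation file; the id list file name is hypothetical, and the actual 5k list used in the paper era was a fixed, published list rather than a fresh random sample:

```python
import json

# Placeholder inputs: full val2014 annotations and a published 5k minival id list.
with open('annotations/instances_val2014.json') as f:
    gt = json.load(f)

with open('coco_minival_ids.txt') as f:  # hypothetical file: one image id per line
    minival_ids = {int(line) for line in f if line.strip()}

subset = {
    'info': gt.get('info', {}),
    'licenses': gt.get('licenses', []),
    'categories': gt['categories'],
    'images': [im for im in gt['images'] if im['id'] in minival_ids],
    'annotations': [a for a in gt['annotations'] if a['image_id'] in minival_ids],
}

with open('instances_minival2014.json', 'w') as f:
    json.dump(subset, f)  # evaluate against this file with the COCO API as usual
```
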
AlphaRalph commented 6 years ago

I'm facing the same problem. I'm working with a custom dataset of ~32k images, and after conversion the performance (mAP) is much lower than with the original model; even after fine-tuning the converted Keras model I can't come close to the original precision. It is conspicuous that certain objects are still found precisely while others aren't found at all.

nguyeho7 commented 6 years ago

@AlphaRalph To be clear, you trained on your custom dataset using both the original Darknet implementation and this one, and found this one has a lower mAP?

AlphaRalph commented 6 years ago

I trained in the original Darknet with a good mAP, and after conversion (with no further training in Keras) the performance was significantly lower. Certain classes of objects weren't found at all after the conversion.

qqwweee commented 6 years ago

@AlphaRalph So you mean the same image gives a totally different result in Darknet inference and in Keras inference? I think that is a big problem. Could you explain the details further?

AlphaRalph commented 6 years ago

Yes, you're definitely right! What a pity; this project/repo is so great, but the inference performance in Keras can't keep up with the original Darknet. Meanwhile I have tried a few different approaches, like fine-tuning yolo.h5 without freezing the conv layers, but after two epochs the loss started increasing heavily. Fine-tuning yolo.h5 with freezing didn't work out either, since the loss didn't really decrease any further. I fear it may have something to do with the special trick they do in Darknet, where they split the image up into a 13x13 grid.
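
For reference, the usual two-stage recipe when fine-tuning a converted model in Keras is to freeze most of the backbone first and only unfreeze everything (with a smaller learning rate) once the loss has stabilized. A rough sketch of the freezing pattern only; the layer count and the placeholder loss are assumptions, and in this repo the real training graph wraps the model with its own yolo_loss Lambda layer (see train.py):

```python
from keras.models import load_model
from keras.optimizers import Adam

# 'yolo.h5' is the converted inference model; this only illustrates freezing,
# not the full loss wiring used by this repo's train.py.
model = load_model('yolo.h5', compile=False)

# Stage 1: freeze everything except the last few (detection-head) layers.
for layer in model.layers[:-3]:
    layer.trainable = False
model.compile(optimizer=Adam(lr=1e-3), loss='mse')  # placeholder loss for illustration
# model.fit(...)  # train until the loss plateaus

# Stage 2: unfreeze all layers and continue with a much smaller learning rate.
for layer in model.layers:
    layer.trainable = True
model.compile(optimizer=Adam(lr=1e-4), loss='mse')  # recompile so the change takes effect
# model.fit(...)
```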

pentageonate commented 6 years ago

If you’re training from scratch, how are the layers initialized? Having them initialized wrong will cause the gradient to diverge rapidly. I think Darknet also included some rather specific elements in the training and inference for layer normalization. I didn’t go back through the Keras model to see if all those elements were in place.

Those differences would absolutely cause the given model to behave differently when training.

qqwweee commented 6 years ago

Among the pipeline stages (preprocessing -> network computation -> postprocessing), I found preprocessing differs the most. In detail: Darknet's resize vs. PIL's resize, and padding with 0.5 vs. 128/255. The difference is small, but the unwelcome result is a lower mAP.
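
To illustrate the gap being described: Darknet letterboxes the image and fills the padding with exactly 0.5 (in normalized units), while this repo's PIL-based letterboxing fills with gray 128 before dividing by 255 (roughly 0.502), and the resize kernels differ too. A rough sketch of the PIL-style convention, not this repo's exact code:

```python
import numpy as np
from PIL import Image

def letterbox_pil(image, size=(416, 416), fill=(128, 128, 128)):
    """keras-yolo3-style letterbox: PIL bicubic resize, pad with gray 128, then /255."""
    iw, ih = image.size
    w, h = size
    scale = min(w / iw, h / ih)
    nw, nh = int(iw * scale), int(ih * scale)
    resized = image.resize((nw, nh), Image.BICUBIC)
    canvas = Image.new('RGB', size, fill)
    canvas.paste(resized, ((w - nw) // 2, (h - nh) // 2))
    return np.array(canvas, dtype=np.float32) / 255.0  # padding ends up at 128/255, about 0.502

# Darknet, by contrast, resizes in float and fills the padding with exactly 0.5,
# so the padded pixels (and the interpolation) differ slightly from the PIL version.
```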

nirbenz commented 6 years ago

There's also the issue of letterboxing, which in Darknet happens under the hood. Those differences could be eliminated by running YOLOv3 through Darknet's original code while doing most of the preprocessing in Python beforehand. I'll try that and report back.

gittigxuy commented 6 years ago

@nirbenz, have you finished your test? Could you please report your results?

nirbenz commented 6 years ago

Nope, I actually haven't gotten around to it. I can confirm that I am getting comparable results from both framework implementations (Darknet and Keras for YOLOv3). Keras is still a bit lower, but since it's a non-native implementation I tend to be forgiving (although it'd be wonderful if no differences appeared at all, as I have experienced when converting Caffe models to MXNet, for instance). In my experience, fine-tuning on the target framework usually eliminates all differences. Not all frameworks share under-the-hood implementations, and this can sometimes cause differences. I haven't tried performing the preprocessing in Python for Darknet though, as it proved slightly less straightforward than I thought.

Did anyone else get roughly the same results as I did for the COCO-17 test set?

katerynaCh commented 5 years ago

@nirbenz what kind of split are you using for train / test in the table you reported above?

707346129 commented 5 years ago

> I'm facing the same problem. I'm working with a custom dataset with ~32k images and after conversion the performance (mAP) is way lower than the original model, and even after fine-tuning the converted Keras model I can't reach the original precision by far. It is conspicuous that certain objects can still be found precisely and others aren't found at all.

How did it turn out for you? I made a test script based on yolo.detect_image to generate the detections JSON and evaluated on COCO val2017 using the COCO API. The mAP is lower than 0.1! I suspect the problem is with the dataset choice or with the test script.
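
An mAP below 0.1 on COCO usually points to a formatting problem rather than the model. Two common culprits are the bbox format (COCO expects [x, y, width, height] in pixels, not [x1, y1, x2, y2]) and the category_id mapping (the 80 contiguous class indices used by most YOLO code must be mapped back to COCO's non-contiguous ids). A hedged sketch of exporting detections, where detect_boxes is a hypothetical helper returning per-detection (x1, y1, x2, y2, score, class_index) tuples:

```python
import json
from pycocotools.coco import COCO

coco_gt = COCO('annotations/instances_val2017.json')
# Map contiguous class index (0..79) -> original COCO category id (1..90).
coco_cat_ids = sorted(coco_gt.getCatIds())

results = []
for image_id in coco_gt.getImgIds():
    # detect_boxes() is a hypothetical wrapper around yolo.detect_image that
    # returns (x1, y1, x2, y2, score, class_index) per detection.
    for x1, y1, x2, y2, score, cls in detect_boxes(image_id):
        results.append({
            'image_id': image_id,
            'category_id': coco_cat_ids[cls],    # remap to COCO ids
            'bbox': [x1, y1, x2 - x1, y2 - y1],  # [x, y, w, h], not x1y1x2y2
            'score': float(score),
        })

with open('detections_keras_yolo3.json', 'w') as f:
    json.dump(results, f)
```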

YunYang1994 commented 5 years ago

https://github.com/YunYang1994/tensorflow-yolov3 hope it helps you

sanmianjiao commented 5 years ago

> https://github.com/YunYang1994/tensorflow-yolov3 hope it helps you

Does your project reach the same mAP as the paper?

fourth-archive commented 5 years ago

@sanmianjiao @YunYang1994 @707346129 @katerynaCh this YOLOv3 tutorial may help you: https://github.com/ultralytics/yolov3/wiki/Train-Custom-Data

The accompanying repository works on macOS, Windows, and Linux, includes multi-GPU and multithreading support, and performs inference on images, videos, and webcams, as well as through an iOS app. It also tests to slightly higher mAPs than Darknet, including with the latest YOLOv3-SPP.weights (60.7 COCO mAP), and offers the ability to train custom datasets from scratch to Darknet-level performance, all using PyTorch :) https://github.com/ultralytics/yolov3