Exact settings to train the provided SSD models on COCO dataset #5090

Open szm-R opened 6 years ago

szm-R commented 6 years ago

Describe the problem

I was wondering if there is any document or explanation to guide one through training the provided SSD configs on the COCO dataset. I believe the provided config files are set up for fine-tuning the COCO pre-trained weights on another dataset. What I want to know is how to fine-tune an SSD model, say ssd_inception_v2, on COCO itself, starting from ImageNet pre-trained weights. I have plenty of experience fine-tuning COCO pre-trained models on another dataset (UA-DETRAC), and those runs have all been successful; I have also fine-tuned ImageNet pre-trained models on the same dataset, and those work as well. However, I'm not able to train the provided configs on COCO to reproduce the reported results. I have experimented with quite a number of training hyper-parameters, and all attempts have failed (by failing I mean getting an AP of 0!).
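For concreteness, the kind of config change I mean is roughly this, sketched against the TF1 Object Detection API pipeline proto; the checkpoint paths below are placeholders, not actual files:

```python
import tensorflow as tf
from google.protobuf import text_format
from object_detection.protos import pipeline_pb2

# Load a model-zoo config and retarget it at an ImageNet classification
# checkpoint instead of a COCO detection checkpoint.
pipeline = pipeline_pb2.TrainEvalPipelineConfig()
with tf.gfile.GFile('ssd_inception_v2_coco.config', 'r') as f:
    text_format.Merge(f.read(), pipeline)

# Placeholder path to an ImageNet-trained Inception V2 checkpoint.
pipeline.train_config.fine_tune_checkpoint = '/path/to/inception_v2.ckpt'
# It is a classification checkpoint, not a detection one, so this must be False.
pipeline.train_config.from_detection_checkpoint = False

with tf.gfile.GFile('ssd_inception_v2_imagenet_init.config', 'w') as f:
    f.write(text_format.MessageToString(pipeline))
```

The input paths for the COCO train/val TFRecords and the label map would of course also need to be set in the same config.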

szm-R commented 6 years ago

Hello again,

Please, can't anyone help me with this?! I'm really stuck! I have trained so many models on many different datasets, but I can't get one single detection out of the ones I train on COCO. This time I used the newly added model, ssd_mobilenet_v1_fpn (config file). I have already trained some variations of this model on other datasets by fine-tuning the provided pre-trained weights with the same config file, and all of them work without a problem. On COCO, however, everything breaks down. Here are my training graphs:

[TensorBoard screenshot: training loss curves]

As you can see, the classification loss has decreased to a very low value (which is itself a bit odd compared to my other models trained with the same settings), though the localization loss is still rather high. One strange thing, though, is the average number of positive anchors per image:

[TensorBoard screenshot: average number of positive anchors per image]

In all the other models I have trained so far, the average number of positive anchors per image roughly matches the average number of ground-truth boxes per image, but here the former is 10 times the latter! And when I run the frozen graph on validation images, there are no detections whatsoever, only lots of boxes with very low scores (all around 0.006 or 0.007; the numbers vary from model to model but stay in roughly the same range). One would think that with this many positive anchors per image there should be many correct detections, of course along with many false ones too.
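For reference, my understanding of the matcher is that an anchor counts as positive when its best IoU with any ground-truth box clears the matched threshold (0.5 in the default SSD configs), so several anchors can legitimately match one box. A rough numpy sketch of that counting:

```python
import numpy as np

def iou_matrix(anchors, gt_boxes):
    """Pairwise IoU between anchors [N, 4] and ground-truth boxes [M, 4],
    with rows [ymin, xmin, ymax, xmax] in normalized coordinates."""
    ymin = np.maximum(anchors[:, None, 0], gt_boxes[None, :, 0])
    xmin = np.maximum(anchors[:, None, 1], gt_boxes[None, :, 1])
    ymax = np.minimum(anchors[:, None, 2], gt_boxes[None, :, 2])
    xmax = np.minimum(anchors[:, None, 3], gt_boxes[None, :, 3])
    inter = np.clip(ymax - ymin, 0, None) * np.clip(xmax - xmin, 0, None)
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    union = area_a[:, None] + area_g[None, :] - inter
    return inter / union

def count_positive_anchors(anchors, gt_boxes, matched_threshold=0.5):
    """An anchor is positive if its best IoU over all ground-truth boxes
    clears the threshold, so one object can own many anchors."""
    best_iou = iou_matrix(anchors, gt_boxes).max(axis=1)
    return int((best_iou >= matched_threshold).sum())
```

So a positive count above the ground-truth count is expected to some degree; a 10x ratio is what surprises me here.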

I am training the model using legacy/train.py, and the config file I am using is the exact one linked above. I have also tested the TFRecords by reading the box coordinates and labels and drawing them on the images; they all seem to be correct.
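The TFRecord check I ran was along these lines; the file name is a placeholder, and the feature keys are the standard ones the API's dataset tools write:

```python
import tensorflow as tf

def inspect_tfrecord(path, max_examples=5):
    """Print the box coordinates and class labels of the first few examples."""
    for i, record in enumerate(tf.python_io.tf_record_iterator(path)):
        if i >= max_examples:
            break
        example = tf.train.Example()
        example.ParseFromString(record)
        feat = example.features.feature
        xmins = feat['image/object/bbox/xmin'].float_list.value
        ymins = feat['image/object/bbox/ymin'].float_list.value
        xmaxs = feat['image/object/bbox/xmax'].float_list.value
        ymaxs = feat['image/object/bbox/ymax'].float_list.value
        labels = feat['image/object/class/label'].int64_list.value
        print('example %d: %d boxes, labels %s' % (i, len(labels), list(labels)))
        for box in zip(ymins, xmins, ymaxs, xmaxs):
            print('  normalized [ymin, xmin, ymax, xmax]:', box)

inspect_tfrecord('coco_train.record-00000-of-00100')  # placeholder shard name
```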

I would really appreciate any tips from you guys; it looks to me like I'm missing some very simple, obvious point somewhere in the process...

Shubuo commented 5 years ago

Is there any progress? I'm curious about the same thing.

szm-R commented 5 years ago

@Shubuo I eventually let it go and decided to make do with the available pre-trained models. However, I come back to this from time to time, and it's still one of my biggest open questions in the object detection field. As I said, I have trained various models on numerous datasets and have never encountered such a problem anywhere else!

dasmehdix commented 5 years ago

Hi, did you find a solution?

szm-R commented 5 years ago

@dasmehdix, unfortunately, no!

ghost commented 4 years ago

I was trying to reproduce the published results in the model zoo table, but had no luck either. Evaluating the ssd_mobilenet_v1_fpn checkpoint given in the model zoo on COCO gave me about 5% higher AP than fine-tuning the model myself. Does anyone have a clue about how to reproduce the published results? I suspect the pipeline.config files shipped with the models are not the ones used originally.
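One way to sanity-check the numbers independently of the API's eval binary, assuming the detections have been exported to COCO-format JSON, is to score them with pycocotools directly; the file names here are placeholders:

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder paths: COCO ground truth and detections in COCO result format.
coco_gt = COCO('annotations/instances_val2017.json')
coco_dt = coco_gt.loadRes('ssd_mobilenet_v1_fpn_detections.json')

coco_eval = COCOeval(coco_gt, coco_dt, iouType='bbox')
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()  # prints the standard COCO AP/AR table
```

If the gap shows up here too, it is the training and not the evaluation pipeline that differs from whatever produced the zoo numbers.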