rykov8 / ssd_keras

Port of Single Shot MultiBox Detector to Keras
MIT License

Retraining SSD with Udacity's self driving car dataset #94

Open cocoza4 opened 7 years ago

cocoza4 commented 7 years ago

Hi all,

I'm trying to apply this implementation of SSD to Udacity's self-driving car dataset (see https://github.com/udacity/self-driving-car/tree/master/annotations). It turned out that the original weights (weights_SSD300.hdf5) performed better than the ones I got after retraining on the new dataset (with some layers' weights fixed). I'm not sure whether I missed something. Could anyone share their steps and/or point out anything I might have missed?

Here's what I did.

  1. create a ground truth of the same format as gt_pascal.pkl with the new dataset (bounding-box coordinates normalized and labels one-hot encoded); a sketch of this step is shown after this list.
  2. modify NUM_CLASSES and load ground-truth accordingly in SSD_training.ipynb
  3. train the model.
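
For context, here is a minimal sketch of what step 1 amounts to; the function name, file name, image size and 3-class setup are illustrative, not code from my notebook:

import pickle
import numpy as np

def encode_box(xmin, ymin, xmax, ymax, class_idx, num_classes, img_w, img_h):
    # normalize pixel coordinates to [0, 1] and one-hot encode the class,
    # matching the gt_pascal.pkl layout: [xmin, ymin, xmax, ymax, one-hot...]
    coords = [xmin / img_w, ymin / img_h, xmax / img_w, ymax / img_h]
    onehot = np.zeros(num_classes)
    onehot[class_idx] = 1.0
    return np.hstack([coords, onehot])

# the ground-truth dict maps image filename -> one encoded row per object
gt = {'frame_0001.jpg': np.array([encode_box(100, 200, 300, 400, 0, 3, 1920, 1200)])}
with open('ground-truth.pkl', 'wb') as f:
    pickle.dump(gt, f)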

The new model predicted poorly, as shown below [image]. Notice that each car in the image is predicted with one correct bounding box with the highest confidence plus a few variants of that bounding box. I guess these come from the prior bounding boxes we set in prior_boxes_ssd300.pkl.

The model with the original weights, however, predicted better [image]. This time, notice that only the correct bounding boxes are predicted; its prior boxes are not shown. In both cases, I filtered out predictions with confidence < 0.7.

The validation error started at 2.56 and gradually fell to 1.27 by the 30th epoch. I noticed that after the 20th epoch the model started to converge, as the error didn't go down much.

I also have a few more questions

  1. In SSD_training.ipynb, it seems that not every pre-trained layer is fixed: freeze = ['input_1', 'conv1_1', 'conv1_2', 'pool1', 'conv2_1', 'conv2_2', 'pool2', 'conv3_1', 'conv3_2', 'conv3_3', 'pool3'] #, 'conv4_1', 'conv4_2', 'conv4_3', 'pool4'] (lines preceded with # are ignored). Why aren't conv4_ and conv5_, for example, fixed as well?

  2. The preprocessing step uses a utility, keras.applications.imagenet_utils.preprocess_input, to normalize data in Generator.generate(). Is this utility made specifically for the ImageNet dataset?

  3. After reading the documentation, I still don't know what neg_pos_ratio in MultiboxLoss(NUM_CLASSES, neg_pos_ratio=2.0) does. How does it affect training?

  4. Given that Udacity's dataset is huge (almost 5 GB in size), does training the whole model from scratch (no layers with fixed weights) make more sense?

thanks Peeranat F.

AloshkaD commented 7 years ago

Hi Peeranat, I'll try to answer your question based on the little information that you have provided. Let's start with the performance that you see when you test your classifier after retraining vs. using the provided weights. SSD has a rather odd way of loading the ground-truth bounding boxes, and it looks like there is something wrong with the way you are providing yours. Again, I will need to see your function to be able to tell. Here is the bounding-box array for one sample image from the VOC2007 dataset as a reference (each row is the normalized [xmin, ymin, xmax, ymax] followed by the one-hot class encoding):

[[ 0.352  0.34615385  0.946  0.53205128  0.  0.  0.  ...  1. ]
 [ 0.314  0.31410256  0.362  0.48397436  0.  0.  0.  ...  1. ]
 [ 0.08   0.42307692  0.108  0.54807692  0.  0.  0.  ...  1. ]
 [ 0.108  0.42307692  0.136  0.53205128  0.  0.  0.  ...  1. ]
 [ 0.136  0.42628205  0.152  0.53846154  0.  0.  0.  ...  1. ]
 [ 0.164  0.42628205  0.186  0.49679487  0.  0.  0.  ...  1. ]
 [ 0.148  0.42628205  0.166  0.49679487  0.  0.  0.  ...  1. ]
 [ 0.392  0.39102564  0.434  0.58653846  0.  0.  0.  ...  1. ]
 [ 0.026  0.39423077  0.078  0.58333333  0.  0.  0.  ...  1. ]
 [ 0.004  0.34294872  0.32   0.49038462  0.  0.  0.  ...  1. ]]

The Udacity dataset is big, with > 10,000 images (depending on whether you are using both provided datasets or just one of them), but it does not sound like 30 epochs are sufficient to train on all those samples. For retraining on VOC2007, for example, it took me 400 epochs with 500 augmented images each. The network converges really slowly, unlike U-Net, SegNet, Nvidia's, comma.ai's or AlexNet. You have provided the error value, but that isn't the only thing to look at; you also have the validation accuracy and the overall accuracy. Also, what loss function have you used? Is it an IoU? These numbers have different meanings.

Let me now answer your 4 questions based on the limited information you've provided.

1- We are fixing the freeze = ['input_1', 'conv1_1', 'conv1_2', 'pool1', 'conv2_1', 'conv2_2', 'pool2', 'conv3_1', 'conv3_2', 'conv3_3', 'pool3'] layers so as not to mess up the weights that were pretrained on VGG. If you remove these layers from the freeze list and train the entire network, you will actively change the weight values in the pretrained initialization (weights_SSD300.hdf5). Instead, we retrain the later layers, which in this case are ['conv4_1', 'conv4_2', 'conv4_3', 'pool4']. This is something you will study in term 3 of Udacity's SDC; a minimal sketch of the freezing step is shown below.
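
A rough sketch of that freezing step (assuming the model object and MultiboxLoss/NUM_CLASSES from SSD_training.ipynb; the optimizer settings here are just illustrative):

from keras.optimizers import Adam

# keep the early VGG layers fixed so the pretrained weights from
# weights_SSD300.hdf5 are not updated during retraining
freeze = ['input_1', 'conv1_1', 'conv1_2', 'pool1',
          'conv2_1', 'conv2_2', 'pool2',
          'conv3_1', 'conv3_2', 'conv3_3', 'pool3']
for layer in model.layers:
    if layer.name in freeze:
        layer.trainable = False

# recompile so the new trainable flags take effect
model.compile(optimizer=Adam(lr=3e-4),
              loss=MultiboxLoss(NUM_CLASSES, neg_pos_ratio=2.0).compute_loss)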

2- I can't answer this question, it isn't clear to me.

3- The short answer is yes, it is part of the multibox loss function and it affects how the loss is calculated. It basically sets the ratio of negative to positive boxes kept in the loss (hard negative mining): with neg_pos_ratio=2.0, at most twice as many negative (background) boxes as positive matches contribute to the confidence loss. A rough sketch of the idea is shown below.
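
Conceptually it works something like this numpy sketch (illustrative only, not the repo's exact TF implementation):

import numpy as np

def hard_negative_mask(conf_loss, positive_mask, neg_pos_ratio=2.0):
    # conf_loss: per-prior confidence loss; positive_mask: True for priors
    # matched to a ground-truth box. Keep only the highest-loss negatives,
    # capped at neg_pos_ratio times the number of positives.
    num_neg = int(neg_pos_ratio * positive_mask.sum())
    neg_loss = np.where(positive_mask, -np.inf, conf_loss)
    keep = np.argsort(neg_loss)[::-1][:num_neg]
    mask = np.zeros_like(positive_mask)
    mask[keep] = True
    return mask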

4- Yes, it does make sense. You can still initialize with weights_SSD300.hdf5, but be prepared for a long, very long computation. You know what, I have a lot of free AWS credits and can afford weeks of computation; I might do it for you if you like. But first I will need to make sure you are feeding your annotations correctly so as not to waste that computation. I hope this all makes sense to you.

cocoza4 commented 7 years ago

Hi @AloshkaD, thanks for your reply. It helps me and my team a lot. My team is taking Udacity's self-driving car nanodegree program and developing a self-driving car software pipeline that senses and perceives the environment of the place where we live. I am responsible for the object-detection module, which is implemented with SSD. Now let me clarify your points.

Regarding your comment that the ground-truth bounding boxes (annotations) provided to the SSD seem to be incorrect: I have checked the format of gt_pascal.pkl, and it is similar to ground-truth.pkl, which stores my annotations. Check out https://github.com/cocoza4/ssd_keras/blob/master/groundtruth_builder.ipynb to see how the annotations were built. ground-truth.pkl is the pickle file that contains the annotations of both Udacity datasets. For now I have limited the classes to only car, pedestrian, and truck.

You also suggested that the number of epochs (30) is not sufficient. I will try a few hundred epochs, but I'm not sure how long it will take to train on both Udacity datasets. A few days? Or weeks?

The loss function I used was the one described in the paper, which the code in this repo already implements: (1/N)(localization loss + confidence loss), where N is the number of matched default boxes. See MultiboxLoss.compute_loss in https://github.com/cocoza4/ssd_keras/blob/master/ssd_training.py.

Regarding the remaining questions I asked:

  1. My question wasn't clear enough, sorry. If you look into the implementation of keras.applications.imagenet_utils.preprocess_input [image], it normalizes features with hard-coded numbers. I thought those numbers were computed from the ImageNet dataset? When I apply the function to the new dataset, the features won't be zero-centered as they are supposed to be (see the sketch after these two points).

  2. I will initialize the weights with weights_SSD300.hdf5, train the model for 200 epochs, and see whether it improves the accuracy or not. And it would be great if you could train the model too, so we can cross-check what I might have done wrong. The source code for training can be found in my repo: https://github.com/cocoza4/ssd_keras.
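
Regarding point 1, as far as I can tell the older Keras imagenet_utils.preprocess_input roughly does the following ('caffe'-style preprocessing; sketch for illustration only), so the hard-coded numbers are the ImageNet per-channel means and the data ends up zero-centered with respect to ImageNet statistics rather than the new dataset's own statistics:

import numpy as np

def preprocess_input_sketch(x):
    x = np.asarray(x, dtype='float64')
    x = x[..., ::-1]        # RGB -> BGR
    x[..., 0] -= 103.939    # subtract ImageNet blue-channel mean
    x[..., 1] -= 116.779    # green
    x[..., 2] -= 123.68     # red
    return x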

Again, my colleagues and I are very thankful to you for your help.

thanks Peeranat F.

AloshkaD commented 7 years ago

Great, I will comment on your questions/remarks soon. I have a self-driving car engineer interview on Tuesday and I'm preparing for it :D I just wanted to make a quick remark about your training. Training on a dataset the size of Udacity's shouldn't take weeks, but rather a day or a few days, assuming you are training on a p2.xlarge (a TitanX-class GPU with 12 GB of RAM). It's very important to observe your loss and accuracy curves and determine when to stop the training once they go flat for many epochs (I normally wait for 10, depending on the images in each epoch, to make sure I'm not temporarily stuck in a local minimum). I highly recommend watching Andrew Ng's nuts and bolts of deep learning to help you determine what to do if your model isn't learning: https://www.youtube.com/watch?v=F1ka6a13S9I
Also, try to visualize your learning details with TensorBoard. It's a very simple callback that you need to add to your code; I can write it for you if you don't know how. Will get back to you soon.
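
A minimal version of that callback looks roughly like this (assuming a standard Keras setup; the log directory is just an example):

from keras.callbacks import TensorBoard

# log training/validation metrics for TensorBoard under ./logs
tb_callback = TensorBoard(log_dir='./logs', write_graph=True)
# then add it to the callbacks list passed to fit_generator(), e.g.
# model.fit_generator(..., callbacks=[checkpoint, tb_callback])

and then run tensorboard --logdir=./logs to view the curves.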

cocoza4 commented 7 years ago

Just finished watching it. Great video. Thank you and good luck with the interview :)

AloshkaD commented 7 years ago

@cocoza4 I finally had the chance to look at your code; I hope it isn't too late :) The code does look correct. I've tested the order of xmin/xmax/ymin/ymax on my end with this code:

import csv
import numpy as np  # needed for np.array below

reader = csv.reader(open('object-dataset/labels_SSD.csv'))
#reader.columns= ['Frame',  'xmin', 'xmax', 'ymin','ymax', 'class', 'Label','Unnamed']

gt = {}

for row in reader:
    ## replace the classes with hard-coded one-hot columns
    #[1,0,0,0,0]=car
    #[0,1,0,0,0]=truck
    #[0,0,1,0,0]=biker
    #[0,0,0,1,0]=pedestrian
    #[0,0,0,0,1]=trafficlight
    if row[5] == 'car':
        row[5] = '1'
    else:
        row[5] = '0'
    ##
    if row[6] == 'truck':
        row[6] = '1'
    else:
        row[6] = '0'
    ##
    if row[7] == 'biker':
        row[7] = '1'
    else:
        row[7] = '0'
    ##
    if row[8] == 'pedestrian':
        row[8] = '1'
    else:
        row[8] = '0'
    ##
    if row[9] == 'trafficLight':
        row[9] = '1'
    else:
        row[9] = '0'

    key = row[0]
    #if key in gt:

    gt[key] = [int(x) for x in row[1:]]
    gt[key] = np.array([gt[key]]) 
print (gt) 

The only way for me to check whether the loss function cited in the paper works or not is to train on the Udacity datasets and test for myself. I guess I'll do it over the weekend and let the network train for a few days. As to question 2, I don't think I can help there, sorry. On point 4, have you trained yet, and what is the outcome? I'll check your code tonight and tomorrow, I promise!

AloshkaD commented 7 years ago

@cocoza4 which specific part of the code do you want me to look at?

cocoza4 commented 7 years ago

@AloshkaD Thanks for your reply. It is not too late :) I have trained my network with weights initialized from weights_SSD300.hdf5, set to train for 200 epochs with a batch size of 16. As the screenshot below shows, the loss remained unchanged at 1.23 from epoch 35 through epoch 66, so I stopped training at that point.

I then predicted the same image from the first comment in the thread.

It looks a bit better, but there's still room for improvement. Ignore the change in the color of the bounding boxes; they are all cars.

Now I just got TensorBoard to work with Keras :D finally. What I'll do next is evaluate the trained model every epoch on a validation set using loss, accuracy, precision/recall, F1 and ROC, as you suggested earlier, and see how the model learns.

Could you have a look into https://github.com/cocoza4/ssd_keras/blob/master/SSD_training.ipynb, especially the Generator class and how I train the model? I have modified the Generator class to augment images only with brightness, horizontal flip, image translation and stretching. Right now the random_sized_crop method has a bug, so I applied translation and stretching instead. The implementation of image stretching and translation is taken from https://github.com/udacity/self-driving-car/blob/master/vehicle-detection/u-net/main_car_Unet_train_IoU.ipynb.
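
For reference, the kind of augmentation I mean looks roughly like this (a sketch assuming images as numpy arrays and boxes normalized to [0, 1] in [xmin, ymin, xmax, ymax] order, not the actual Generator code):

import numpy as np

def random_brightness(img, low=0.5, high=1.5):
    # scale pixel intensities by a random factor and clip to the valid range
    return np.clip(img * np.random.uniform(low, high), 0, 255)

def horizontal_flip(img, boxes):
    # mirror the image and the normalized x-coordinates of each box
    img = img[:, ::-1]
    boxes = boxes.copy()
    boxes[:, [0, 2]] = 1.0 - boxes[:, [2, 0]]
    return img, boxes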

I forgot to tell you that the Udacity datasets are heavily biased in favour of the car class.

See https://github.com/cocoza4/ssd_keras/blob/master/groundtruth_builder.ipynb for details. This will also hurt the performance of the network.

The paper uses mAP as an evaluation metric, and this repo doesn't provide a utility to measure it. I will implement it based on the answer in this thread, https://stats.stackexchange.com/questions/260430/average-precision-in-object-detection, which also takes IoU into account. Does this make sense to you?
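
The matching step I have in mind would look roughly like this (illustrative sketch: a detection counts as a TP when its IoU with an unmatched ground-truth box exceeds a threshold):

import numpy as np

def iou(a, b):
    # a, b: [xmin, ymin, xmax, ymax]
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match_detections(detections, gt_boxes, iou_thresh=0.5):
    # detections must be sorted by descending confidence;
    # each ground-truth box can be matched at most once
    matched, tp, fp = set(), 0, 0
    for det in detections:
        ious = [iou(det, g) if i not in matched else 0.0
                for i, g in enumerate(gt_boxes)]
        best = int(np.argmax(ious)) if ious else -1
        if best >= 0 and ious[best] >= iou_thresh:
            tp += 1
            matched.add(best)
        else:
            fp += 1
    fn = len(gt_boxes) - len(matched)
    return tp, fp, fn

Precision/recall (and hence AP) then follow from the TP/FP/FN counts at each confidence level.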

How's the training going?

thanks Peeranat F.

AloshkaD commented 7 years ago

@Peeranat It looks much better. We need to look at the training accuracy and validation accuracy.

I've noticed the biases in the Udacity datasets, and it is highly recommended to create a generator that augments the images in a way that makes all classes roughly equal in size (I can explain further if you need; a sketch is below). Or, you know what, treat cars and trucks as cars and find an open-source dataset for pedestrians. The imbalance is really big.
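
One simple way to do that, as a sketch, is to oversample images containing the rare classes when building the training key list (the names and the column layout here are illustrative):

import numpy as np

def oversample_rare(keys, gt, rare_class_col, factor=3):
    # repeat keys whose ground truth contains the rare class so each epoch
    # sees them roughly `factor` times more often than before
    rare = [k for k in keys if gt[k][:, rare_class_col].any()]
    balanced = list(keys) + rare * (factor - 1)
    np.random.shuffle(balanced)
    return balanced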

Yeah, try to implement mAP, not so important but helpful. I haven't started the training on Udacity's datasets yet.

I guess tomorrow I'll put things together and put the code on AWS. I may wait until you implement mAP so I can put all that together at once.

Based on my experience, augmentation can improve the accuracy by up to 14%. I read in some papers up to 11%, but I don't remember which paper it was, so I can't cite it here.

I'll look at the code parts that you've referenced tomorrow.

Btw, I'm translating YOLO9000 from its original C implementation into Python and TF. I will test that on Udacity's datasets too and share the link with you when it's done, so you can test and compare if you are interested.

AloshkaD commented 7 years ago

Can you try to change the bbox scale factor in

def predict(model, image_array, prior_boxes, original_image_shape,
            num_classes=21, lower_probability_threshold=.1,
            iou_threshold=.5, background_index=0,
            box_scale_factors=[.1, .1, .2, .2]):

and see if that improves the prediction? Don't retrain; this function is in infrence.py and is called to predict on an image using the weights that you load.

Also try to change the probability_threshold for the IoU and see if this changes anything; I'm sure it will.

cocoza4 commented 7 years ago

I can't find infrence.py in the repo. Are you sure it's from the same project?

Btw, http://www.vision.caltech.edu/Image_Datasets/CaltechPedestrians/ seems promising for pedestrian dataset.

AloshkaD commented 7 years ago

That dataset looks good! I don't remember if I created this or if it's from the repo; here it is:

import numpy as np
from .preprocessing import preprocess_images
from .boxes import decode_boxes
from .boxes import filter_boxes
from .preprocessing import resize_image_array
from .boxes import denormalize_box
#from .boxes import apply_non_max_suppression
from .tf_boxes import apply_non_max_suppression

def predict(model, image_array, prior_boxes, original_image_shape,
            num_classes=21, lower_probability_threshold=.1,
            iou_threshold=.5, background_index=0,
            box_scale_factors=[.1, .1, .2, .2]):

    image_array = image_array.astype('float32')
    input_size = model.input_shape[1:3]
    image_array = resize_image_array(image_array, input_size)
    image_array = np.expand_dims(image_array, axis=0)
    image_array = preprocess_images(image_array)
    predictions = model.predict(image_array)
    predictions = np.squeeze(predictions)
    decoded_predictions = decode_boxes(predictions, prior_boxes,
                                              box_scale_factors)
    selected_boxes = filter_boxes(decoded_predictions,
                num_classes, background_index,
                lower_probability_threshold)
    if len(selected_boxes) == 0:
        return None
    selected_boxes = denormalize_box(selected_boxes, original_image_shape)
    selected_boxes = apply_non_max_suppression(selected_boxes, iou_threshold)
    return selected_boxes

I'm looking at your code now, and I started with the GT maker (https://github.com/cocoza4/ssd_keras/blob/master/groundtruth_builder.ipynb). In the last line you are dumping dataset into ground-truth.pkl. 1- Where is dataset? It does not exist! 2- I've loaded ground-truth.pkl from your repo to inspect it. The crowdai sample is a bit off!


'object-dataset/1478020338695742081.jpg': array([[ 0.00416667,  0.48333333,  0.12604167,  0.61333333,  1.        ,
          0.        ,  0.        ],
        [ 0.603125  ,  0.465     ,  0.67083333,  0.555     ,  1.        ,
          0.        ,  0.        ],
        [ 0.68333333,  0.46      ,  0.76458333,  0.55      ,  1.        ,
          0.        ,  0.        ]]),
 'object-detection-crowdai/1479504186358246947.jpg': array([[  4.45312500e-01,   4.75833333e-01,   4.90625000e-01,
           5.35833333e-01,   1.00000000e+00,   0.00000000e+00,
           0.00000000e+00],
        [  7.11979167e-01,   4.56666667e-01,   9.43229167e-01,
           6.40000000e-01,   1.00000000e+00,   0.00000000e+00,
           0.00000000e+00],
        [  6.42708333e-01,   4.41666667e-01,   7.78645833e-01,
           5.87500000e-01,   1.00000000e+00,   0.00000000e+00,
           0.00000000e+00],
        [  6.09895833e-01,   4.60000000e-01,   6.64062500e-01,
           5.77500000e-01,   1.00000000e+00,   0.00000000e+00,
           0.00000000e+00],
        [  5.78645833e-01,   4.75833333e-01,   6.29687500e-01,
           5.64166667e-01,   1.00000000e+00,   0.00000000e+00,
           0.00000000e+00],
        [  5.58333333e-01,   4.73333333e-01,   6.07812500e-01,
           5.41666667e-01,   1.00000000e+00,   0.00000000e+00,
           0.00000000e+00],
        [  5.35416667e-01,   4.59166667e-01,   5.89583333e-01,
           5.38333333e-01,   1.00000000e+00,   0.00000000e+00,
           0.00000000e+00],
        [  1.57812500e-01,   4.84166667e-01,   2.03645833e-01,
           5.30000000e-01,   1.00000000e+00,   0.00000000e+00,
           0.00000000e+00],
        [  1.34895833e-01,   4.87500000e-01,   1.57291667e-01,
           5.39166667e-01,   1.00000000e+00,   0.00000000e+00,
           0.00000000e+00],
        [  3.54166667e-02,   4.73333333e-01,   1.16666667e-01,
           5.45833333e-01,   1.00000000e+00,   0.00000000e+00,
           0.00000000e+00],
        [  5.20833333e-04,   4.85833333e-01,   2.39583333e-02,
           5.41666667e-01,   1.00000000e+00,   0.00000000e+00,
           0.00000000e+00],
        [  3.12500000e-01,   4.81666667e-01,   3.31250000e-01,
           5.09166667e-01,   1.00000000e+00,   0.00000000e+00,
           0.00000000e+00]]),

Your generator seems all right to me; I'll test it and cross-check. Pay attention to the metrics: precision, recall and F1 have been removed in Keras 2.0.

AloshkaD commented 7 years ago

@cocoza4 In the sample from ground-truth.pkl above, you are supposed to have one class more than the classes you are classifying. That class is for background. So there is definitely something missing in your conversion into the pkl file. Again, the ground-truth maker isn't complete, and without the code I can't verify or tell what's wrong.

cocoza4 commented 7 years ago

@AloshkaD Sorry for the late reply. I've been implementing mAP and preparing the pedestrian dataset as you suggested; almost finished now. So, to answer your questions:

  1. I didn't include the dataset in my repo because it's too huge (almost 5 GB in size). However, you can download both datasets (CrowdAI and Autti) from https://github.com/udacity/self-driving-car/tree/master/annotations

The screenshot below shows the folder structure where both datasets are located. Under the root folder, ssd_keras, create the folders object-dataset/ and object-detection-crowdai/, which is where the Autti and CrowdAI datasets go, respectively. Make sure you use the labels_crowdai.csv provided at https://github.com/udacity/self-driving-car/tree/master/annotations for the CrowdAI label annotations, as the original labels.csv has an incorrect column-header order. The GT builder will load the datasets from these folders.

  2. Regarding "2- I've loaded ground-truth.pkl from your repo to inspect it. The crowdai sample is a bit off!": I'm not sure what you mean by "a bit off"?

Good catch! At first I thought we didn't need to worry about the background class in the pickle file, but we do in SSD_training.ipynb (we add 1 to the NUM_CLASSES variable to account for the background). If you look at gt_pascal.pkl as an example, some annotations have a 1 in the first element of the one-hot encoded label. See the screenshot below [image].

I'm not sure whether you have already looked into gt_pascal.pkl or not. I will fix this by treating the first class label as background, which will always be 0. Therefore, all classes will now be [background, car, pedestrian]. I decided to combine the truck and car classes into a single car class.

thanks Peeranat F.

AloshkaD commented 7 years ago

@cocoza4 Sounds good! I wasn't talking about the datasets; I was asking about how gt_pascal.pkl was created, since the code for creating it is incomplete. If that's something you don't wish to share, I will create my own to test your code, as I want the flexibility to test more classes such as traffic lights. I want to see whether adding and removing classes contributes to the issue you are facing in the classification; I personally got very interested in that.

I'm in the process of testing your generator; something is not quite right, and I will keep you posted as soon as I'm done.

cocoza4 commented 7 years ago

@AloshkaD Thanks for reviewing my code. gt_pascal.pkl was provided in the original repo (see https://github.com/rykov8/ssd_keras); it contains annotations for the other dataset, and I didn't touch it. I just printed it in https://github.com/cocoza4/ssd_keras/blob/master/groundtruth_builder.ipynb to see how it is formatted. The GT pickle file for the Udacity datasets is ground-truth.pkl, which is what is used in SSD_training.ipynb.

I reran the whole file https://github.com/cocoza4/ssd_keras/blob/master/groundtruth_builder.ipynb and it created ground-truth.pkl successfully.

I'm now fixing groundtruth_builder.ipynb to include the background class in ground-truth.pkl as you suggested. I will retrain the model and let you know as soon as it's done.

thanks Peeranat F.

AloshkaD commented 7 years ago

@cocoza4 Sorry, I was referring to ground-truth.pkl and not gt_pascal.pkl. I ran groundtruth_builder.ipynb but dataset wasn't defined. Weird; if you are certain it's the updated one, I'll go ahead and take a second look tonight. Me too, I'll start training if groundtruth_builder.ipynb turns out to work after I add the background attribute.

AloshkaD commented 7 years ago

@cocoza4 OK, I found the dataset dictionary; I didn't see it the first time, sorry again about that. So I have modified the code to include background, here it is:

# assumes df (a pandas DataFrame of annotations) and lb (a fitted
# sklearn LabelBinarizer) are defined in earlier notebook cells
import numpy as np

dataset = dict()
for idx, row in df.iterrows():
    coords = [row.xmin, row.ymin, row.xmax, row.ymax]
    label_encoded = lb.transform([row.label]).ravel()
    label_encoded = np.insert(label_encoded,0, 0, axis=0)
    #print('label_encoded',label_encoded)
    val = np.array([np.hstack((coords, label_encoded))])

    current = dataset.get(row.frame)
    if current is None:
        dataset[row.frame] = val
    else:
        dataset[row.frame] = np.vstack((current, val))
cocoza4 commented 7 years ago

@AloshkaD OK. I'm now retraining everything. This time the learning details can be visualized with TensorBoard and are evaluated every epoch with mAP, precision, recall, F1, and loss metrics.

As you suggested, I added one more class for background to the annotations file (see with-background-ground-truth.pkl). However, in SSD_training.ipynb, 1 must still be added to the NUM_CLASSES variable, otherwise an error is raised.

I didn't add more pedestrian samples because I'm still stuck trying to transform Caltech's pedestrian dataset (http://www.vision.caltech.edu/Image_Datasets/CaltechPedestrians/) into a format similar to Udacity's, as the dataset is a video file :D

I have updated the repo. I will let you know when the training is done.

thanks Peeranat F.

AloshkaD commented 7 years ago

@cocoza4 Awesome! Yeah, you certainly have to add 1 to num_classes. I'm almost done uploading the code to AWS; will keep you posted. The only problem is that in my Keras version there is no precision, recall, or F1, so we won't be able to compare on those. I'll take a look at your mAP and implement it the same way you did to help us compare. I'll also take a look at the pedestrian dataset to see if I can help. Btw, were you able to get TensorBoard's image visualization working? I couldn't!

cocoza4 commented 7 years ago

No. It took me quite some time so I gave up :D Right now, I only plot learning details.

AloshkaD commented 7 years ago

@cocoza4 There is something wrong here: in SSD_training you have

# some constants
CLASSES = ['background', 'car', 'pedestrian']
# CLASSES = ['car', 'pedestrian']
NUM_CLASSES = 4  # 1 added to include background
input_shape = (300, 300, 3)

Background is already included in the list, hence you don't need to add 1. What do you think?

cocoza4 commented 7 years ago

Well, actually I need to. I tried NUM_CLASSES = 3, but in SSD_training I got an error somewhere about a number-of-columns mismatch. I think you always have to add 1 to NUM_CLASSES.

AloshkaD commented 7 years ago

@cocoza4 something isn't right here, I'll look it up!

AloshkaD commented 7 years ago

@cocoza4 So I'm testing your generator and the training script, and it does not show a progress bar. I know it's training from monitoring the GPU resources. Do you have the same issue?

cocoza4 commented 7 years ago

@AloshkaD See the learning details below (orange line for validation, blue line for training) [image]

The loss curves show a significant amount of overfitting. I didn't plot the training error for the rest of the metrics; I will implement that later.

I think there's something wrong with the background class. I'm not sure whether we should add background to the annotations or not because the validation loss this time (1.20) is a bit lower than the previous time (1.23).

But when I ran the prediction on the same image as before:

Definitely, there is something wrong. I will investigate this.

thanks Peeranat F.

AloshkaD commented 7 years ago

@cocoza4 Huh! This looks pretty ugly. Is the code that produced this output up to date on GitHub?

I'm working on the background issue too.

cocoza4 commented 7 years ago

Yes. The one on GitHub is the latest.

AloshkaD commented 7 years ago

@cocoza4 OK, I'm training the network and we should see the outcome tomorrow. Here is how I'm fitting, but better to look at the whole code when I upload it, "if it works" :)

# instantiating callbacks
learning_rate_schedule = LearningRateScheduler(scheduler)
model_names = (trained_models_filename)
model_checkpoint = ModelCheckpoint(model_names,
                                   monitor='val_loss',
                                   verbose=1,
                                   save_best_only=False,
                                   save_weights_only=False)
model_tensorboard = TensorBoard(log_dir='./logs/200-epoch/udacity',
                                histogram_freq=10,
                                batch_size=8,
                                write_graph=True,
                                write_images=True,
                                write_grads=True,
                                embeddings_freq=5,
                                embeddings_layer_names=['conv3_3', 'conv4_3'])

# training model with real-time data augmentation
model.fit_generator(image_generator.flow(mode='train'),
                    steps_per_epoch=int(len(train_keys) / batch_size),
                    epochs = num_epochs, verbose = 1,
                    callbacks=[model_checkpoint, learning_rate_schedule, model_tensorboard],
                    validation_data=image_generator.flow(mode='val'),
                    validation_steps=int(len(validation_keys) / batch_size))
AloshkaD commented 7 years ago

A very important thing that is driving your accuracy very low is the way the training and validation data are split. The way both you and I are doing it is by taking 20% out for validation, and that pool of images ends up coming from the bottom of the list, which holds the second part of the images. We should shuffle before we split; I'm not sure we are doing that already (a sketch is below). I'll do that when I get the results tomorrow and retrain. Btw, I forgot to implement mAP; I will add it tomorrow using the same code you have, for result comparison.
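
A minimal sketch of that shuffling (assuming gt is the dict loaded from the ground-truth pickle):

import numpy as np

keys = sorted(gt.keys())
np.random.shuffle(keys)                      # shuffle before splitting
num_train = int(round(0.8 * len(keys)))      # 80/20 train/validation split
train_keys, val_keys = keys[:num_train], keys[num_train:]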

AloshkaD commented 7 years ago

@cocoza4 Will you please include accuracy in your model when you train? It's always easier to look at.

cocoza4 commented 7 years ago

@AloshkaD Sure, next time I will. I already shuffled the dataset before the split.

[image]

cocoza4 commented 7 years ago

@AloshkaD it looks like you are using a different generator. Would you mind sharing it so I can have a look?

AloshkaD commented 7 years ago

@cocoza4 Sure, I'll upload it and send you the link within a few hours.

AloshkaD commented 7 years ago

@cocoza4 sorry about the delay, I'm working on it now

AloshkaD commented 7 years ago

@cocoza4 Here it is; look at the second part, running with the Udacity DB: https://github.com/AloshkaD/SSD_sandox/blob/master/src/train-udacity.ipynb

mslavescu commented 7 years ago

@cocoza4 and @AloshkaD Check my demo video; it would be good to try that SSD TensorFlow implementation as well. To me it looks pretty accurate, and it is even more accurate/reliable when I crop (not zoom) just the center of the image (in the video you see the detection/tracking on the full image):

SSD Tensorflow based car detection and tracking demo for OSSDC.org VisionBasedACC PS3/PS4 simulator https://m.youtube.com/watch?v=dqnjHqwP68Y

You should be able to reproduce it easily (even directly on YouTube videos) with this Python script or IPython notebook:

https://github.com/OSSDC/SSD-Tensorflow/commit/26fa7eea5155ac3989476936628e62be3d773b95

Let me know if you encounter any problems.

More references:

https://medium.com/@mslavescu/learn-by-example-f539ad814117

https://mobile.twitter.com/GTARobotics/status/855870333097836544

AloshkaD commented 7 years ago

@mslavescu Awesome, I'm looking at that now. Thanks for sharing. How did you predict the distance to the other cars and objects since the images for training weren't collected with a stereo cam?

AloshkaD commented 7 years ago

@cocoza4 The implementation didn't go very well for me and the accuracy is terrible. I guess I'm going to go back to the code and check what I did wrong. Any luck with your implementation?

cocoza4 commented 7 years ago

@mslavescu thanks. This is awesome. I will have a look at it. I would like to know how you predict the distance to the other cars too.

cocoza4 commented 7 years ago

@AloshkaD I'm still out of luck :( It seems to me we don't have to add a background class to the annotations, but we do have to add 1 to NUM_CLASSES in SSD_training. I will retrain with this setting and get back to you soon.

AloshkaD commented 7 years ago

@cocoza4 Me too; I'll do the same now and let you know by tomorrow.

cocoza4 commented 7 years ago

@AloshkaD I trained the model with 2 classes, car and pedestrian, and NUM_CLASSES = 3. The training was better than all previous settings in terms of validation loss.

The loss went down to 0.84 in epoch 34 and then it stopped for some reason.

The other evaluation metrics were not working; I probably did something wrong and will fix it for the next training run. The accuracy metric is the default implementation from Keras:

model.compile(optimizer=optim, metrics=['acc'],
              loss=MultiboxLoss(NUM_CLASSES, neg_pos_ratio=2.0).compute_loss)

I guess it would make more sense to implement the metric based on https://stats.stackexchange.com/questions/260430/average-precision-in-object-detection, but there is no TN in detection. Do you think it's OK to measure accuracy with the following formula?

accuracy = TP / (TP + FN + FP)

Something is wrong with the training because after the prediction the confidence is very low (see second column).

results [array([[ 1.        ,  0.02606382,  0.56925744,  0.49550152,  0.60946363,
         0.5651052 ],
       [ 1.        ,  0.02540159,  0.17697434,  0.49378216,  0.22310685,
         0.5980978 ],
       [ 2.        ,  0.02411435,  0.66298014,  0.47095406,  0.83421701,
         0.65039515],
       [ 1.        ,  0.02324589,  0.56523108,  0.48646542,  0.59661794,
         0.54539555],
       [ 1.        ,  0.02270815,  0.66298014,  0.47095406,  0.83421701,
         0.65039515],
       [ 2.        ,  0.02138992,  0.56523108,  0.48646542,  0.59661794,
         0.54539555],
       [ 2.        ,  0.02104601,  0.2785593 ,  0.48142755,  0.34146526,
         0.59746993],
       [ 1.        ,  0.02091883,  0.59156978,  0.4777073 ,  0.67640877,
         0.57571268],
       [ 2.        ,  0.02073174,  0.56925744,  0.49550152,  0.60946363,
         0.5651052 ],
       [ 2.        ,  0.01985011,  0.59156978,  0.4777073 ,  0.67640877,
         0.57571268],
       [ 1.        ,  0.01958299,  0.83595729,  0.43074661,  1.        ,
         0.57367831],
       [ 2.        ,  0.01904126,  0.17697434,  0.49378216,  0.22310685,
         0.5980978 ],
       [ 1.        ,  0.01804864,  0.95186067,  0.48493025,  1.        ,
         0.58777326],
       [ 1.        ,  0.01782688,  0.34980971,  0.48809597,  0.38301677,
         0.55947906],
       [ 1.        ,  0.01756271,  0.2785593 ,  0.48142755,  0.34146526,
         0.59746993],
       [ 2.        ,  0.01735239,  0.83595729,  0.43074661,  1.        ,
         0.57367831],
       [ 2.        ,  0.01654494,  0.20637967,  0.48206231,  0.29958427,
         0.61503088],
       [ 1.        ,  0.01642689,  0.22670071,  0.47663704,  0.30487698,
         0.61679626],
       [ 1.        ,  0.01599239,  0.08420084,  0.49176279,  0.23464306,
         0.66508156],
       [ 2.        ,  0.01591435,  0.90076268,  0.47401932,  0.99643862,
         0.58344311],
       [ 2.        ,  0.01543515,  0.37095672,  0.476096  ,  0.39959997,
         0.53231198],
       [ 1.        ,  0.01503676,  0.37095672,  0.476096  ,  0.39959997,
         0.53231198],
       [ 1.        ,  0.01436276,  0.56283349,  0.51697785,  0.59546703,
         0.57136697],
       [ 2.        ,  0.01367878,  0.52516413,  0.48711935,  0.54848564,
         0.52380419],
       [ 2.        ,  0.01357527,  0.08420084,  0.49176279,  0.23464306,
         0.66508156],
       [ 2.        ,  0.01317639,  0.14880188,  0.47889975,  0.20339857,
         0.52474433],
       [ 2.        ,  0.01305003,  0.54740047,  0.49965698,  0.56710339,
         0.54813772],
       [ 2.        ,  0.012943  ,  0.56283349,  0.51697785,  0.59546703,
         0.57136697],
       [ 1.        ,  0.01247648,  0.26458269,  0.47188008,  0.29670835,
         0.53062999],
       [ 2.        ,  0.01232502,  0.25783744,  0.47463086,  0.30864301,
         0.57865393],
       [ 1.        ,  0.01214787,  0.55362374,  0.48877209,  0.58150011,
         0.54162937],
       [ 1.        ,  0.01206071,  0.21383049,  0.47926959,  0.25187147,
         0.50693351],
       [ 2.        ,  0.01191773,  0.1862261 ,  0.48513433,  0.2282882 ,
         0.53223586],
       [ 2.        ,  0.0117978 ,  0.95186067,  0.48493025,  1.        ,
         0.58777326],
       [ 2.        ,  0.01172777,  0.55362374,  0.48877209,  0.58150011,
         0.54162937],
       [ 2.        ,  0.01165347,  0.49538136,  0.50496268,  0.51638091,
         0.5432148 ],
       [ 1.        ,  0.01133321,  0.64636588,  0.47693557,  0.70046723,
         0.5957914 ],
       [ 1.        ,  0.01121765,  0.31048429,  0.46146703,  0.36130112,
         0.54744411],
       [ 2.        ,  0.01098214,  0.34702232,  0.47309282,  0.3771269 ,
         0.53863573],
       [ 1.        ,  0.01095412,  0.57596916,  0.48762023,  0.62927955,
         0.54987335],
       [ 1.        ,  0.01094619,  0.54740047,  0.49965698,  0.56710339,
         0.54813772],
       [ 2.        ,  0.01087602,  0.57596916,  0.48762023,  0.62927955,
         0.54987335],
       [ 1.        ,  0.01076063,  0.1862261 ,  0.48513433,  0.2282882 ,
         0.53223586],
       [ 2.        ,  0.01053943,  0.58065063,  0.54847533,  0.6064828 ,
         0.5952515 ],
       [ 1.        ,  0.01052055,  0.38822186,  0.48003703,  0.41597652,
         0.52754802],
       [ 1.        ,  0.01048754,  0.90076268,  0.47401932,  0.99643862,
         0.58344311],
       [ 1.        ,  0.01047793,  0.25783744,  0.47463086,  0.30864301,
         0.57865393],
       [ 2.        ,  0.01044348,  0.29853067,  0.47309172,  0.35248682,
         0.57222486],
       [ 1.        ,  0.010319  ,  0.39585865,  0.49431914,  0.42199713,
         0.53557533],
       [ 2.        ,  0.01014104,  0.23518664,  0.50960892,  0.31748837,
         0.60015422],
       [ 1.        ,  0.01008049,  0.53069979,  0.49138734,  0.55163413,
         0.52916205],
       [ 1.        ,  0.01007849,  0.00416986,  0.50594187,  0.05029359,
         0.57190001]])]

I guess it's because of the annotations.

{'object-dataset/1478019952686311006.jpg': array([[ 0.49479167,  0.47833333,  0.52291667,  0.51666667,  0.        ],
        [ 0.91041667,  0.40166667,  0.946875  ,  0.62      ,  1.        ]]),
 'object-dataset/1478019953180167674.jpg': array([[ 0.45416667,  0.48833333,  0.48229167,  0.52666667,  0.        ]]),
 'object-dataset/1478019953689774621.jpg': array([[ 0.35729167,  0.47166667,  0.37916667,  0.515     ,  0.        ],
        [ 0.37291667,  0.48166667,  0.39791667,  0.51833333,  0.        ],
        [ 0.43020833,  0.48333333,  0.45833333,  0.52166667,  0.        ],
        [ 0.80208333,  0.40666667,  0.875     ,  0.50666667,  0.        ],
        [ 0.85729167,  0.415     ,  0.9625    ,  0.495     ,  0.        ]]),
 'object-dataset/1478019954186238236.jpg': array([[ 0.34479167,  0.46833333,  0.36979167,  0.51333333,  0.        ],
        [ 0.35729167,  0.48      ,  0.38020833,  0.52333333,  0.        ],
        [ 0.41770833,  0.485     ,  0.44583333,  0.52333333,  0.        ],
        [ 0.75416667,  0.395     ,  0.896875  ,  0.49666667,  0.        ],
        [ 0.90208333,  0.415     ,  0.99895833,  0.495     ,  0.        ]]),
...

The last column is always 0 or 1 indicating car and pedestrian respectively. I guess it should have been [1, 0] for car and [0, 1] for pedestrian. I will train it again and will get back to you soon.

Btw, in your repo I saw you use the COCO dataset as well. How did it go?

thanks Peeranat F.

cocoza4 commented 7 years ago

@AloshkaD OK. So I trained a model but this time the result was much better.

Considering the validation details above, the model performed very poorly. This is because the confidence threshold in EvaluationCallback in SSD_training,

bbox_util.detection_out(pred_prob, keep_top_k=50, confidence_threshold=0.7)

is 0.7, which is too low, so it causes too many FPs and FNs, as shown below.

If, however, the threshold is 0.9, the result is much better.

I remember that with the default weights provided in the SSD repo, the model was very confident, meaning a confidence threshold of 0.7 would yield only precise bounding boxes; the irrelevant bounding boxes were usually below 0.4 or 0.5. I have no idea why that's the case. Do you?

At this point, I'm thinking of exploring this repo: https://github.com/balancap/SDC-Vehicle-Detection. It's written in pure TF and the contributor has already incorporated a vehicle dataset. I think it would save me a lot of time and effort. I will let you know whether it works or not. So how are things at your end?

Btw, thanks a lot for your help all along :)

AloshkaD commented 7 years ago

@cocoza4 This is way better than what I'm getting. I've tried removing the background class but that didn't help. Is your repo up to date? I will try to build on what you have used and see why it isn't working for me.

I'm reading through the repo that you want to explore and will let you know what my thoughts are.

mslavescu commented 7 years ago

@AloshkaD @cocoza4 I just got my SSD TensorFlow implementation, for:

https://github.com/OSSDC/OSSDC-VisionBasedACC

working on Windows with the TensorFlow 1.2 GPU version.

I'll post more details there on how to install and use it in the next few days, along with more sample videos with live real-time processing in the car, live videos from games, and YouTube videos.

To use it with Docker on Linux, the instructions are there already.

My version is a fork of https://github.com/balancap/SSD-Tensorflow

cocoza4 commented 7 years ago

@AloshkaD Yes, my repo is up to date

AloshkaD commented 7 years ago

Thanks, @mslavescu. It feels like I'm almost there but something tiny is wrong and must be fixed; I want to know what is wrong so I can learn. I came here to help @cocoza4 but ended up getting dragged in by the current :) it's so tempting!
@cocoza4 @mslavescu I was wondering whether SSD can be implemented with polygon ground-truth annotations? I would like to hear your thoughts on that, guys. The first thing that comes to mind is converting the polygons into rectangles, but that is also challenging (a rough sketch of one way to do it is below).
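
The straightforward conversion I have in mind would be something like this (just a sketch):

import numpy as np

def polygon_to_bbox(polygon):
    # polygon: iterable of (x, y) vertices; return the axis-aligned box enclosing it
    pts = np.asarray(polygon, dtype=float)
    xmin, ymin = pts.min(axis=0)
    xmax, ymax = pts.max(axis=0)
    return [xmin, ymin, xmax, ymax]

but of course an axis-aligned rectangle can be much looser than the original polygon, which is part of the challenge.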

AloshkaD commented 7 years ago

So @cocoza4, to answer your questions: accuracy = TP / (TP + FN + FP) should work, of course. You can define and use it just like I did in other repos on my GitHub. I didn't end up implementing COCO; instead I implemented the VOC dataset, which worked well, but I had the same issue you had with the confidence_threshold. As I pointed out earlier, it is substantial and a trade-off between FP and FN. For me the provided weights also worked much better, even though I trained the network until there was no improvement for 30 epochs. I'm the one who should thank you; I've learned new things from you. Let's keep this going, it's a learning opportunity. I have started working on Udacity's SDC Term 3 segmentation assignment, and although the network used there is different, it is always good to implement different methods.

AloshkaD commented 7 years ago

@cocoza4 I see that your training accuracy never went beyond 0.3. Is that true?

cocoza4 commented 7 years ago

@AloshkaD Yes, it's true. I don't get it either :( I used the default implementation in Keras.

model.compile(optimizer=optim, metrics=['acc'],
              loss=MultiboxLoss(NUM_CLASSES, neg_pos_ratio=2.0).compute_loss)

The repo I sent you earlier uses KITTI data; I might try this dataset too if possible. The segmentation assignment already? That's fast. I will catch up :)