So, I tried to train from scratch without using pre-trained weights, and after finishing the training it is clear to me that this is not an option, but I would like to double-check.
Correct, trying to train SSD300 or SSD512 from scratch without loading the trained VGG-16 weights is not going to work (at least not with the standard random initialization approaches). Note that the original VGG models were not trainable all at once; the VGG creators had to train them in stages (i.e. layer by layer).
In case I want to replace VGG16 with another base network, let's say ResNet, in this case I would have to train from scratch without any pre-trained weights, which will not lead to a satisfying mAP.
Two comments on this statement:
- You're making an incorrect inference here. You can't make the same assumptions for two completely different network architectures like VGG-16 and ResNet. Meaning: just because training a VGG-based SSD model from scratch doesn't work without pre-trained VGG weights doesn't imply that the same is true for a ResNet-based SSD model.
- You're confusing two things here: the main reason why loading pre-trained VGG-16 weights into SSD300 or SSD512 is necessary is not that you wouldn't get a "satisfying mAP" otherwise, but that the training wouldn't work at all, i.e. it wouldn't converge at all.
Let's compare VGG and ResNet as an example: The reason why training SSD300 from scratch without loading VGG weights doesn't work is because VGG is too deep to be trainable all at once for the rather "primitive" architecture that it is. This is exactly the reason why the ResNet authors designed ResNet. The whole point of ResNet was to design networks in a way such that they are still trainable end-to-end even if they are very deep. The two main things that ResNet has in this regard which VGG doesn't have are batch normalization and, more importantly, ResNet's "shortcut connections". A ResNet-based SSD can probably be trained end-to-end from random initialization, while a VGG-based SSD cannot.
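To make the shortcut-connection idea concrete, here is a minimal sketch of a ResNet-style residual block in Keras (my own illustration, not code from this repository):

```python
# Minimal sketch of a ResNet-style residual block (illustrative only).
# The shortcut connection adds the block's input to its output, so gradients
# can flow around the convolutions; batch normalization further stabilizes
# training. This combination is what keeps very deep networks trainable.
from keras.layers import Conv2D, BatchNormalization, Activation, Add

def residual_block(x, filters):
    shortcut = x  # identity shortcut; assumes `filters` matches x's channel count
    y = Conv2D(filters, (3, 3), padding='same')(x)
    y = BatchNormalization()(y)
    y = Activation('relu')(y)
    y = Conv2D(filters, (3, 3), padding='same')(y)
    y = BatchNormalization()(y)
    y = Add()([shortcut, y])  # the shortcut connection
    return Activation('relu')(y)
```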
More generally, if whatever base network architecture you are considering can be trained end-to-end from a randomly initialized starting point, then the same is probably true for an SSD model based on that architecture (since SSD adds only a few layers on top of the base network). If, however, the base network architecture cannot be trained end-to-end from random initialization (as is the case with VGG), then the same is certainly true for an SSD based on it.
The main take-aways are these:
- Whether or not it works to train a given neural network end-to-end from scratch without any pre-trained weights obviously depends on the network. It cannot be said in general.
- You can get pretty much any relevant model with trained ImageNet weights, so the question of whether or not an SSD based on a given model can be trained without pre-trained weights is secondary.
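As an illustration of the second point, here is a minimal sketch of pulling an ImageNet-pretrained base network from `keras.applications` to build a detector on top of (a generic example, not this repository's API):

```python
# Sketch: obtaining an ImageNet-pretrained base network in Keras.
# `include_top=False` drops the classification head so the convolutional
# feature maps can serve as the base of a detector such as SSD.
from keras.applications import ResNet50

base = ResNet50(weights='imagenet', include_top=False,
                input_shape=(300, 300, 3))
base.summary()  # the detector's extra layers would be built on these feature maps
```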
Hi Pierluigi Ferrari,
Thank you for explaining.
I am sharing my experience of using your SSD model for two different use cases, with two different ways of training.

Use Case 1: Used your model for parking-sign detection with the VGG base and the pre-trained weights provided; it performs well, detecting parking signs with 85% accuracy. The input is standard RGB images (300, 300, 3).

Use Case 2: Used your model for bounding-box aggregation with the VGG base but no pre-trained weights, because the input size is different from the default: the input is (300, 300, 4), where the fourth channel is a weighted average of bounding boxes. I couldn't use the existing pre-trained weights because the input dimension is different. The dataset for this use case is the training and validation set of COCO. I also got ~85% accuracy for bounding-box aggregation with this SSD model.

Can you comment on this? Is there a better way to use your model for bounding-box aggregation when the input has a different number of channels (either 1 or 4 instead of 3)?
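(For what it's worth, one common workaround for the channel mismatch, sketched below under my own assumptions, is to copy the pre-trained RGB kernels into the new first conv layer and initialize only the extra fourth channel, so the rest of the pre-trained weights can still be loaded normally. The layer name is a hypothetical example:)

```python
# Sketch (not from this repo): adapting pre-trained 3-channel conv weights
# to a 4-channel input. `layer_name` is a hypothetical example; `pretrained_model`
# and `new_model` are assumed to be already-built Keras models.
import numpy as np

def expand_first_conv(pretrained_model, new_model, layer_name='conv1_1'):
    # Pre-trained kernel has shape (kh, kw, 3, filters); the new layer
    # expects (kh, kw, 4, filters).
    kernel, bias = pretrained_model.get_layer(layer_name).get_weights()
    kh, kw, _, n_filters = kernel.shape
    new_kernel = np.zeros((kh, kw, 4, n_filters), dtype=kernel.dtype)
    new_kernel[:, :, :3, :] = kernel               # reuse the RGB kernels
    new_kernel[:, :, 3, :] = kernel.mean(axis=2)   # e.g. seed the 4th channel with the RGB mean
    new_model.get_layer(layer_name).set_weights([new_kernel, bias])
```

All remaining layers could then be filled with `new_model.load_weights(weights_path, by_name=True)`.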
Best, -humayun
Perfect! Thank you @pierluigiferrari for your valuable comments. Now I feel that I'm getting it.
OK, after two weeks of struggling (I'm still new to this problem), I want to start fresh, and I hope you have time to bear with me on this, as everybody here wants to hit 77.4% mAP (or whatever their target score is) before taking the next step. I trained on the 07+12 trainval set using VGG_ILSVRC_16_layers_fc_reduced.h5, but I still don't see convergence, and I'm now at epoch 65 (maybe it's still too early to ask). The last validation loss improvement was recorded at epoch 34 (val_loss = 8.2286). I want to ask about the correct settings to use in ssd300_train, for instance for the data generation or box clipping, to hit 77.4%, so I can use that as a starting benchmark for comparison for whatever I want to do later.
This is what's in SSD300_train:
```python
ssd_data_augmentation = SSDDataAugmentation(img_height=img_height, img_width=img_width, background=mean_color)
```
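For reference, the training setup I understand to be the standard one for reproducing the paper's 77.4% result looks roughly like the sketch below. The loss and schedule values follow the original SSD paper; the `SSDLoss` import path is this repository's to the best of my knowledge, so treat the exact identifiers as assumptions:

```python
# Sketch of the standard SSD300 Pascal VOC 07+12 training setup.
# `model` is assumed to be the SSD300 model built beforehand with the
# pre-trained VGG weights already loaded.
from keras.optimizers import SGD
from keras.callbacks import LearningRateScheduler
from keras_loss_function.keras_ssd_loss import SSDLoss

ssd_loss = SSDLoss(neg_pos_ratio=3, alpha=1.0)  # 3:1 hard negative mining, as in the paper
sgd = SGD(lr=0.001, momentum=0.9, decay=0.0, nesterov=False)
model.compile(optimizer=sgd, loss=ssd_loss.compute_loss)

# The paper drops the learning rate by 10x twice over the course of training
# (at 80k and 100k iterations, mapped here onto epochs as an approximation).
def lr_schedule(epoch):
    if epoch < 80:
        return 0.001
    elif epoch < 100:
        return 0.0001
    else:
        return 0.00001

callbacks = [LearningRateScheduler(lr_schedule)]
```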
Thanks,
I think the issue comes from using Python 2, as you mentioned in a comment on issue #160. I will try to reproduce the results with Python 3 and see if that is the cause.
These are the results from SSD300 out of the box. I downloaded the code and ran it again; I think I had an old version, I don't know.
| Class | AP |
| --- | --- |
| aeroplane | 0.774 |
| bicycle | 0.839 |
| bird | 0.737 |
| boat | 0.642 |
| bottle | 0.449 |
| bus | 0.847 |
| car | 0.842 |
| cat | 0.864 |
| chair | 0.559 |
| cow | 0.785 |
| diningtable | 0.741 |
| dog | 0.822 |
| horse | 0.842 |
| motorbike | 0.828 |
| person | 0.762 |
| pottedplant | 0.505 |
| sheep | 0.743 |
| sofa | 0.763 |
| train | 0.856 |
| tvmonitor | 0.760 |
| **mAP** | **0.748** |
Thank you for sharing this crystal-clear implementation. I have a question regarding the VGG16 base network, which may not be related to the implementation itself, but I'd like to share knowledge.
So, I tried to train from scratch without using pre-trained weights, and after finishing the training it is clear to me that this is not an option, but I would like to double-check. I have to use pre-trained weights and stick with the VGG16-based SSD structure.
In case I want to replace VGG16 with another base network, let's say ResNet, I would have to train from scratch without any pre-trained weights, which will not lead to a satisfying mAP. So what are my options here:
If I don't want to use VGG16 and instead design my own network like your ssd7, using the same classification layers, what are the possible ways to do that?
In case I want to use VGG16 as a base network but modify the classification layers with a different convolutional structure, how can I extract the weights up to, let's say, fc6 or fc7 and load them into my network?
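One common way to do this in Keras, sketched below as my own illustration, is to load the weight file with `by_name=True`, which copies weights only into layers whose names match, so everything up to fc6/fc7 gets the pre-trained values while renamed or new layers keep their fresh initialization:

```python
# Sketch: loading pre-trained weights only into the layers that exist in
# both models. `model` is your modified network; layers you renamed or
# added after fc7 are skipped and keep their random initialization.
model.load_weights('VGG_ILSVRC_16_layers_fc_reduced.h5', by_name=True)

# Alternatively, copy weights layer by layer up to a cut-off point.
# `pretrained_model` is assumed to hold the original VGG-based weights.
for layer in model.layers:
    if layer.name == 'fc6':  # stop before the layers you replaced
        break
    try:
        source = pretrained_model.get_layer(layer.name)
        layer.set_weights(source.get_weights())
    except ValueError:
        pass  # layer not present in the pre-trained model
```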
If I want to design an entirely new network structure, do I have to train it first as a classifier (i.e. on a classification problem, before adding the detection layers) on a large dataset like ImageNet, and then attach the classification layers to it and see how the detection performs? I'm asking this because I'm wondering how YOLO was able to achieve a high mAP without having a pre-trained base network!
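To make that workflow concrete, it usually looks roughly like the sketch below (my own generic illustration, not this repository's code): pretrain a backbone as a classifier, discard its classification head, and build the detection predictors on the remaining feature maps, so the trained backbone weights carry over:

```python
# Generic sketch of the pretrain-then-detect workflow (illustrative only).
from keras.layers import Input, Conv2D, GlobalAveragePooling2D, Dense
from keras.models import Model

# 1) Build and pretrain a backbone as an ImageNet-style classifier.
inputs = Input(shape=(300, 300, 3))
x = Conv2D(64, (3, 3), padding='same', activation='relu', name='block1_conv')(inputs)
# ... more backbone layers ...
features = Conv2D(256, (3, 3), padding='same', activation='relu', name='feature_map')(x)
clf = GlobalAveragePooling2D()(features)
clf = Dense(1000, activation='softmax')(clf)
classifier = Model(inputs, clf)
# classifier.fit(...) on the classification dataset here.

# 2) Reuse the trained backbone's feature maps and attach detection predictors.
n_boxes, n_classes = 4, 21  # example values
conf = Conv2D(n_boxes * n_classes, (3, 3), padding='same', name='conf')(features)
loc = Conv2D(n_boxes * 4, (3, 3), padding='same', name='loc')(features)
detector = Model(inputs, [conf, loc])  # shares the pretrained backbone layers
```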