pjkirsch / gtsrb

Traffic Sign Recognition with Convolutional Neural Networks.

Approach for the Traffic Sign Recognition challenge #1

Open pjkirsch opened 9 years ago

pjkirsch commented 9 years ago

Here is the approach I chose for the Traffic Sign Recognition challenge. It is mostly inspired by the approach suggested in paper A: "Traffic Sign Recognition with Multi-Scale Convolutional Networks", by Pierre Sermanet and Yann LeCun.

SGD is used for optimization.

1) Data preparation: follow the same approach as "paper A" for preprocessing and validation set extraction.
2) Train a toy MLP (no convolutional layer) with images directly used as input features, one hidden layer and a softmax layer, to check the scripts.
EDIT: 2bis) Check the assumption from "paper A" that color information does not provide any improvement. If verified, keep only greyscale images for further tests.
3) Design a small Convolutional Neural Network.
4) Tune the learning rate.
5) Increase the number of kernels and see the influence on results.
6) Tune optimization hyperparameters.
7) Go deeper (add convolutional lower layers, then fully-connected higher layers).
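As a rough illustration of steps 1) and 2), here is a minimal sketch. PyTorch/NumPy are used purely for illustration, the Y-channel conversion and global normalization are simplifications of the "paper A" preprocessing, and all names are placeholders:

```python
# Sketch of steps 1-2: greyscale (Y channel) input, one hidden layer,
# softmax output. Framework and preprocessing details are assumptions.
import numpy as np
import torch
import torch.nn as nn

def to_y_channel(rgb):
    """Convert an HxWx3 uint8 RGB image to a globally normalized Y (luma) channel."""
    rgb = rgb.astype(np.float32) / 255.0
    y = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]
    return (y - y.mean()) / (y.std() + 1e-8)

class ToyMLP(nn.Module):
    """'mlp-toy': flattened 32x32 image -> 30 hidden units -> 43 classes."""
    def __init__(self, n_hidden=30, n_classes=43):
        super().__init__()
        self.hidden = nn.Linear(32 * 32, n_hidden)
        self.out = nn.Linear(n_hidden, n_classes)

    def forward(self, x):                      # x: (batch, 1, 32, 32)
        h = torch.tanh(self.hidden(x.flatten(1)))
        return self.out(h)                     # logits; softmax is applied inside the loss
```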

Potential ways of improvement:

pjkirsch commented 9 years ago

Report of first results:

All experiments were done using standard SGD with an initial learning rate of 1e-3 and a learning rate decay of 1e-7.
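The exact decay schedule is implementation-dependent; a common convention (assumed here, not stated above) is to divide the initial rate by (1 + decay * t) at update t:

```python
# Assumed decay schedule: lr_t = lr_0 / (1 + decay * t), applied per gradient update t.
lr0, decay = 1e-3, 1e-7

def learning_rate(t):
    return lr0 / (1.0 + decay * t)

print(learning_rate(0))        # 1e-3 at the first update
print(learning_rate(100_000))  # ~9.9e-4 after 100k updates
```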

Note: The "official score" is the accuracy achieved by the network on the test set at the epoch with the best score on the validation set. The "best score" is the best accuracy achieved by the network on the test set over all epochs.
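In other words, assuming per-epoch validation and test accuracy lists are recorded during training (names below are placeholders), the two metrics are:

```python
def official_and_best_score(val_acc, test_acc):
    # Epoch with the best validation accuracy decides the "official" test score.
    best_val_epoch = max(range(len(val_acc)), key=lambda e: val_acc[e])
    official = test_acc[best_val_epoch]   # test accuracy at the best-validation epoch
    best = max(test_acc)                  # best test accuracy over all epochs
    return official, best
```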

Exp1: mlp-toy1
Input: 32x32 YUV images
Hidden layer 1: 30 tanh units
Output layer: 43 softmax classifier
Cost function: log-likelihood
Official score: 71.78% after 8 epochs
Best score: 85.71% after 9 epochs

Exp2: mlp-toy2
Input: 32x32 Y images (grey-scale)
Hidden layer 1: 30 ReLU units
Output layer: 43 softmax classifier
Cost function: log-likelihood
Official score: 87.40% after 19 epochs
Best score: 87.40% after 19 epochs

Exp3: cnn1
Input: 32x32 Y images (grey-scale)
Hidden layer 1: Conv. 8 kernels of size 1x5x5, tanh units, max-pooling
Hidden layer 2: Conv. 8 kernels of size 8x3x3, tanh units, max-pooling
Hidden layer 3: 30 ReLU units
Output layer: 43 softmax classifier
Cost function: log-likelihood
Official score: 94.11% after 28 epochs
Best score: ==> 94.41% after 27 epochs <==
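For reference, a minimal sketch of cnn1 as described above. PyTorch is used only for illustration; 'valid' convolutions and 2x2 max-pooling are assumptions, since only kernel counts and sizes are given:

```python
import torch
import torch.nn as nn

class CNN1(nn.Module):
    """Sketch of 'cnn1': two conv/pool stages, a 30-unit ReLU layer, 43-way softmax."""
    def __init__(self, n_classes=43):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=5), nn.Tanh(), nn.MaxPool2d(2),  # 32x32 -> 28x28 -> 14x14
            nn.Conv2d(8, 8, kernel_size=3), nn.Tanh(), nn.MaxPool2d(2),  # 14x14 -> 12x12 -> 6x6
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(8 * 6 * 6, 30), nn.ReLU(),
            nn.Linear(30, n_classes),            # softmax handled by the loss below
        )

    def forward(self, x):                        # x: (batch, 1, 32, 32)
        return self.classifier(self.features(x))

model = CNN1()
opt = torch.optim.SGD(model.parameters(), lr=1e-3)  # per-update decay would need a scheduler, omitted here
loss_fn = nn.CrossEntropyLoss()                     # log-likelihood cost over the softmax output
```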

Exp4: cnn2 (still running...)
Input: 32x32 Y images (grey-scale)
Hidden layer 1: Conv. 16 kernels of size 1x5x5, tanh units, max-pooling
Hidden layer 2: Conv. 8 kernels of size 8x3x3, tanh units, max-pooling
Hidden layer 3: 30 ReLU units
Output layer: 43 softmax classifier
Cost function: log-likelihood
Official score: 93.70% after 19 epochs
Best score: 93.84% after 17 epochs

According to the preliminary results from Exp4, more kernels in the first hidden layer do not provide much improvement.
Hypothesis 1: HL2 is too small to take advantage of a bigger HL1 --> provide more kernels to HL2
Hypothesis 2: the kernels learnt in HL1 are redundant --> try the Dropout technique
Hypothesis 3: too much overfitting --> try Dropout + other regularizations (L2, ...)

Note: Experiments on cnn1 to tune hyperparameters indicate that an initial learning rate of 1e-2 would be better (faster convergence).

pjkirsch commented 9 years ago

Results after a weekend of long computations:

Exp5: cnnDropOut1
Input: 32x32 Y images (grey-scale)
Hidden layer 1: Conv. 16 kernels of size 1x5x5, tanh units, max-pooling
Hidden layer 2: Conv. 16 kernels of size 8x3x3, tanh units, max-pooling
Hidden layer 3: 60 ReLU units
Output layer: 43 softmax classifier
Drop-out: before each hidden layer, 0.5 drop-out probability
Cost function: log-likelihood
Optimization: learning rate 1e-3, decay 1e-7, no momentum
Official score: 89.78% after 5 epochs
Best score: 89.78% after 5 epochs
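A minimal sketch of cnnDropOut1, again with PyTorch purely for illustration. 'Valid' convolutions, 2x2 pooling, and the second conv layer taking all 16 first-layer feature maps are assumptions:

```python
import torch
import torch.nn as nn

class CNNDropOut1(nn.Module):
    """Sketch of 'cnnDropOut1': dropout (p=0.5) before each hidden layer."""
    def __init__(self, n_classes=43, p=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Dropout(p),                                                 # before hidden layer 1
            nn.Conv2d(1, 16, kernel_size=5), nn.Tanh(), nn.MaxPool2d(2),   # 32x32 -> 28x28 -> 14x14
            nn.Dropout(p),                                                 # before hidden layer 2
            nn.Conv2d(16, 16, kernel_size=3), nn.Tanh(), nn.MaxPool2d(2),  # 14x14 -> 12x12 -> 6x6
            nn.Flatten(),
            nn.Dropout(p),                                                 # before hidden layer 3
            nn.Linear(16 * 6 * 6, 60), nn.ReLU(),
            nn.Linear(60, n_classes),
        )

    def forward(self, x):
        return self.net(x)

# Exp5 settings (no momentum); Exp7/8 add momentum=0.9 to the same architecture.
model = CNNDropOut1()
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
```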

Exp6: cnnDropOut1
Input: 32x32 Y images (grey-scale)
Hidden layer 1: Conv. 16 kernels of size 1x5x5, tanh units, max-pooling
Hidden layer 2: Conv. 16 kernels of size 8x3x3, tanh units, max-pooling
Hidden layer 3: 60 ReLU units
Output layer: 43 softmax classifier
Drop-out: before each hidden layer, 0.5 drop-out probability
Cost function: log-likelihood
Optimization: learning rate 1e-2, decay 1e-7, no momentum
Official score: 71.40% after 1 epoch
Best score: 71.40% after 1 epoch

Exp7: cnnDropOut1
Input: 32x32 Y images (grey-scale)
Hidden layer 1: Conv. 16 kernels of size 1x5x5, tanh units, max-pooling
Hidden layer 2: Conv. 16 kernels of size 8x3x3, tanh units, max-pooling
Hidden layer 3: 60 ReLU units
Output layer: 43 softmax classifier
Drop-out: before each hidden layer, 0.5 drop-out probability
Cost function: log-likelihood
Optimization: learning rate 1e-3, decay 1e-7, with momentum 0.9
Official score: 95.27% after 35 epochs
Best score: 95.69% after 43 epochs

Exp8: cnnDropOut1
Input: 32x32 Y images (grey-scale)
Hidden layer 1: Conv. 16 kernels of size 1x5x5, tanh units, max-pooling
Hidden layer 2: Conv. 16 kernels of size 8x3x3, tanh units, max-pooling
Hidden layer 3: 60 ReLU units
Output layer: 43 softmax classifier
Drop-out: before each hidden layer, 0.5 drop-out probability
Cost function: log-likelihood
Optimization: learning rate 1e-3, decay 1e-6, with momentum 0.9
Official score: 95.04% after 53 epochs
Best score: 95.22% after 52 epochs

Exp9: cnnDropOut2
Input: 32x32 Y images (grey-scale)
Hidden layer 1: Conv. 32 kernels of size 1x5x5, tanh units, max-pooling
Hidden layer 2: Conv. 32 kernels of size 8x3x3, tanh units, max-pooling
Hidden layer 3: 60 ReLU units
Output layer: 43 softmax classifier
Drop-out: before each hidden layer, 0.6 drop-out probability
Cost function: log-likelihood
Optimization: learning rate 1e-3, decay 1e-7, with momentum 0.9
Official score: 96.09% after 85 epochs
Best score: ==> 96.52% after 93 epochs <==

Notes: