msyim / TensorFlowFun

Me playing with TF

MNIST log #3

msyim opened this issue 7 years ago

msyim commented 7 years ago

For some reason, BN (batch normalization) is not really helping: in fact, the NN using BN performs worse than the ones not using it. Will investigate further.
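Not confirmed from this log, but a common TF1 pitfall that makes BN look worse than no BN is forgetting the moving-average update ops (and the training flag) when using `tf.layers.batch_normalization`. A minimal sketch of the usual pattern; layer sizes and learning rate here are assumptions, not the values used in this repo:

```python
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 784])
y = tf.placeholder(tf.float32, [None, 10])
is_training = tf.placeholder(tf.bool)  # True during training, False at eval time

h = tf.layers.dense(x, 512)
h = tf.layers.batch_normalization(h, training=is_training)  # BN before the nonlinearity
h = tf.nn.relu(h)
logits = tf.layers.dense(h, 10)

loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=logits))

# BN keeps moving averages of the batch statistics in UPDATE_OPS; if the train op
# does not depend on them, test-time BN runs on stale statistics and accuracy drops.
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)
```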

msyim commented 7 years ago

Performance boost history:

msyim commented 7 years ago

Found out that this ensemble network has trouble differentiating 3's from 8's (and vice versa) and 7's from 9's (and vice versa). I will train two separate networks which specialize in differentiating those pairs and have them verify the prediction whenever the ensemble's output is one of the aforementioned four digits.
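A rough sketch of how that verification step could be wired up; `ensemble` and the specialist callables are hypothetical placeholders for the trained models, not code from this repo:

```python
# Defer to a binary "specialist" whenever the ensemble predicts a confusable digit.
CONFUSABLE_PAIR = {3: "3_vs_8", 8: "3_vs_8", 7: "7_vs_9", 9: "7_vs_9"}

def predict_with_specialists(image, ensemble, specialists):
    pred = ensemble(image)              # assumed to return an int in 0..9
    pair = CONFUSABLE_PAIR.get(pred)
    if pair is not None:
        pred = specialists[pair](image)  # specialist confirms or overrides the call
    return pred
```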

msyim commented 7 years ago

Apparently "going deeper" gives a better result:

1 FFNN with 4 hidden layers of 512 nodes each: ~98.2%

However, using the above architecture turned out to be problematic with the way I had been initializing the parameters and the optimizer I had been using: for the models tested earlier, I used "random normal" to initialize the weights and "GradientDescent" as the optimizer, which resulted in a "nan" cost.

  1. I first changed the optimizer to "Adam." The cost didn't explode, but the model did not seem to train beyond a certain point. After reaching some 97% accuracy, the model suddenly decides to become "dumb" and quickly changes its parameters to predict one number for all cases. The accuracy drops rapidly, but for some reason the loss stays low.
  2. Then I changed the optimizer to "Adagrad." The weights did not explode, nor did the model become dumb, but training took very long. Towards the end of 150 epochs, the model was slowly approaching 94% accuracy.
  3. Finally, I changed the optimizer back to "Adam" and switched the weight initialization to the "xavier initializer." It seemed like the perfect combo to use: I reached 98.75% accuracy with an ensemble of 3 "deep" networks (still working on it; see the sketch below).
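A minimal TF1-style sketch of that final combination (4 hidden layers of 512 units, Xavier initialization, Adam); the learning rate and loss wiring are assumptions:

```python
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 784])
y = tf.placeholder(tf.float32, [None, 10])

# Xavier-initialized weights for every layer (tf.contrib.layers in TF1).
xavier = tf.contrib.layers.xavier_initializer()

h = x
for _ in range(4):  # 4 hidden layers of 512 units
    h = tf.layers.dense(h, 512, activation=tf.nn.relu, kernel_initializer=xavier)
logits = tf.layers.dense(h, 10, kernel_initializer=xavier)

loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=logits))
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)  # Adam instead of plain GradientDescent
```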

NOTES: need to study what each optimizer does, how the xavier initializer differs from other initializers, and when to use which optimizer and initializer.