Debug gradient; K = 10; Iter = 10,000; alpha = 0.2; HI = 10
K = 200; Iter = 50,000; alpha = 0.15
K = 200; Iter = 100,000; alpha = 0.15; HI = 30
K = 200; Iter = 100,000; alpha = 0.15; HI = 100
K = 100; alpha = 0.15; Iter = 100,000; HI = 200
Clearly overfitting
K = 100; alpha = 0.15; Iter = 100,000; HI = 200, 100
32 threads for multiplication: 5x speedup.
Consider applying parallelism to addition as well -> unlikely to be useful according to the profile.
Single future + atomic + no vector => 22.5s (229s): 10x
Vector future => 24.5s (293.5s): 12x
This also shows that perf does not tell us the full picture, but it is right that this part of the optimization is not the highest-order bit.
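For reference, a minimal sketch of the kind of row-partitioned parallel multiply described above, using a vector of futures. The `Matrix` type and accessors here are illustrative, not the repo's actual types; each worker owns a disjoint block of output rows, so no atomics are needed on the output.

```cpp
// Sketch only: row-partitioned parallel matrix multiply with std::async.
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <thread>
#include <vector>

struct Matrix {
  int rows = 0, cols = 0;
  std::vector<double> data;  // row-major: data[r * cols + c]
  Matrix(int r, int c) : rows(r), cols(c), data(std::size_t(r) * c, 0.0) {}
  double& at(int r, int c) { return data[std::size_t(r) * cols + c]; }
  double at(int r, int c) const { return data[std::size_t(r) * cols + c]; }
};

Matrix parallelMultiply(const Matrix& a, const Matrix& b,
                        int nThreads = std::max(1u, std::thread::hardware_concurrency())) {
  assert(a.cols == b.rows);
  Matrix out(a.rows, b.cols);

  auto worker = [&](int rowBegin, int rowEnd) {
    // Each worker writes a disjoint block of output rows: no locking needed.
    for (int i = rowBegin; i < rowEnd; ++i) {
      for (int k = 0; k < a.cols; ++k) {
        const double aik = a.at(i, k);
        for (int j = 0; j < b.cols; ++j) {
          out.at(i, j) += aik * b.at(k, j);
        }
      }
    }
  };

  std::vector<std::future<void>> futures;
  const int block = (a.rows + nThreads - 1) / nThreads;
  for (int begin = 0; begin < a.rows; begin += block) {
    const int end = std::min(begin + block, a.rows);
    futures.push_back(std::async(std::launch::async, worker, begin, end));
  }
  for (auto& f : futures) {
    f.get();  // wait for completion and propagate any exception
  }
  return out;
}
```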
HI = 800. Capping during matrix multiplication.
i=0 loss: 46.642 i=0 test error rate=83.56% Floating point exception (core dumped)
Cannot converge with HI = 800, which looks like a numerical stability problem.
model arch: FC [800 ] learning rate = 0.05 iterations=100000 batch=128 Error rate is 4.26%
real 168m28.781s user 3436m12.517s sys 4m16.970s
i=0 loss: 51.7491 i=0 test error rate=92.47% i=1000 loss: 1.31994 i=2000 loss: 0.713257 i=3000 loss: 0.53463 i=4000 loss: 0.366493 i=5000 loss: 0.309956 i=5000 test error rate=5.56%
model arch: FC [800 ] learning rate = 0.01 iterations=100000 batch=128 Error rate is 5.19%
real 73m20.685s user 881m50.573s sys 4m57.976s
fc-layer W gradient ratio: 4.17(1.43%); B gradient ratio: 25.89(0.01%) fc-layer W gradient ratio: 3.83(0.55%); B gradient ratio: 6.30(0.03%) fc-layer W gradient ratio: 3.83(0.52%); B gradient ratio: 3.88(0.05%) l2_regularizer gradient ratio: 4.17(0.05%) 3.83(0.05%) 3.83(0.05%) model arch: FC [200 100 ] learning rate = 0.05 L2(0.005) iterations=100000 batch=128 Error rate is 3.83%
Why would total loss keep increasing?
Why would the regularization cost keep increasing after some point?
Why didn't we drive down the regularization cost more aggressively given it's the dominant term?
Mathematically this does not seem plausible.
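One way to reason about this (assuming the usual convention R(w) = λ‖w‖², not necessarily the exact one used here): the decay term alone always shrinks the weights, so the regularization cost can only keep growing if the data-loss gradient pushes the weights outward faster than the decay pulls them in, which can happen while the network is still fitting the data.

```latex
% SGD step with L2 regularization, assuming R(w) = \lambda \lVert w \rVert^2:
w_{t+1} = w_t - \alpha \left( \nabla L_{\mathrm{data}}(w_t) + 2\lambda w_t \right)
        = (1 - 2\alpha\lambda)\, w_t - \alpha\, \nabla L_{\mathrm{data}}(w_t)
```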
Another example to study convergence under HI = 800.
It's very fishy that the loss gradient can be in the thousands.
Why would the gradients be so big compared to the regularizer gradient, given that the regularization term is dominating?
Initial W norm was already big:
Maybe we should try something like Gaussian fill so the initial W norm is much closer to 0?
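A minimal sketch of the kind of Gaussian fill suggested here; the 1/sqrt(fanIn) scale is a common heuristic, not necessarily what this repo would use.

```cpp
// Sketch: fill a weight buffer with small Gaussian values so the initial
// norm of W starts near 0. The 1/sqrt(fanIn) scale is a common heuristic,
// not necessarily the one used in this repo.
#include <cmath>
#include <random>
#include <vector>

void gaussianFill(std::vector<double>& w, int fanIn, unsigned seed = 42) {
  std::mt19937 gen(seed);
  std::normal_distribution<double> dist(0.0, 1.0 / std::sqrt(double(fanIn)));
  for (auto& x : w) {
    x = dist(gen);
  }
}
```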
There is no evidence that the w.r.g computation is incorrect (there are inaccuracies in the w.r.g computation in debug mode, e.g. when w is very small; however, there are aspects I don't fully understand, e.g. why the w gradients do not match when w is initialized in the "1000" range).
This is still problematic: ((0.002+0.001)^2 - (0.002-0.001)^2) / 0.002 = 0.004
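For reference, a minimal central-difference check of the kind the expression above comes from; the `loss` callable and step size `h` are illustrative, not the repo's debug-gradient mode.

```cpp
// Sketch: central-difference gradient check for a single parameter.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <functional>

double numericGradient(const std::function<double(double)>& loss, double w,
                       double h = 1e-3) {
  return (loss(w + h) - loss(w - h)) / (2 * h);
}

int main() {
  auto l2 = [](double w) { return w * w; };  // e.g. a single L2 term
  double w = 0.002;
  double numeric = numericGradient(l2, w);   // ((w+h)^2 - (w-h)^2) / (2h)
  double analytic = 2 * w;                   // d/dw (w^2)
  // When w (and hence both gradients) is tiny, the relative comparison
  // becomes sensitive to floating-point noise.
  double relErr = std::fabs(numeric - analytic) /
                  std::max(std::fabs(numeric) + std::fabs(analytic), 1e-12);
  std::printf("numeric=%g analytic=%g relErr=%g\n", numeric, analytic, relErr);
}
```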
Why does the parameter norm keep increasing, even when we start from a very small initial point?
i=85000 loss: training:0.1082836169372637 regularizer:0 total:0.1082836169372637 i=85000 test error rate=4.47% cnn-layer [ 2 28 28 ] W gradient ratio: 3.99(0.03%); B gradient ratio: 2.25(0.02%) cnn-layer [ 2 14 14 ] W gradient ratio: 4.21(0.07%); B gradient ratio: 0.65(0.01%) fc-layer [ 100 ] W gradient ratio: 6.74(0.29%); B gradient ratio: 5.70(0.01%) fc-layer [ 10 ] W gradient ratio: 4.14(0.23%); B gradient ratio: 2.00(0.02%) l2_regularizer gradient ratio: 6.74(0.00%) 4.14(0.00%)
Convolve becomes the most expensive operation.
model arch: CNN 10,4 Pool 4,4 CNN 40,4 Pool 4,4 FC [ 200 ] learning rate = 0.005 L2(0) iterations=100000 batch=64 Error rate is 2.32% [ 0.918367 0.792952 1.25969 2.07921 1.52749 2.01794 5.21921 5.35019 1.64271 2.57681 ]
i=98000 loss: training:0.03525359210391027 regularizer:0.3732135181453493 total:0.4084671102492596 cnn-layer [ 10 28 28 ] W gradient ratio: 4.86(0.01%); B gradient ratio: 1.85(0.01%) cnn-layer [ 40 7 7 ] W gradient ratio: 17.76(0.01%); B gradient ratio: 3.29(0.00%) fc-layer [ 500 ] W gradient ratio: 4.31(0.13%); B gradient ratio: 12.45(0.00%) fc-layer [ 10 ] W gradient ratio: 3.94(0.03%); B gradient ratio: 1.71(0.00%) i=99000 loss: training:0.0372203400475962 regularizer:0.3673565987226719 total:0.4045769387702681 cnn-layer [ 10 28 28 ] W gradient ratio: 4.85(0.01%); B gradient ratio: 1.85(0.01%) cnn-layer [ 40 7 7 ] W gradient ratio: 17.59(0.01%); B gradient ratio: 3.29(0.00%) fc-layer [ 500 ] W gradient ratio: 4.32(0.08%); B gradient ratio: 12.45(0.00%) fc-layer [ 10 ] W gradient ratio: 3.95(0.02%); B gradient ratio: 1.71(0.00%) model arch: CNN 10,4 Pool 4,4 CNN 40,4 Pool 4,4 FC [ 500 ] learning rate = 0.005 L2(0.001) iterations=100000 batch=128 Error rate is 1.67% [ 1.02041 0.440529 1.93798 1.88119 1.32383 1.68161 2.08768 1.55642 1.64271 3.27056 ]
i=99000 loss: training:0.011632655539341322 regularizer:0.1302070264706461 total:0.14183968200998742 cnn-layer [ 10 28 28 ] W gradient ratio: 3.52(0.00%); B gradient ratio: 2.37(0.00%) cnn-layer [ 10 28 28 ] W gradient ratio: 7.09(0.00%); B gradient ratio: 1.50(0.00%) cnn-layer [ 10 14 14 ] W gradient ratio: 6.91(0.01%); B gradient ratio: 1.99(0.00%) fc-layer [ 800 ] W gradient ratio: 3.28(0.07%); B gradient ratio: 16.23(0.00%) fc-layer [ 10 ] W gradient ratio: 3.01(0.02%); B gradient ratio: 1.40(0.00%) model arch: CNN 10,3 CNN 10,3 Pool 2,2 CNN 10,3 Pool 2,2 FC [ 800 ] learning rate = 0.005 L2(0.001) iterations=100000 batch=128 Error rate is 1.31% [ 0.306122 0.264317 1.25969 1.08911 1.52749 0.784753 2.29645 1.75097 1.43737 2.4777 ]
ConvolutionalLayer performance improvements:
By unrolling dot(MatrixPatch, MatrixPatch) and avoiding the all-zero regions, we achieved a 2.5x - 3x performance improvement.
model arch: CNN 10,3 CNN 10,3 Pool 2,2 CNN 10,3 Pool 2,2 FC [ 800 ] learning rate = 0.005 L2(0.001) iterations=100 batch=128 Error rate is 51.739999999999995% [ 6.22449 12.6872 67.0543 21.6832 81.5682 73.3184 82.9854 1.36187 98.768 82.4579 ]
real 4m17.714s user 83m4.199s sys 0m37.779s
After the optimization:
model arch: CNN 10,3 CNN 10,3 Pool 2,2 CNN 10,3 Pool 2,2 FC [ 800 ] learning rate = 0.005 L2(0.001) iterations=100 batch=128 Error rate is 40.48% [ 24.1837 0.264317 50.3876 38.2178 27.0876 80.7175 41.2317 18.5798 76.386 58.0773 ]
real 1m40.879s user 30m58.608s sys 1m7.655s
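A minimal sketch of the loop-bound trick described above: instead of testing every tap for being in the zero-padded region, the kernel row/column ranges are clamped up front so the inner loops only touch valid pixels. The `Plane` type and accessors are illustrative, not the repo's MatrixPatch API.

```cpp
// Sketch: one output element of a "same"-padded 2D convolution where the
// kernel loop bounds are clamped to the valid image region, so the
// zero-padded taps are skipped instead of multiplied by 0.
#include <algorithm>
#include <cstddef>
#include <vector>

struct Plane {
  int rows = 0, cols = 0;
  std::vector<double> data;  // row-major
  double at(int r, int c) const { return data[std::size_t(r) * cols + c]; }
};

double convolveAt(const Plane& img, const Plane& kernel, int outR, int outC) {
  const int pad = kernel.rows / 2;  // assume an odd, square kernel for simplicity
  // Clamp the kernel window to the part that overlaps the image.
  const int krBegin = std::max(0, pad - outR);
  const int krEnd = std::min(kernel.rows, img.rows + pad - outR);
  const int kcBegin = std::max(0, pad - outC);
  const int kcEnd = std::min(kernel.cols, img.cols + pad - outC);

  double sum = 0;
  for (int kr = krBegin; kr < krEnd; ++kr) {
    for (int kc = kcBegin; kc < kcEnd; ++kc) {
      sum += img.at(outR + kr - pad, outC + kc - pad) * kernel.at(kr, kc);
    }
  }
  return sum;
}
```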
i=99000 loss: training:0.0003670514829875239 regularizer:0 total:0.0003670514829875239 cnn-layer [ 10 28 28 ] W gradient ratio: 6.01(0.00%); B gradient ratio: 1.75(0.00%) cnn-layer [ 10 28 28 ] W gradient ratio: 17.42(0.00%); B gradient ratio: 1.51(0.00%) cnn-layer [ 10 14 14 ] W gradient ratio: 17.62(0.00%); B gradient ratio: 1.92(0.00%) fc-layer [ 800 ] W gradient ratio: 3.42(0.01%); B gradient ratio: 16.06(0.00%) fc-layer [ 10 ] W gradient ratio: 2.91(0.00%); B gradient ratio: 1.79(0.00%) model arch: CNN 10,3 CNN 10,3 Pool 2,2 CNN 10,3 Pool 2,2 FC [ 800 ] learning rate = 0.005 L2(0) iterations=100000 batch=128 Error rate is 1.35% [ 0.816327 0.440529 1.25969 0.990099 0.916497 1.34529 1.87891 1.6537 1.95072 2.37859 ]
real 430m27.284s
i=99000 loss: training:0.004228097360668435 regularizer:0 total:0.004228097360668435 cnn-layer [ 10 28 28 ] W gradient ratio: 6.77(0.00%); B gradient ratio: 2.10(0.00%) cnn-layer [ 10 14 14 ] W gradient ratio: 17.99(0.00%); B gradient ratio: 1.77(0.00%) fc-layer [ 800 ] W gradient ratio: 4.59(0.02%); B gradient ratio: 16.13(0.00%) fc-layer [ 10 ] W gradient ratio: 4.08(0.00%); B gradient ratio: 2.10(0.00%) model arch: CNN 10,3 Pool 2,2 CNN 10,3 Pool 2,2 FC [ 800 ] learning rate = 0.005 L2(0) iterations=100000 miniBatch=128 Error rate is 1.23% [ 0.612245 0.881057 1.45349 0.891089 1.12016 1.00897 1.46138 0.972763 1.3347 2.57681 ]
It looks like the first and second layers are detecting something, but it's not conclusive what they are looking for. It also becomes harder to visualize a filter that needs to look at 10 channels.
Increasing the size of the detector (to 4 or 5) to see what I get.
i=99000 loss: training:0.002314503058575068 regularizer:0 total:0.002314503058575068 cnn-layer [ 10 28 28 ] W gradient ratio: 7.84(0.00%); B gradient ratio: 1.78(0.00%) cnn-layer [ 10 14 14 ] W gradient ratio: 23.30(0.00%); B gradient ratio: 1.86(0.00%) fc-layer [ 400 ] W gradient ratio: 4.47(0.04%); B gradient ratio: 11.64(0.00%) fc-layer [ 10 ] W gradient ratio: 3.72(0.01%); B gradient ratio: 1.72(0.00%) model arch: CNN 10,4 Pool 2,2 CNN 10,4 Pool 2,2 FC [ 400 ] learning rate = 0.005 L2(0) iterations=100000 miniBatch=128 Error rate is 1.41% [ 0.816327 0.176211 1.45349 1.18812 1.01833 1.34529 1.9833 1.84825 1.54004 2.87413 ]
i=99000 loss: training:0.003968845446158552 regularizer:0 total:0.003968845446158552 cnn-layer [ 10 28 28 ] W gradient ratio: 8.02(0.00%); B gradient ratio: 2.11(0.00%) cnn-layer [ 10 14 14 ] W gradient ratio: 23.23(0.00%); B gradient ratio: 1.81(0.00%) cnn-layer [ 10 7 7 ] W gradient ratio: 23.57(0.00%); B gradient ratio: 1.65(0.00%) fc-layer [ 100 ] W gradient ratio: 4.52(0.03%); B gradient ratio: 5.82(0.00%) fc-layer [ 10 ] W gradient ratio: 3.15(0.01%); B gradient ratio: 1.40(0.00%) model arch: CNN 10,4 Pool 2,2 CNN 10,4 Pool 2,2 CNN 10,4 Pool 2,2 FC [ 100 ] learning rate = 0.005 L2(0) iterations=100000 miniBatch=128 Error rate is 2.46% [ 1.12245 0.440529 3.39147 2.07921 2.24033 1.90583 1.56576 2.91829 3.49076 5.55005 ]
This configuration converges very slowly (if at all).
Need to check the implementation for the pool(3,2) configuration.
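A quick sanity check that might help here, assuming no padding: with window w and stride s, the output extent is floor((n - w)/s) + 1, or the ceil variant if partial windows are kept. For n = 28, w = 3, s = 2 that is 13 with floor and 14 with ceil; the log below reports 28 -> 14, which matches the ceil variant (or an implementation with padding).

```cpp
// Sketch: expected pooling output extent for window w, stride s, no padding.
// The floor variant drops partial windows; the ceil variant keeps them.
#include <cstdio>

int poolOutFloor(int n, int w, int s) { return (n - w) / s + 1; }
int poolOutCeil(int n, int w, int s) { return (n - w + s - 1) / s + 1; }

int main() {
  std::printf("floor: %d  ceil: %d\n", poolOutFloor(28, 3, 2),
              poolOutCeil(28, 3, 2));  // floor: 13  ceil: 14
}
```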
model arch: CNN 32,4 Pool 3,2 CNN 16,4 Pool 3,2 CNN 16,3 Pool 3,2 FC [ 100 ] learning rate = 0.005 L2(0) iterations=100000 miniBatch=128 input [ 784 ] input [ ] adapter [ 1 28 28 ] cnn-layer [ 32 28 28 ] relu pooling-layer [ 32 14 14 ] w:3;s:2 cnn-layer [ 16 14 14 ] relu pooling-layer [ 16 7 7 ] w:3;s:2 cnn-layer [ 16 7 7 ] relu pooling-layer [ 16 4 4 ] w:3;s:2 adapter [ 256 ] fc-layer [ 100 ] relu fc-layer [ 10 ] softmax_loss i=0 loss: training:2.403120321884601 regularizer:0 total:2.403120321884601 i=0 test error rate=88.64999999999999% cnn-layer [ 32 28 28 ] W gradient ratio: 12.99(0.00%); B gradient ratio: 3.66(0.00%) cnn-layer [ 16 14 14 ] W gradient ratio: 51.94(0.00%); B gradient ratio: 2.11(0.00%) cnn-layer [ 16 7 7 ] W gradient ratio: 27.24(0.00%); B gradient ratio: 2.42(0.00%) fc-layer [ 100 ] W gradient ratio: 0.26(27.10%); B gradient ratio: 5.77(0.00%) fc-layer [ 10 ] W gradient ratio: 0.17(14.23%); B gradient ratio: 1.50(0.05%) i=1000 loss: training:2.3403331125269626 regularizer:0 total:2.3403331125269626 cnn-layer [ 32 28 28 ] W gradient ratio: 12.99(0.00%); B gradient ratio: 3.66(0.00%) cnn-layer [ 16 14 14 ] W gradient ratio: 51.94(0.00%); B gradient ratio: 2.11(0.00%) cnn-layer [ 16 7 7 ] W gradient ratio: 27.24(0.00%); B gradient ratio: 2.42(0.00%) fc-layer [ 100 ] W gradient ratio: 0.31(0.00%); B gradient ratio: 5.77(0.00%) fc-layer [ 10 ] W gradient ratio: 0.26(0.00%); B gradient ratio: 0.92(0.07%) i=2000 loss: training:2.3155112027483136 regularizer:0 total:2.3155112027483136 cnn-layer [ 32 28 28 ] W gradient ratio: 12.99(0.00%); B gradient ratio: 3.66(0.00%) cnn-layer [ 16 14 14 ] W gradient ratio: 51.94(0.00%); B gradient ratio: 2.11(0.00%) cnn-layer [ 16 7 7 ] W gradient ratio: 27.24(0.00%); B gradient ratio: 2.42(0.00%) fc-layer [ 100 ] W gradient ratio: 0.31(0.00%); B gradient ratio: 5.77(0.00%) fc-layer [ 10 ] W gradient ratio: 0.26(0.00%); B gradient ratio: 0.58(0.06%) i=3000 loss: training:2.306512662797641 regularizer:0 total:2.306512662797641 cnn-layer [ 32 28 28 ] W gradient ratio: 12.99(0.00%); B gradient ratio: 3.66(0.00%) cnn-layer [ 16 14 14 ] W gradient ratio: 51.94(0.00%); B gradient ratio: 2.11(0.00%) cnn-layer [ 16 7 7 ] W gradient ratio: 27.24(0.00%); B gradient ratio: 2.42(0.00%) fc-layer [ 100 ] W gradient ratio: 0.31(0.00%); B gradient ratio: 5.77(0.00%) fc-layer [ 10 ] W gradient ratio: 0.26(0.00%); B gradient ratio: 0.38(0.07%)
CNN gradient = 0
I realized my CNN layers may never have learned anything; it was just the MLP layers that were learning.
i=0 loss: training:2.4306220533431264 regularizer:0 total:2.4306220533431264 i=0 test error rate=88.64999999999999% cnn-layer [ 32 28 28 ] W gradient ratio: 0.00/9.80(0.00%); B gradient ratio: 0.00/2.96(0.00%) cnn-layer [ 64 14 14 ] W gradient ratio: 0.00/78.73(0.00%); B gradient ratio: 0.00/4.39(0.00%) cnn-layer [ 64 7 7 ] W gradient ratio: 0.00/111.31(0.00%); B gradient ratio: 0.00/4.53(0.00%) fc-layer [ 64 ] W gradient ratio: 0.14(299.05%); B gradient ratio: 4.75(0.00%) fc-layer [ 10 ] W gradient ratio: 0.20(14.86%); B gradient ratio: 1.71(0.10%)
It's not completely 0. It's very small.
cnn-layer [ 32 28 28 ] W gradient ratio: 0.00027/10.26(0.00%); B gradient ratio: 0.00012/3.39(0.00%) cnn-layer [ 64 14 14 ] W gradient ratio: 0.00026/78.34(0.00%); B gradient ratio: 0.00001/4.54(0.00%) cnn-layer [ 64 7 7 ] W gradient ratio: 0.00028/111.02(0.00%); B gradient ratio: 0.00000/4.77(0.00%) fc-layer [ 64 ] W gradient ratio: 0.14(241.31%); B gradient ratio: 4.17(0.00%) fc-layer [ 10 ] W gradient ratio: 0.19(17.40%); B gradient ratio: 2.10(0.12%)
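The exact definition of the "W gradient ratio" printed above isn't shown here, but a common diagnostic in the same spirit is the ratio of the gradient norm to the parameter norm per layer; a minimal sketch:

```cpp
// Sketch: gradient-norm-to-parameter-norm ratio as a per-layer
// vanishing-gradient diagnostic. This is a common metric in this spirit;
// the exact definition of the ratio printed in the logs above is not shown here.
#include <algorithm>
#include <cmath>
#include <vector>

double l2Norm(const std::vector<double>& v) {
  double s = 0;
  for (double x : v) s += x * x;
  return std::sqrt(s);
}

// Ratios near 0 for the CNN layers while the FC layers have much larger
// ratios (as in the logs above) suggest the convolutional weights are
// barely being updated.
double gradientRatio(const std::vector<double>& grad,
                     const std::vector<double>& weights) {
  return l2Norm(grad) / std::max(l2Norm(weights), 1e-12);
}
```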
model arch: CNN 32,3 Pool 2,2 CNN 64,3 Pool 2,2 CNN 64,3 Pool 2,2 FC [ 64 ] learning rate = 0.1 L2(0) iterations=50000 miniBatch=128 input [ 784 ] input [ ] adapter [ 1 28 28 ] cnn-layer [ 32 28 28 ] relu pooling-layer [ 32 14 14 ] w:2;s:2 cnn-layer [ 64 14 14 ] relu pooling-layer [ 64 7 7 ] w:2;s:2 cnn-layer [ 64 7 7 ] relu pooling-layer [ 64 4 4 ] w:2;s:2 adapter [ 1024 ] fc-layer [ 64 ] relu fc-layer [ 10 ] softmax_loss i=0 loss: training:2.535392705283118 regularizer:0 total:2.535392705283118 i=0 test error rate=90.42% cnn-layer [ 32 28 28 ] W gradient ratio: 0.00335/9.73(0.03%); B gradient ratio: 0.00141/3.43(0.04%) cnn-layer [ 64 14 14 ] W gradient ratio: 0.00403/78.56(0.01%); B gradient ratio: 0.00019/4.46(0.00%) cnn-layer [ 64 7 7 ] W gradient ratio: 0.00427/110.74(0.00%); B gradient ratio: 0.00002/4.79(0.00%) fc-layer [ 64 ] W gradient ratio: 0.14(3993.38%); B gradient ratio: 4.81(0.03%) fc-layer [ 10 ] W gradient ratio: 0.20(200.17%); B gradient ratio: 2.29(1.14%) Floating point exception (core dumped)