Closed: xtj49 closed this issue 5 years ago.
The error may be in the swish function, because the network works when the sigmoid function is used instead.
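(For reference, swish(x) = x * sigmoid(x), and the hard approximation used in MobileNetV3 is h-swish(x) = x * ReLU6(x + 3) / 6.)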
The problem is in the power layer (x/6).
Should we use an extra Eltwise PROD layer, and the ReLU6 layer rather than the ReLU layer, in the h-swish function?
But there is a question: the Eltwise layer operates on two matrices (blobs), not on a matrix and a constant, doesn't it?
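As far as I know, that's right: Caffe's Eltwise layer only combines blobs, taking two or more bottoms of identical shape, with no scalar operand; the division by a constant therefore has to be a separate Power layer. A minimal sketch of the product step, with hypothetical layer and blob names:

layer {
  name: "hswish/mul"
  type: "Eltwise"
  bottom: "conv1"          # the original activation x
  bottom: "hswish/div6"    # ReLU6(x + 3) / 6, produced by ReLU6 and Power layers
  top: "hswish/out"
  eltwise_param { operation: PROD }   # elementwise product of the two blobs
}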
What's the problem in the power layer (x/6)? From the source code, the power layer does y = (shift + scale * x)^power.
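Concretely, x/6 corresponds to power = 1, scale = 1/6, shift = 0, e.g. (hypothetical names):

layer {
  name: "hswish/div6"
  type: "Power"
  bottom: "hswish/relu6"
  top: "hswish/div6"
  power_param { power: 1 scale: 0.166667 shift: 0 }  # y = (0 + x/6)^1 = x/6
}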
The source code is right. But I find that when the scale coefficient of the power layer is in [0, 1], the training loss stays the same all the time. When I remove this "/6" layer, the loss changes. Does the network work well with your data?
The network does not work well; the loss also stays at the same value.
Since my GPU is still busy with other tasks, I have not tried this yet. I am not sure whether the power layer supports in-place operation, but the network definition uses it in-place.
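In-place just means the layer's top reuses its bottom blob. One plausible failure mode (my guess, since the thread doesn't pin it down): if the h-swish chain runs in-place on a single blob, the original activation is overwritten before the Eltwise PROD consumes it, so both the product and the gradients flowing back through the shared blob are computed from the wrong data. The fix is only a naming change; a sketch with hypothetical blob names:

# in-place: the top reuses the bottom blob, overwriting it
layer {
  name: "hswish/div6"
  type: "Power"
  bottom: "act"
  top: "act"                # same name as bottom => in-place
  power_param { power: 1 scale: 0.166667 }
}

# non-in-place: the result goes to a fresh blob, so "act" survives for the Eltwise PROD
layer {
  name: "hswish/div6"
  type: "Power"
  bottom: "act"
  top: "hswish/div6"        # distinct name => new blob
  power_param { power: 1 scale: 0.166667 }
}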
I have tested the prototxt on the MNIST dataset and got the same result. The training log is as follows:
I0515 11:05:31.873651 37636 solver.cpp:239] Iteration 0 (-nan iter/s, 26.7698s/20 iters), loss = 6.91161
I0515 11:05:31.873690 37636 solver.cpp:258] Train net output #0: loss = 6.91161 (* 1 = 6.91161 loss)
I0515 11:05:31.873761 37636 sgd_solver.cpp:112] Iteration 0, lr = 0.01
I0515 11:05:51.320263 37636 solver.cpp:239] Iteration 20 (1.02848 iter/s, 19.4461s/20 iters), loss = 6.90501
I0515 11:05:51.320302 37636 solver.cpp:258] Train net output #0: loss = 6.90776 (* 1 = 6.90776 loss)
I0515 11:05:51.320356 37636 sgd_solver.cpp:112] Iteration 20, lr = 0.01
I0515 11:06:10.302816 37636 solver.cpp:239] Iteration 40 (1.05363 iter/s, 18.9821s/20 iters), loss = 6.90635
I0515 11:06:10.303040 37636 solver.cpp:258] Train net output #0: loss = 6.90776 (* 1 = 6.90776 loss)
I0515 11:06:10.303057 37636 sgd_solver.cpp:112] Iteration 40, lr = 0.01
I0515 11:06:29.307520 37636 solver.cpp:239] Iteration 60 (1.05241 iter/s, 19.004s/20 iters), loss = 6.90681
I0515 11:06:29.307559 37636 solver.cpp:258] Train net output #0: loss = 6.90776 (* 1 = 6.90776 loss)
I0515 11:06:29.307596 37636 sgd_solver.cpp:112] Iteration 60, lr = 0.01
I0515 11:06:48.303532 37636 solver.cpp:239] Iteration 80 (1.05288 iter/s, 18.9955s/20 iters), loss = 6.90704
I0515 11:06:48.303723 37636 solver.cpp:258] Train net output #0: loss = 6.90776 (* 1 = 6.90776 loss)
I0515 11:06:48.303740 37636 sgd_solver.cpp:112] Iteration 80, lr = 0.01
I0515 11:07:07.300627 37636 solver.cpp:239] Iteration 100 (1.05283 iter/s, 18.9964s/20 iters), loss = 6.90714
I0515 11:07:07.300673 37636 solver.cpp:258] Train net output #0: loss = 6.90776 (* 1 = 6.90776 loss)
I0515 11:07:07.300725 37636 sgd_solver.cpp:112] Iteration 100, lr = 0.01
I0515 11:07:26.307842 37636 solver.cpp:239] Iteration 120 (1.05226 iter/s, 19.0067s/20 iters), loss = 6.90776
I0515 11:07:26.308053 37636 solver.cpp:258] Train net output #0: loss = 6.90776 (* 1 = 6.90776 loss)
I0515 11:07:26.308068 37636 sgd_solver.cpp:112] Iteration 120, lr = 0.01
I0515 11:07:45.309257 37636 solver.cpp:239] Iteration 140 (1.05259 iter/s, 19.0007s/20 iters), loss = 6.90776
I0515 11:07:45.309298 37636 solver.cpp:258] Train net output #0: loss = 6.90776 (* 1 = 6.90776 loss)
I0515 11:07:45.309329 37636 sgd_solver.cpp:112] Iteration 140, lr = 0.01
I0515 11:08:04.307058 37636 solver.cpp:239] Iteration 160 (1.05278 iter/s, 18.9973s/20 iters), loss = 6.90776
I0515 11:08:04.307271 37636 solver.cpp:258] Train net output #0: loss = 6.90776 (* 1 = 6.90776 loss)
I0515 11:08:04.307287 37636 sgd_solver.cpp:112] Iteration 160, lr = 0.01
I0515 11:08:23.303742 37636 solver.cpp:239] Iteration 180 (1.05285 iter/s, 18.996s/20 iters), loss = 6.90776
I0515 11:08:23.303779 37636 solver.cpp:258] Train net output #0: loss = 6.90776 (* 1 = 6.90776 loss)
I0515 11:08:23.303833 37636 sgd_solver.cpp:112] Iteration 180, lr = 0.01
I0515 11:08:41.365075 37636 solver.cpp:347] Iteration 200, Testing net (#0)
I0515 11:08:41.365545 37636 net.cpp:678] Ignoring source layer loss
I0515 11:09:03.696059 37636 solver.cpp:414] Test net output #0: accuracy = 0.1032
I0515 11:09:04.653638 37636 solver.cpp:239] Iteration 200 (0.48369 iter/s, 41.3488s/20 iters), loss = 6.90776
I0515 11:09:04.653683 37636 solver.cpp:258] Train net output #0: loss = 6.90776 (* 1 = 6.90776 loss)
I0515 11:09:04.653744 37636 sgd_solver.cpp:112] Iteration 200, lr = 0.01
Then I changed the power layer to be non-in-place. The training loss now decreases steadily over a few hundred iterations, and the training log is as follows:
I0515 10:55:12.849515 35382 solver.cpp:239] Iteration 0 (-nan iter/s, 27.2175s/20 iters), loss = 2.27915
I0515 10:55:12.849555 35382 solver.cpp:258] Train net output #0: loss = 2.27915 (* 1 = 2.27915 loss)
I0515 10:55:12.849609 35382 sgd_solver.cpp:112] Iteration 0, lr = 0.01
I0515 10:55:32.186156 35382 solver.cpp:239] Iteration 20 (1.03433 iter/s, 19.3361s/20 iters), loss = 2.2192
I0515 10:55:32.186208 35382 solver.cpp:258] Train net output #0: loss = 2.13934 (* 1 = 2.13934 loss)
I0515 10:55:32.186231 35382 sgd_solver.cpp:112] Iteration 20, lr = 0.01
I0515 10:55:51.084617 35382 solver.cpp:239] Iteration 40 (1.05832 iter/s, 18.8979s/20 iters), loss = 2.12924
I0515 10:55:51.084879 35382 solver.cpp:258] Train net output #0: loss = 2.03532 (* 1 = 2.03532 loss)
I0515 10:55:51.084897 35382 sgd_solver.cpp:112] Iteration 40, lr = 0.01
I0515 10:56:09.980957 35382 solver.cpp:239] Iteration 60 (1.05845 iter/s, 18.8956s/20 iters), loss = 2.04191
I0515 10:56:09.980991 35382 solver.cpp:258] Train net output #0: loss = 1.8054 (* 1 = 1.8054 loss)
I0515 10:56:09.981035 35382 sgd_solver.cpp:112] Iteration 60, lr = 0.01
I0515 10:56:28.875658 35382 solver.cpp:239] Iteration 80 (1.05853 iter/s, 18.8942s/20 iters), loss = 1.98364
I0515 10:56:28.875864 35382 solver.cpp:258] Train net output #0: loss = 1.91929 (* 1 = 1.91929 loss)
I0515 10:56:28.875878 35382 sgd_solver.cpp:112] Iteration 80, lr = 0.01
I0515 10:56:47.791654 35382 solver.cpp:239] Iteration 100 (1.05735 iter/s, 18.9153s/20 iters), loss = 1.9244
I0515 10:56:47.791693 35382 solver.cpp:258] Train net output #0: loss = 1.8284 (* 1 = 1.8284 loss)
I0515 10:56:47.791745 35382 sgd_solver.cpp:112] Iteration 100, lr = 0.01
I0515 10:57:06.693739 35382 solver.cpp:239] Iteration 120 (1.05811 iter/s, 18.9016s/20 iters), loss = 1.7903
I0515 10:57:06.693966 35382 solver.cpp:258] Train net output #0: loss = 1.4549 (* 1 = 1.4549 loss)
I0515 10:57:06.693981 35382 sgd_solver.cpp:112] Iteration 120, lr = 0.01
I0515 10:57:25.593870 35382 solver.cpp:239] Iteration 140 (1.05823 iter/s, 18.8994s/20 iters), loss = 1.66322
I0515 10:57:25.593910 35382 solver.cpp:258] Train net output #0: loss = 1.29358 (* 1 = 1.29358 loss)
I0515 10:57:25.593968 35382 sgd_solver.cpp:112] Iteration 140, lr = 0.01
I0515 10:57:44.503170 35382 solver.cpp:239] Iteration 160 (1.05771 iter/s, 18.9088s/20 iters), loss = 1.52208
I0515 10:57:44.503396 35382 solver.cpp:258] Train net output #0: loss = 1.03544 (* 1 = 1.03544 loss)
I0515 10:57:44.503412 35382 sgd_solver.cpp:112] Iteration 160, lr = 0.01
I0515 10:58:03.408309 35382 solver.cpp:258] Train net output #0: loss = 0.836138 (* 1 = 0.836138 loss)
I0515 10:58:03.408349 35382 sgd_solver.cpp:112] Iteration 180, lr = 0.01
I0515 10:58:21.384106 35382 solver.cpp:347] Iteration 200, Testing net (#0)
I0515 10:58:21.384551 35382 net.cpp:678] Ignoring source layer loss
I0515 10:58:44.481245 35382 solver.cpp:414] Test net output #0: accuracy = 0.1787
I0515 10:58:45.433612 35382 solver.cpp:239] Iteration 200 (0.475915 iter/s, 42.0243s/20 iters), loss = 1.16422
I0515 10:58:45.433640 35382 solver.cpp:258] Train net output #0: loss = 0.627196 (* 1 = 0.627196 loss)
I0515 10:58:45.433701 35382 sgd_solver.cpp:112] Iteration 200, lr = 0.01
I will update the prototxt ASAP and then you can try it again.
In addition, I will write a dedicated h-swish function layer ( x * ReLU6(x + 3) / 6 ) to replace the existing composed implementation.
I have tried this model on binary classification, but the loss stays at the same value (0.693147, i.e. ln 2, which is what random guessing gives) all the time. I am sure the data is right, and I have also tried different depthwise conv, ReLU6, and BN parameters. None of these methods works. Does it really work???