jixing0415 / caffe-mobilenet-v3

Caffe Implementation of MobileNets V3
MIT License

Why does the loss stay at the same value? #1

Closed · xtj49 closed 5 years ago

xtj49 commented 5 years ago

I have tried this model on a binary classification task, but the loss stays at the same value (0.693147) the whole time. I am sure the data is correct, and I have also tried different depthwise convolution, ReLU6, and BN parameters. None of them helped. Does this implementation actually work?
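For context, 0.693147 is exactly the cross-entropy loss of a two-class softmax stuck at chance, which is a strong sign the network is not learning at all:

-ln(1/2) = ln 2 ≈ 0.693147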

xtj49 commented 5 years ago

The error may be in the h-swish function, because the sigmoid-based version works.

xtj49 commented 5 years ago

The problem is in the Power layer (x/6).

lrain-CN commented 5 years ago

> The problem is in the Power layer (x/6).

Use an extra Eltwise PROD layer; and shouldn't the h-swish function use a ReLU6 layer rather than a ReLU layer?

xtj49 commented 5 years ago

> The problem is in the Power layer (x/6).

> Use an extra Eltwise PROD layer; and shouldn't the h-swish function use a ReLU6 layer rather than a ReLU layer?

But there is a question: the Eltwise layer operates on two blobs (matrix with matrix), not on a blob and a constant, doesn't it?
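That is indeed how stock Caffe behaves: Eltwise takes two (or more) bottom blobs of the same shape, so a constant factor such as 1/6 has to go through a separate Power (or Scale) layer. A minimal sketch of the PROD step, with illustrative blob names:

# Eltwise PROD multiplies same-shaped blobs element-wise; it takes no constants.
layer {
  name: "hswish/mul"
  type: "Eltwise"
  bottom: "x"          # the input itself
  bottom: "x/hsig"     # ReLU6(x + 3) / 6, produced by earlier layers
  top: "x/hswish"
  eltwise_param { operation: PROD }
}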

zhaokai5 commented 5 years ago

> The problem is in the Power layer (x/6).

What is the problem in the Power layer (x/6)? From the source code, the Power layer computes y = (shift + scale * x)^power.
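For reference, x/6 maps onto Caffe's PowerParameter as power = 1, scale = 1/6, shift = 0. A minimal layer definition (layer and blob names are illustrative):

# y = (shift + scale * x)^power = (0 + x/6)^1 = x/6
layer {
  name: "div6"
  type: "Power"
  bottom: "in"
  top: "out"
  power_param { power: 1.0 scale: 0.1666667 shift: 0.0 }
}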

xtj49 commented 5 years ago

> The problem is in the Power layer (x/6).

> What is the problem in the Power layer (x/6)? From the source code, the Power layer computes y = (shift + scale * x)^power.

The source code is right. But I find that when the scale coefficient of the Power layer is in [0, 1], the training loss stays the same all the time. When I remove this "/6" layer, the loss changes. Does the network work well with your data?

zhaokai5 commented 5 years ago

> The problem is in the Power layer (x/6).

> What is the problem in the Power layer (x/6)? From the source code, the Power layer computes y = (shift + scale * x)^power.

> The source code is right. But I find that when the scale coefficient of the Power layer is in [0, 1], the training loss stays the same all the time. When I remove this "/6" layer, the loss changes. Does the network work well with your data?

The network does not work well. The loss also stayed at the same value.

jixing0415 commented 5 years ago

> The problem is in the Power layer (x/6).

> What is the problem in the Power layer (x/6)? From the source code, the Power layer computes y = (shift + scale * x)^power.

> The source code is right. But I find that when the scale coefficient of the Power layer is in [0, 1], the training loss stays the same all the time. When I remove this "/6" layer, the loss changes. Does the network work well with your data?

Since my GPU is still running other tasks, I have not tried this yet. I am not sure whether the Power layer supports in-place operation, but the network definition uses it in-place.
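For clarity, "in-place" in Caffe means the top blob reuses the bottom blob's name, so the layer overwrites its own input; making the Power layer non-in-place only requires giving the top a distinct name. A sketch of the two variants (names illustrative):

# In-place: top == bottom, so the forward pass overwrites the input blob.
layer {
  name: "hsig/div6"
  type: "Power"
  bottom: "hsig"
  top: "hsig"                         # same name -> in-place
  power_param { power: 1.0 scale: 0.1666667 shift: 0.0 }
}

# Non-in-place: a distinct top preserves the input blob for the backward pass.
layer {
  name: "hsig/div6"
  type: "Power"
  bottom: "hsig"
  top: "hsig/div6"                    # distinct name -> non-in-place
  power_param { power: 1.0 scale: 0.1666667 shift: 0.0 }
}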

jixing0415 commented 5 years ago

> The problem is in the Power layer (x/6).

> What is the problem in the Power layer (x/6)? From the source code, the Power layer computes y = (shift + scale * x)^power.

> The source code is right. But I find that when the scale coefficient of the Power layer is in [0, 1], the training loss stays the same all the time. When I remove this "/6" layer, the loss changes. Does the network work well with your data?

I have tested the prototxt on the MNIST dataset and got the same result (note that the stuck value 6.90776 ≈ ln 1000, which matches a uniform softmax over the prototxt's 1000-way classifier head). The training log is as follows:

I0515 11:05:31.873651 37636 solver.cpp:239] Iteration 0 (-nan iter/s, 26.7698s/20 iters), loss = 6.91161
I0515 11:05:31.873690 37636 solver.cpp:258]     Train net output #0: loss = 6.91161 (* 1 = 6.91161 loss)
I0515 11:05:31.873761 37636 sgd_solver.cpp:112] Iteration 0, lr = 0.01
I0515 11:05:51.320263 37636 solver.cpp:239] Iteration 20 (1.02848 iter/s, 19.4461s/20 iters), loss = 6.90501
I0515 11:05:51.320302 37636 solver.cpp:258]     Train net output #0: loss = 6.90776 (* 1 = 6.90776 loss)
I0515 11:05:51.320356 37636 sgd_solver.cpp:112] Iteration 20, lr = 0.01
I0515 11:06:10.302816 37636 solver.cpp:239] Iteration 40 (1.05363 iter/s, 18.9821s/20 iters), loss = 6.90635
I0515 11:06:10.303040 37636 solver.cpp:258]     Train net output #0: loss = 6.90776 (* 1 = 6.90776 loss)
I0515 11:06:10.303057 37636 sgd_solver.cpp:112] Iteration 40, lr = 0.01
I0515 11:06:29.307520 37636 solver.cpp:239] Iteration 60 (1.05241 iter/s, 19.004s/20 iters), loss = 6.90681
I0515 11:06:29.307559 37636 solver.cpp:258]     Train net output #0: loss = 6.90776 (* 1 = 6.90776 loss)
I0515 11:06:29.307596 37636 sgd_solver.cpp:112] Iteration 60, lr = 0.01
I0515 11:06:48.303532 37636 solver.cpp:239] Iteration 80 (1.05288 iter/s, 18.9955s/20 iters), loss = 6.90704
I0515 11:06:48.303723 37636 solver.cpp:258]     Train net output #0: loss = 6.90776 (* 1 = 6.90776 loss)
I0515 11:06:48.303740 37636 sgd_solver.cpp:112] Iteration 80, lr = 0.01
I0515 11:07:07.300627 37636 solver.cpp:239] Iteration 100 (1.05283 iter/s, 18.9964s/20 iters), loss = 6.90714
I0515 11:07:07.300673 37636 solver.cpp:258]     Train net output #0: loss = 6.90776 (* 1 = 6.90776 loss)
I0515 11:07:07.300725 37636 sgd_solver.cpp:112] Iteration 100, lr = 0.01
I0515 11:07:26.307842 37636 solver.cpp:239] Iteration 120 (1.05226 iter/s, 19.0067s/20 iters), loss = 6.90776
I0515 11:07:26.308053 37636 solver.cpp:258]     Train net output #0: loss = 6.90776 (* 1 = 6.90776 loss)
I0515 11:07:26.308068 37636 sgd_solver.cpp:112] Iteration 120, lr = 0.01
I0515 11:07:45.309257 37636 solver.cpp:239] Iteration 140 (1.05259 iter/s, 19.0007s/20 iters), loss = 6.90776
I0515 11:07:45.309298 37636 solver.cpp:258]     Train net output #0: loss = 6.90776 (* 1 = 6.90776 loss)
I0515 11:07:45.309329 37636 sgd_solver.cpp:112] Iteration 140, lr = 0.01
I0515 11:08:04.307058 37636 solver.cpp:239] Iteration 160 (1.05278 iter/s, 18.9973s/20 iters), loss = 6.90776
I0515 11:08:04.307271 37636 solver.cpp:258]     Train net output #0: loss = 6.90776 (* 1 = 6.90776 loss)
I0515 11:08:04.307287 37636 sgd_solver.cpp:112] Iteration 160, lr = 0.01
I0515 11:08:23.303742 37636 solver.cpp:239] Iteration 180 (1.05285 iter/s, 18.996s/20 iters), loss = 6.90776
I0515 11:08:23.303779 37636 solver.cpp:258]     Train net output #0: loss = 6.90776 (* 1 = 6.90776 loss)
I0515 11:08:23.303833 37636 sgd_solver.cpp:112] Iteration 180, lr = 0.01
I0515 11:08:41.365075 37636 solver.cpp:347] Iteration 200, Testing net (#0)
I0515 11:08:41.365545 37636 net.cpp:678] Ignoring source layer loss
I0515 11:09:03.696059 37636 solver.cpp:414]     Test net output #0: accuracy = 0.1032
I0515 11:09:04.653638 37636 solver.cpp:239] Iteration 200 (0.48369 iter/s, 41.3488s/20 iters), loss = 6.90776
I0515 11:09:04.653683 37636 solver.cpp:258]     Train net output #0: loss = 6.90776 (* 1 = 6.90776 loss)
I0515 11:09:04.653744 37636 sgd_solver.cpp:112] Iteration 200, lr = 0.01

Then I changed the Power layer to be non-in-place. The training loss now decreases normally, and the training log is as follows:

I0515 10:55:12.849515 35382 solver.cpp:239] Iteration 0 (-nan iter/s, 27.2175s/20 iters), loss = 2.27915
I0515 10:55:12.849555 35382 solver.cpp:258]     Train net output #0: loss = 2.27915 (* 1 = 2.27915 loss)
I0515 10:55:12.849609 35382 sgd_solver.cpp:112] Iteration 0, lr = 0.01
I0515 10:55:32.186156 35382 solver.cpp:239] Iteration 20 (1.03433 iter/s, 19.3361s/20 iters), loss = 2.2192
I0515 10:55:32.186208 35382 solver.cpp:258]     Train net output #0: loss = 2.13934 (* 1 = 2.13934 loss)
I0515 10:55:32.186231 35382 sgd_solver.cpp:112] Iteration 20, lr = 0.01
I0515 10:55:51.084617 35382 solver.cpp:239] Iteration 40 (1.05832 iter/s, 18.8979s/20 iters), loss = 2.12924
I0515 10:55:51.084879 35382 solver.cpp:258]     Train net output #0: loss = 2.03532 (* 1 = 2.03532 loss)
I0515 10:55:51.084897 35382 sgd_solver.cpp:112] Iteration 40, lr = 0.01
I0515 10:56:09.980957 35382 solver.cpp:239] Iteration 60 (1.05845 iter/s, 18.8956s/20 iters), loss = 2.04191
I0515 10:56:09.980991 35382 solver.cpp:258]     Train net output #0: loss = 1.8054 (* 1 = 1.8054 loss)
I0515 10:56:09.981035 35382 sgd_solver.cpp:112] Iteration 60, lr = 0.01
I0515 10:56:28.875658 35382 solver.cpp:239] Iteration 80 (1.05853 iter/s, 18.8942s/20 iters), loss = 1.98364
I0515 10:56:28.875864 35382 solver.cpp:258]     Train net output #0: loss = 1.91929 (* 1 = 1.91929 loss)
I0515 10:56:28.875878 35382 sgd_solver.cpp:112] Iteration 80, lr = 0.01
I0515 10:56:47.791654 35382 solver.cpp:239] Iteration 100 (1.05735 iter/s, 18.9153s/20 iters), loss = 1.9244
I0515 10:56:47.791693 35382 solver.cpp:258]     Train net output #0: loss = 1.8284 (* 1 = 1.8284 loss)
I0515 10:56:47.791745 35382 sgd_solver.cpp:112] Iteration 100, lr = 0.01
I0515 10:57:06.693739 35382 solver.cpp:239] Iteration 120 (1.05811 iter/s, 18.9016s/20 iters), loss = 1.7903
I0515 10:57:06.693966 35382 solver.cpp:258]     Train net output #0: loss = 1.4549 (* 1 = 1.4549 loss)
I0515 10:57:06.693981 35382 sgd_solver.cpp:112] Iteration 120, lr = 0.01
I0515 10:57:25.593870 35382 solver.cpp:239] Iteration 140 (1.05823 iter/s, 18.8994s/20 iters), loss = 1.66322
I0515 10:57:25.593910 35382 solver.cpp:258]     Train net output #0: loss = 1.29358 (* 1 = 1.29358 loss)
I0515 10:57:25.593968 35382 sgd_solver.cpp:112] Iteration 140, lr = 0.01
I0515 10:57:44.503170 35382 solver.cpp:239] Iteration 160 (1.05771 iter/s, 18.9088s/20 iters), loss = 1.52208
I0515 10:57:44.503396 35382 solver.cpp:258]     Train net output #0: loss = 1.03544 (* 1 = 1.03544 loss)
I0515 10:57:44.503412 35382 sgd_solver.cpp:112] Iteration 160, lr = 0.01
I0515 10:58:03.408309 35382 solver.cpp:258]     Train net output #0: loss = 0.836138 (* 1 = 0.836138 loss)
I0515 10:58:03.408349 35382 sgd_solver.cpp:112] Iteration 180, lr = 0.01
I0515 10:58:21.384106 35382 solver.cpp:347] Iteration 200, Testing net (#0)
I0515 10:58:21.384551 35382 net.cpp:678] Ignoring source layer loss
I0515 10:58:44.481245 35382 solver.cpp:414]     Test net output #0: accuracy = 0.1787
I0515 10:58:45.433612 35382 solver.cpp:239] Iteration 200 (0.475915 iter/s, 42.0243s/20 iters), loss = 1.16422
I0515 10:58:45.433640 35382 solver.cpp:258]     Train net output #0: loss = 0.627196 (* 1 = 0.627196 loss)
I0515 10:58:45.433701 35382 sgd_solver.cpp:112] Iteration 200, lr = 0.01

I will update the prototxt ASAP, and you can try it again.

jixing0415 commented 5 years ago

> The problem is in the Power layer (x/6).

> What is the problem in the Power layer (x/6)? From the source code, the Power layer computes y = (shift + scale * x)^power.

> The source code is right. But I find that when the scale coefficient of the Power layer is in [0, 1], the training loss stays the same all the time. When I remove this "/6" layer, the loss changes. Does the network work well with your data?

In addition, I will write an h-swish layer (x * ReLU6(x + 3) / 6) to replace the existing multi-layer implementation.
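Until such a fused layer lands, h-swish(x) = x * ReLU6(x + 3) / 6 can be composed from stock layers plus a ReLU6 layer, using a non-in-place Power layer for the /6 step as discussed above. A sketch assuming the fork provides a ReLU6 layer type (all names illustrative):

# Step 1: x + 3
layer {
  name: "hs/shift3"
  type: "Power"
  bottom: "x"
  top: "hs/shift3"
  power_param { power: 1.0 scale: 1.0 shift: 3.0 }
}
# Step 2: clamp to [0, 6]
layer {
  name: "hs/relu6"
  type: "ReLU6"
  bottom: "hs/shift3"
  top: "hs/relu6"
}
# Step 3: divide by 6 (non-in-place Power layer)
layer {
  name: "hs/div6"
  type: "Power"
  bottom: "hs/relu6"
  top: "hs/div6"
  power_param { power: 1.0 scale: 0.1666667 shift: 0.0 }
}
# Step 4: multiply by the original input, x * h-sigmoid(x)
layer {
  name: "hs/mul"
  type: "Eltwise"
  bottom: "x"
  bottom: "hs/div6"
  top: "hs/out"
  eltwise_param { operation: PROD }
}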