kmaninis / OSVOS-caffe

One-Shot Video Object Segmentation
http://vision.ee.ethz.ch/~cvlsegmentation/osvos/
GNU General Public License v3.0

Parent network training, loss = nan ??? #5

Closed zhujingsong closed 6 years ago

zhujingsong commented 6 years ago

I am training the parent model on a Titan XP. I tried base_lr = 1e-8 (the default) and 1e-9, but the loss keeps fluctuating up and down and eventually becomes 'nan' (the log below, with lr = 1e-09, shows 'nan' from iteration 160 onward). Could you tell me how to tackle this problem? I would really appreciate a reply.

I0321 21:52:42.315754 11354 solver.cpp:331] Iteration 0, Testing net (#0)
I0321 21:52:44.937382 11354 solver.cpp:219] Iteration 0 (0 iter/s, 2.62483s/20 iters), loss = 116012
I0321 21:52:44.937474 11354 solver.cpp:238] Train net output #0: dsn2_loss = 45369.6 (* 1 = 45369.6 loss)
I0321 21:52:44.937501 11354 solver.cpp:238] Train net output #1: dsn3_loss = 45369.6 (* 1 = 45369.6 loss)
I0321 21:52:44.937510 11354 solver.cpp:238] Train net output #2: dsn4_loss = 45369.6 (* 1 = 45369.6 loss)
I0321 21:52:44.937517 11354 solver.cpp:238] Train net output #3: dsn5_loss = 45369.6 (* 1 = 45369.6 loss)
I0321 21:52:44.937525 11354 solver.cpp:238] Train net output #4: fuse_loss = 45369.6 (* 1 = 45369.6 loss)
I0321 21:52:44.937536 11354 sgd_solver.cpp:105] Iteration 0, lr = 1e-09
I0321 21:53:38.136545 11354 solver.cpp:219] Iteration 20 (0.375968 iter/s, 53.196s/20 iters), loss = 132717
I0321 21:53:38.136634 11354 solver.cpp:238] Train net output #0: dsn2_loss = 30310.2 (* 1 = 30310.2 loss)
I0321 21:53:38.136662 11354 solver.cpp:238] Train net output #1: dsn3_loss = 29715.9 (* 1 = 29715.9 loss)
I0321 21:53:38.136678 11354 solver.cpp:238] Train net output #2: dsn4_loss = 30441.8 (* 1 = 30441.8 loss)
I0321 21:53:38.136693 11354 solver.cpp:238] Train net output #3: dsn5_loss = 30483.2 (* 1 = 30483.2 loss)
I0321 21:53:38.136708 11354 solver.cpp:238] Train net output #4: fuse_loss = 30387.5 (* 1 = 30387.5 loss)
I0321 21:53:38.136724 11354 sgd_solver.cpp:105] Iteration 20, lr = 1e-09
I0321 21:54:30.748098 11354 solver.cpp:219] Iteration 40 (0.380165 iter/s, 52.6087s/20 iters), loss = 107836
I0321 21:54:30.748172 11354 solver.cpp:238] Train net output #0: dsn2_loss = 19628.8 (* 1 = 19628.8 loss)
I0321 21:54:30.748181 11354 solver.cpp:238] Train net output #1: dsn3_loss = 19075.7 (* 1 = 19075.7 loss)
I0321 21:54:30.748188 11354 solver.cpp:238] Train net output #2: dsn4_loss = 19764.4 (* 1 = 19764.4 loss)
I0321 21:54:30.748196 11354 solver.cpp:238] Train net output #3: dsn5_loss = 19837 (* 1 = 19837 loss)
I0321 21:54:30.748219 11354 solver.cpp:238] Train net output #4: fuse_loss = 19558.3 (* 1 = 19558.3 loss)
I0321 21:54:30.748229 11354 sgd_solver.cpp:105] Iteration 40, lr = 1e-09
I0321 21:55:20.153046 11354 solver.cpp:219] Iteration 60 (0.404839 iter/s, 49.4024s/20 iters), loss = 99894
I0321 21:55:20.153127 11354 solver.cpp:238] Train net output #0: dsn2_loss = 10907.5 (* 1 = 10907.5 loss)
I0321 21:55:20.153141 11354 solver.cpp:238] Train net output #1: dsn3_loss = 9222.18 (* 1 = 9222.18 loss)
I0321 21:55:20.153151 11354 solver.cpp:238] Train net output #2: dsn4_loss = 11842.5 (* 1 = 11842.5 loss)
I0321 21:55:20.153169 11354 solver.cpp:238] Train net output #3: dsn5_loss = 11927.8 (* 1 = 11927.8 loss)
I0321 21:55:20.153182 11354 solver.cpp:238] Train net output #4: fuse_loss = 10795.1 (* 1 = 10795.1 loss)
I0321 21:55:20.153194 11354 sgd_solver.cpp:105] Iteration 60, lr = 1e-09
I0321 21:56:07.384881 11354 solver.cpp:219] Iteration 80 (0.423465 iter/s, 47.2294s/20 iters), loss = 44943.5
I0321 21:56:07.384968 11354 solver.cpp:238] Train net output #0: dsn2_loss = 17211.5 (* 1 = 17211.5 loss)
I0321 21:56:07.384996 11354 solver.cpp:238] Train net output #1: dsn3_loss = 11019.4 (* 1 = 11019.4 loss)
I0321 21:56:07.385004 11354 solver.cpp:238] Train net output #2: dsn4_loss = 18592.5 (* 1 = 18592.5 loss)
I0321 21:56:07.385011 11354 solver.cpp:238] Train net output #3: dsn5_loss = 18892.4 (* 1 = 18892.4 loss)
I0321 21:56:07.385018 11354 solver.cpp:238] Train net output #4: fuse_loss = 14495.2 (* 1 = 14495.2 loss)
I0321 21:56:07.385027 11354 sgd_solver.cpp:105] Iteration 80, lr = 1e-09
I0321 21:56:54.658689 11354 solver.cpp:219] Iteration 100 (0.423089 iter/s, 47.2714s/20 iters), loss = 67899.7
I0321 21:56:54.658788 11354 solver.cpp:238] Train net output #0: dsn2_loss = 14982.7 (* 1 = 14982.7 loss)
I0321 21:56:54.658818 11354 solver.cpp:238] Train net output #1: dsn3_loss = 10885.6 (* 1 = 10885.6 loss)
I0321 21:56:54.658833 11354 solver.cpp:238] Train net output #2: dsn4_loss = 15006.9 (* 1 = 15006.9 loss)
I0321 21:56:54.658844 11354 solver.cpp:238] Train net output #3: dsn5_loss = 15270.3 (* 1 = 15270.3 loss)
I0321 21:56:54.658861 11354 solver.cpp:238] Train net output #4: fuse_loss = 12647.6 (* 1 = 12647.6 loss)
I0321 21:56:54.658876 11354 sgd_solver.cpp:105] Iteration 100, lr = 1e-09
I0321 21:57:43.779109 11354 solver.cpp:219] Iteration 120 (0.407183 iter/s, 49.118s/20 iters), loss = 63917.8
I0321 21:57:43.779191 11354 solver.cpp:238] Train net output #0: dsn2_loss = 9679.94 (* 1 = 9679.94 loss)
I0321 21:57:43.779199 11354 solver.cpp:238] Train net output #1: dsn3_loss = 4138.92 (* 1 = 4138.92 loss)
I0321 21:57:43.779206 11354 solver.cpp:238] Train net output #2: dsn4_loss = 9165.52 (* 1 = 9165.52 loss)
I0321 21:57:43.779215 11354 solver.cpp:238] Train net output #3: dsn5_loss = 11711.8 (* 1 = 11711.8 loss)
I0321 21:57:43.779238 11354 solver.cpp:238] Train net output #4: fuse_loss = 5715.27 (* 1 = 5715.27 loss)
I0321 21:57:43.779247 11354 sgd_solver.cpp:105] Iteration 120, lr = 1e-09
I0321 21:58:35.024227 11354 solver.cpp:219] Iteration 140 (0.3903 iter/s, 51.2426s/20 iters), loss = 87141.5
I0321 21:58:35.024309 11354 solver.cpp:238] Train net output #0: dsn2_loss = 4843.74 (* 1 = 4843.74 loss)
I0321 21:58:35.024336 11354 solver.cpp:238] Train net output #1: dsn3_loss = 2672.27 (* 1 = 2672.27 loss)
I0321 21:58:35.024345 11354 solver.cpp:238] Train net output #2: dsn4_loss = 4293.57 (* 1 = 4293.57 loss)
I0321 21:58:35.024353 11354 solver.cpp:238] Train net output #3: dsn5_loss = 5528.3 (* 1 = 5528.3 loss)
I0321 21:58:35.024363 11354 solver.cpp:238] Train net output #4: fuse_loss = 3142 (* 1 = 3142 loss)
I0321 21:58:35.024382 11354 sgd_solver.cpp:105] Iteration 140, lr = 1e-09
I0321 21:59:14.739044 11354 solver.cpp:219] Iteration 160 (0.503615 iter/s, 39.7129s/20 iters), loss = nan
I0321 21:59:14.739117 11354 solver.cpp:238] Train net output #0: dsn2_loss = nan (* 1 = nan loss)
I0321 21:59:14.739128 11354 solver.cpp:238] Train net output #1: dsn3_loss = nan (* 1 = nan loss)
I0321 21:59:14.739141 11354 solver.cpp:238] Train net output #2: dsn4_loss = nan (* 1 = nan loss)
I0321 21:59:14.739157 11354 solver.cpp:238] Train net output #3: dsn5_loss = nan (* 1 = nan loss)
I0321 21:59:14.739171 11354 solver.cpp:238] Train net output #4: fuse_loss = nan (* 1 = nan loss)
I0321 21:59:14.739183 11354 sgd_solver.cpp:105] Iteration 160, lr = 1e-09
I0321 21:59:55.145210 11354 solver.cpp:219] Iteration 180 (0.494997 iter/s, 40.4043s/20 iters), loss = nan
I0321 21:59:55.145300 11354 solver.cpp:238] Train net output #0: dsn2_loss = nan (* 1 = nan loss)
I0321 21:59:55.145318 11354 solver.cpp:238] Train net output #1: dsn3_loss = nan (* 1 = nan loss)
I0321 21:59:55.145329 11354 solver.cpp:238] Train net output #2: dsn4_loss = nan (* 1 = nan loss)
I0321 21:59:55.145340 11354 solver.cpp:238] Train net output #3: dsn5_loss = nan (* 1 = nan loss)
I0321 21:59:55.145361 11354 solver.cpp:238] Train net output #4: fuse_loss = nan (* 1 = nan loss)
I0321 21:59:55.145376 11354 sgd_solver.cpp:105] Iteration 180, lr = 1e-09
I0321 22:00:36.563983 11354 solver.cpp:219] Iteration 200 (0.482895 iter/s, 41.4168s/20 iters), loss = nan
I0321 22:00:36.564069 11354 solver.cpp:238] Train net output #0: dsn2_loss = nan (* 1 = nan loss)
I0321 22:00:36.564098 11354 solver.cpp:238] Train net output #1: dsn3_loss = nan (* 1 = nan loss)
I0321 22:00:36.564111 11354 solver.cpp:238] Train net output #2: dsn4_loss = nan (* 1 = nan loss)
I0321 22:00:36.564124 11354 solver.cpp:238] Train net output #3: dsn5_loss = nan (* 1 = nan loss)
I0321 22:00:36.564138 11354 solver.cpp:238] Train net output #4: fuse_loss = nan (* 1 = nan loss)
I0321 22:00:36.564154 11354 sgd_solver.cpp:105] Iteration 200, lr = 1e-09
I0321 22:01:18.799461 11354 solver.cpp:219] Iteration 220 (0.473557 iter/s, 42.2335s/20 iters), loss = nan
I0321 22:01:18.799545 11354 solver.cpp:238] Train net output #0: dsn2_loss = nan (* 1 = nan loss)
I0321 22:01:18.799561 11354 solver.cpp:238] Train net output #1: dsn3_loss = nan (* 1 = nan loss)
I0321 22:01:18.799572 11354 solver.cpp:238] Train net output #2: dsn4_loss = nan (* 1 = nan loss)
I0321 22:01:18.799583 11354 solver.cpp:238] Train net output #3: dsn5_loss = nan (* 1 = nan loss)
I0321 22:01:18.799597 11354 solver.cpp:238] Train net output #4: fuse_loss = nan (* 1 = nan loss)
I0321 22:01:18.799616 11354 sgd_solver.cpp:105] Iteration 220, lr = 1e-09
I0321 22:01:59.521841 11354 solver.cpp:219] Iteration 240 (0.491153 iter/s, 40.7205s/20 iters), loss = nan
I0321 22:01:59.521919 11354 solver.cpp:238] Train net output #0: dsn2_loss = nan (* 1 = nan loss)
I0321 22:01:59.521931 11354 solver.cpp:238] Train net output #1: dsn3_loss = nan (* 1 = nan loss)
I0321 22:01:59.521968 11354 solver.cpp:238] Train net output #2: dsn4_loss = nan (* 1 = nan loss)
I0321 22:01:59.522012 11354 solver.cpp:238] Train net output #3: dsn5_loss = nan (* 1 = nan loss)
I0321 22:01:59.522033 11354 solver.cpp:238] Train net output #4: fuse_loss = nan (* 1 = nan loss)
I0321 22:01:59.522073 11354 sgd_solver.cpp:105] Iteration 240, lr = 1e-09
I0321 22:02:42.450592 11354 solver.cpp:219] Iteration 260 (0.465909 iter/s, 42.9268s/20 iters), loss = nan
I0321 22:02:42.450680 11354 solver.cpp:238] Train net output #0: dsn2_loss = nan (* 1 = nan loss)
I0321 22:02:42.450693 11354 solver.cpp:238] Train net output #1: dsn3_loss = nan (* 1 = nan loss)
I0321 22:02:42.450713 11354 solver.cpp:238] Train net output #2: dsn4_loss = nan (* 1 = nan loss)
I0321 22:02:42.450726 11354 solver.cpp:238] Train net output #3: dsn5_loss = nan (* 1 = nan loss)
I0321 22:02:42.450744 11354 solver.cpp:238] Train net output #4: fuse_loss = nan (* 1 = nan loss)
I0321 22:02:42.450754 11354 sgd_solver.cpp:105] Iteration 260, lr = 1e-09
I0321 22:03:24.126384 11354 solver.cpp:219] Iteration 280 (0.479916 iter/s, 41.6739s/20 iters), loss = nan
I0321 22:03:24.126468 11354 solver.cpp:238] Train net output #0: dsn2_loss = nan (* 1 = nan loss)
I0321 22:03:24.126497 11354 solver.cpp:238] Train net output #1: dsn3_loss = nan (* 1 = nan loss)
I0321 22:03:24.126510 11354 solver.cpp:238] Train net output #2: dsn4_loss = nan (* 1 = nan loss)
I0321 22:03:24.126528 11354 solver.cpp:238] Train net output #3: dsn5_loss = nan (* 1 = nan loss)
I0321 22:03:24.126543 11354 solver.cpp:238] Train net output #4: fuse_loss = nan (* 1 = nan loss)
I0321 22:03:24.126565 11354 sgd_solver.cpp:105] Iteration 280, lr = 1e-09
I0321 22:04:02.832152 11354 solver.cpp:219] Iteration 300 (0.516742 iter/s, 38.7041s/20 iters), loss = nan
I0321 22:04:02.832226 11354 solver.cpp:238] Train net output #0: dsn2_loss = nan (* 1 = nan loss)
I0321 22:04:02.832234 11354 solver.cpp:238] Train net output #1: dsn3_loss = nan (* 1 = nan loss)
I0321 22:04:02.832247 11354 solver.cpp:238] Train net output #2: dsn4_loss = nan (* 1 = nan loss)
I0321 22:04:02.832253 11354 solver.cpp:238] Train net output #3: dsn5_loss = nan (* 1 = nan loss)
I0321 22:04:02.832259 11354 solver.cpp:238] Train net output #4: fuse_loss = nan (* 1 = nan loss)
I0321 22:04:02.832267 11354 sgd_solver.cpp:105] Iteration 300, lr = 1e-09
I0321 22:04:43.423032 11354 solver.cpp:219] Iteration 320 (0.492743 iter/s, 40.5891s/20 iters), loss = nan
I0321 22:04:43.423207 11354 solver.cpp:238] Train net output #0: dsn2_loss = nan (* 1 = nan loss)
I0321 22:04:43.423264 11354 solver.cpp:238] Train net output #1: dsn3_loss = nan (* 1 = nan loss)
I0321 22:04:43.423312 11354 solver.cpp:238] Train net output #2: dsn4_loss = nan (* 1 = nan loss)
I0321 22:04:43.423359 11354 solver.cpp:238] Train net output #3: dsn5_loss = nan (* 1 = nan loss)
I0321 22:04:43.423405 11354 solver.cpp:238] Train net output #4: fuse_loss = nan (* 1 = nan loss)
I0321 22:04:43.423454 11354 sgd_solver.cpp:105] Iteration 320, lr = 1e-09
I0321 22:05:22.978026 11354 solver.cpp:219] Iteration 340 (0.505648 iter/s, 39.5532s/20 iters), loss = nan
I0321 22:05:22.978121 11354 solver.cpp:238] Train net output #0: dsn2_loss = nan (* 1 = nan loss)
I0321 22:05:22.978135 11354 solver.cpp:238] Train net output #1: dsn3_loss = nan (* 1 = nan loss)
I0321 22:05:22.978149 11354 solver.cpp:238] Train net output #2: dsn4_loss = nan (* 1 = nan loss)
I0321 22:05:22.978162 11354 solver.cpp:238] Train net output #3: dsn5_loss = nan (* 1 = nan loss)
I0321 22:05:22.978175 11354 solver.cpp:238] Train net output #4: fuse_loss = nan (* 1 = nan loss)
I0321 22:05:22.978186 11354 sgd_solver.cpp:105] Iteration 340, lr = 1e-09
I0321 22:06:02.740044 11354 solver.cpp:219] Iteration 360 (0.503014 iter/s, 39.7603s/20 iters), loss = nan
I0321 22:06:02.740128 11354 solver.cpp:238] Train net output #0: dsn2_loss = nan (* 1 = nan loss)
I0321 22:06:02.740154 11354 solver.cpp:238] Train net output #1: dsn3_loss = nan (* 1 = nan loss)
I0321 22:06:02.740160 11354 solver.cpp:238] Train net output #2: dsn4_loss = nan (* 1 = nan loss)
I0321 22:06:02.740169 11354 solver.cpp:238] Train net output #3: dsn5_loss = nan (* 1 = nan loss)
I0321 22:06:02.740175 11354 solver.cpp:238] Train net output #4: fuse_loss = nan (* 1 = nan loss)

kmaninis commented 6 years ago

Hi, does this happen when training on the DAVIS dataset?

zhujingsong commented 6 years ago

Yes. I downloaded the DAVIS 2016 dataset and augmented it to match 'trainpairs.txt'. Then I set 'DATA_DIR_ROOT' and started training, following the steps in the readme. I am still trying to find out what goes wrong.

zhujingsong commented 6 years ago

The training process is really tricky. I followed the steps, and now the problem is that the loss keeps fluctuating between roughly 20000 and 30000, as in the log below. Do you have any idea why?

What does the loss look like when you are training?

I0322 22:55:30.170424 19131 solver.cpp:219] Iteration 4800 (0.390295 iter/s, 51.2433s/20 iters), loss = 31645.1
I0322 22:55:30.170526 19131 solver.cpp:238] Train net output #0: dsn2_loss = 6840.05 (* 1 = 6840.05 loss)
I0322 22:55:30.170538 19131 solver.cpp:238] Train net output #1: dsn3_loss = 3939.92 (* 1 = 3939.92 loss)
I0322 22:55:30.170545 19131 solver.cpp:238] Train net output #2: dsn4_loss = 2079.14 (* 1 = 2079.14 loss)
I0322 22:55:30.170553 19131 solver.cpp:238] Train net output #3: dsn5_loss = 1395.85 (* 1 = 1395.85 loss)
I0322 22:55:30.170562 19131 solver.cpp:238] Train net output #4: fuse_loss = 1333.71 (* 1 = 1333.71 loss)
I0322 22:55:30.170588 19131 sgd_solver.cpp:105] Iteration 4800, lr = 1e-08
I0322 22:56:15.607455 19131 solver.cpp:219] Iteration 4820 (0.440192 iter/s, 45.4347s/20 iters), loss = 28704.2
I0322 22:56:15.607555 19131 solver.cpp:238] Train net output #0: dsn2_loss = 4781.82 (* 1 = 4781.82 loss)
I0322 22:56:15.607583 19131 solver.cpp:238] Train net output #1: dsn3_loss = 2303.21 (* 1 = 2303.21 loss)
I0322 22:56:15.607591 19131 solver.cpp:238] Train net output #2: dsn4_loss = 1503.31 (* 1 = 1503.31 loss)
I0322 22:56:15.607599 19131 solver.cpp:238] Train net output #3: dsn5_loss = 1911.06 (* 1 = 1911.06 loss)
I0322 22:56:15.607606 19131 solver.cpp:238] Train net output #4: fuse_loss = 1549.09 (* 1 = 1549.09 loss)
I0322 22:56:15.607615 19131 sgd_solver.cpp:105] Iteration 4820, lr = 1e-08
I0322 22:57:04.708586 19131 solver.cpp:219] Iteration 4840 (0.407343 iter/s, 49.0987s/20 iters), loss = 17773.7
I0322 22:57:04.708680 19131 solver.cpp:238] Train net output #0: dsn2_loss = 8524.32 (* 1 = 8524.32 loss)
I0322 22:57:04.708704 19131 solver.cpp:238] Train net output #1: dsn3_loss = 4778.96 (* 1 = 4778.96 loss)
I0322 22:57:04.708711 19131 solver.cpp:238] Train net output #2: dsn4_loss = 3759.5 (* 1 = 3759.5 loss)
I0322 22:57:04.708719 19131 solver.cpp:238] Train net output #3: dsn5_loss = 2714.62 (* 1 = 2714.62 loss)
I0322 22:57:04.708726 19131 solver.cpp:238] Train net output #4: fuse_loss = 2164.28 (* 1 = 2164.28 loss)
I0322 22:57:04.708735 19131 sgd_solver.cpp:105] Iteration 4840, lr = 1e-08
I0322 22:57:55.019158 19131 solver.cpp:219] Iteration 4860 (0.39755 iter/s, 50.3081s/20 iters), loss = 17791.3
I0322 22:57:55.019253 19131 solver.cpp:238] Train net output #0: dsn2_loss = 6271.85 (* 1 = 6271.85 loss)
I0322 22:57:55.019265 19131 solver.cpp:238] Train net output #1: dsn3_loss = 4078.07 (* 1 = 4078.07 loss)
I0322 22:57:55.019287 19131 solver.cpp:238] Train net output #2: dsn4_loss = 1559.87 (* 1 = 1559.87 loss)
I0322 22:57:55.019295 19131 solver.cpp:238] Train net output #3: dsn5_loss = 1797.01 (* 1 = 1797.01 loss)
I0322 22:57:55.019304 19131 solver.cpp:238] Train net output #4: fuse_loss = 1200.79 (* 1 = 1200.79 loss)
I0322 22:57:55.019312 19131 sgd_solver.cpp:105] Iteration 4860, lr = 1e-08
I0322 22:58:42.910308 19131 solver.cpp:219] Iteration 4880 (0.417634 iter/s, 47.8888s/20 iters), loss = 26082.1
I0322 22:58:42.910398 19131 solver.cpp:238] Train net output #0: dsn2_loss = 310.295 (* 1 = 310.295 loss)
I0322 22:58:42.910411 19131 solver.cpp:238] Train net output #1: dsn3_loss = 161.835 (* 1 = 161.835 loss)
I0322 22:58:42.910434 19131 solver.cpp:238] Train net output #2: dsn4_loss = 101.419 (* 1 = 101.419 loss)
I0322 22:58:42.910454 19131 solver.cpp:238] Train net output #3: dsn5_loss = 145.774 (* 1 = 145.774 loss)
I0322 22:58:42.910462 19131 solver.cpp:238] Train net output #4: fuse_loss = 162.083 (* 1 = 162.083 loss)
I0322 22:58:42.910472 19131 sgd_solver.cpp:105] Iteration 4880, lr = 1e-08
I0322 22:59:31.013831 19131 solver.cpp:219] Iteration 4900 (0.41579 iter/s, 48.1013s/20 iters), loss = 18593.2
I0322 22:59:31.013931 19131 solver.cpp:238] Train net output #0: dsn2_loss = 12270.6 (* 1 = 12270.6 loss)
I0322 22:59:31.013960 19131 solver.cpp:238] Train net output #1: dsn3_loss = 5063.43 (* 1 = 5063.43 loss)
I0322 22:59:31.013989 19131 solver.cpp:238] Train net output #2: dsn4_loss = 1855.32 (* 1 = 1855.32 loss)
I0322 22:59:31.014012 19131 solver.cpp:238] Train net output #3: dsn5_loss = 1710.37 (* 1 = 1710.37 loss)
I0322 22:59:31.014035 19131 solver.cpp:238] Train net output #4: fuse_loss = 1557.83 (* 1 = 1557.83 loss)
I0322 22:59:31.014050 19131 sgd_solver.cpp:105] Iteration 4900, lr = 1e-08
I0322 23:00:19.233353 19131 solver.cpp:219] Iteration 4920 (0.414789 iter/s, 48.2173s/20 iters), loss = 19357.4
I0322 23:00:19.233439 19131 solver.cpp:238] Train net output #0: dsn2_loss = 3033.52 (* 1 = 3033.52 loss)
I0322 23:00:19.233469 19131 solver.cpp:238] Train net output #1: dsn3_loss = 2056.26 (* 1 = 2056.26 loss)
I0322 23:00:19.233476 19131 solver.cpp:238] Train net output #2: dsn4_loss = 1136.81 (* 1 = 1136.81 loss)
I0322 23:00:19.233484 19131 solver.cpp:238] Train net output #3: dsn5_loss = 1568.95 (* 1 = 1568.95 loss)
I0322 23:00:19.233491 19131 solver.cpp:238] Train net output #4: fuse_loss = 1041.03 (* 1 = 1041.03 loss)
I0322 23:00:19.233499 19131 sgd_solver.cpp:105] Iteration 4920, lr = 1e-08

kmaninis commented 6 years ago

The reason for the fluctuating loss is that we do not normalize the loss by the size of the augmented image. So if a minibatch contains many small-scale images, its loss is going to be smaller. If you change this and normalize the loss, keep in mind that the learning rate needs to be adjusted accordingly.
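As a rough illustration of the point about normalization, here is a minimal Python sketch, not the repository's loss layer. A plain per-pixel binary cross-entropy stands in for the actual loss, and the function names, the toy pixel values, and the 480x854 frame size are all assumptions made for the example. It shows that a summed loss grows with the pixel count while a normalized one does not, and that dividing the loss by N shrinks the gradients by N, so the learning rate would need to grow by about the same factor.

```python
import math

def summed_bce(probs, labels, eps=1e-12):
    """Per-pixel binary cross-entropy, summed over all pixels (no normalization)."""
    return -sum(
        y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
        for p, y in zip(probs, labels)
    )

def normalized_bce(probs, labels):
    """Same loss divided by the number of pixels."""
    return summed_bce(probs, labels) / len(probs)

def rescaled_lr(base_lr, num_pixels):
    """Rough learning rate that keeps the SGD step size after dividing the loss by num_pixels."""
    return base_lr * num_pixels

# Two hypothetical images with the same per-pixel error but different sizes.
small = ([0.7] * 100, [1] * 100)
large = ([0.7] * 400, [1] * 400)

print(summed_bce(*small), summed_bce(*large))          # second value is about 4x the first
print(normalized_bce(*small), normalized_bce(*large))  # about equal
print(rescaled_lr(1e-8, 480 * 854))                    # assumed 480x854 frame
```

With images of varying size there is no single N, so the rescaled value is only a starting point for tuning, not a drop-in replacement for base_lr.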

Having said that, I cannot reproduce your nan error. Was there a bug that has since been fixed?

MathLee commented 6 years ago

Hi, you should focus on Train net output #4: fuse_loss. This is the loss on the final mask map of the input frame; you can find the network definition in ~/OSVOS-caffe-master/src/parent/solvers/train_val_step1.prototxt. Looking at your training log, fuse_loss is generally declining, and it fluctuates between a few hundred and maybe one or two thousand.
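To follow fuse_loss on its own rather than the combined loss, one option is to pull it out of the Caffe log with a small script. The following is a sketch only: the function names are hypothetical, the regular expressions assume the log format shown in this thread, and the sample lines are copied from the first log above.

```python
import math
import re

# Matches either an "Iteration N (...), loss = X" header or a
# "Train net output #k: name = X" entry, in the order they appear.
PATTERN = re.compile(
    r"Iteration (?P<iter>\d+) \([^)]*\), loss = (?P<total>\S+)"
    r"|Train net output #\d+: (?P<name>\w+) = (?P<value>\S+)"
)

def parse_losses(log_text):
    """Return one dict per logged iteration with the total and per-output losses."""
    records = []
    current = None
    for m in PATTERN.finditer(log_text):
        if m.group("iter") is not None:
            current = {"iteration": int(m.group("iter")),
                       "loss": float(m.group("total"))}
            records.append(current)
        elif current is not None:
            current[m.group("name")] = float(m.group("value"))
    return records

def first_nan_iteration(records, key="fuse_loss"):
    """Return the first logged iteration where `key` is nan, or None."""
    for r in records:
        if key in r and math.isnan(r[key]):
            return r["iteration"]
    return None

SAMPLE = """\
I0321 21:58:35.024227 11354 solver.cpp:219] Iteration 140 (0.3903 iter/s, 51.2426s/20 iters), loss = 87141.5
I0321 21:58:35.024363 11354 solver.cpp:238] Train net output #4: fuse_loss = 3142 (* 1 = 3142 loss)
I0321 21:59:14.739044 11354 solver.cpp:219] Iteration 160 (0.503615 iter/s, 39.7129s/20 iters), loss = nan
I0321 21:59:14.739171 11354 solver.cpp:238] Train net output #4: fuse_loss = nan (* 1 = nan loss)
"""

records = parse_losses(SAMPLE)
print([(r["iteration"], r["fuse_loss"]) for r in records])
print(first_nan_iteration(records))  # prints 160
```

Because the log is only written every 20 iterations, the reported iteration is the first logged one with nan; the divergence itself happened somewhere in the preceding 20 steps.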
