Model collapsed, why?

robotzheng commented 4 years ago

Evaluating ...
Eval batch 0 - 4 ...
Eval batch 4 - 8 ...
Eval batch 8 - 12 ...
Eval batch 12 - 16 ...
Eval batch 16 - 20 ...
Eval PSNR: [30.5138], MSE: [0.00111861] cost 278.3528826236725s.
2019-10-30 19:45:50 Step:30020, loss:0.01303278747946024
2019-10-30 19:46:01 Step:30040, loss:0.013723977841436863
2019-10-30 19:46:11 Step:30060, loss:0.011070373468101025
2019-10-30 19:46:22 Step:30080, loss:0.010913790203630924
2019-10-30 19:46:32 Step:30100, loss:0.014763432554900646
2019-10-30 19:46:43 Step:30120, loss:0.014030229300260544
2019-10-30 19:46:54 Step:30140, loss:0.008719007484614849
2019-10-30 19:47:04 Step:30160, loss:0.010375003330409527
2019-10-30 19:47:15 Step:30180, loss:0.009562562219798565
2019-10-30 19:47:25 Step:30200, loss:0.016761092469096184
Model collapsed with loss=40.253211975097656
Model collapsed with loss=612.2398071289062
Model collapsed with loss=95997.6875
Model collapsed with loss=43.66868591308594
Model collapsed with loss=363.3008728027344
Model collapsed with loss=889.5302124023438
Model collapsed with loss=44114.671875
Model collapsed with loss=1508190.375
2019-10-30 19:47:36 Step:30220, loss:1508190.375
Model collapsed with loss=133606592.0

psychopa4 commented 4 years ago

Training collapse can be caused by gradient explosion. When it happens, you should stop training, reload the last checkpoint saved before the collapse, and resume training from there.

If training collapses frequently, try a smaller learning rate.

I have added a break to these models, so they now stop training as soon as a collapse happens.
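The stop-on-collapse idea can be sketched in plain Python. This is only an illustration, not the repository's actual check: the names `is_collapsed` and `train_with_guard`, the threshold of 10.0, and the checkpoint interval are all hypothetical, and the loss values are rounded from the log above.

```python
import math


def is_collapsed(loss, threshold=10.0):
    """Return True if the loss is non-finite or above a threshold.

    The threshold is a hypothetical value chosen for illustration;
    normal losses in the log above are around 0.01.
    """
    return not math.isfinite(loss) or loss > threshold


def train_with_guard(losses, save_every=2, threshold=10.0):
    """Walk over per-step losses and stop at the first collapse.

    Returns (last_checkpoint_step, collapsed_step). collapsed_step is
    None when no collapse occurred. Recording a step number stands in
    for saving a real checkpoint.
    """
    last_checkpoint = None
    for step, loss in enumerate(losses):
        if is_collapsed(loss, threshold):
            # Stop immediately; the caller reloads last_checkpoint.
            return last_checkpoint, step
        if step % save_every == 0:
            last_checkpoint = step
    return last_checkpoint, None


# Losses rounded from the log above: stable, then exploding.
print(train_with_guard([0.013, 0.014, 0.011, 0.010, 40.25, 612.2]))
```

In a real training loop, the returned checkpoint step would tell you which saved checkpoint to restore before resuming, possibly with a smaller learning rate.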

robotzheng commented 4 years ago

Thanks for your answer. Could you share the steps (boundaries) you use for piecewise_constant?

psychopa4 commented 4 years ago

We use the following code for learning rate decay.

self.learning_rate=1e-3
self.end_lr=1e-4
self.decay_step=1.2e5
tf.train.polynomial_decay(self.learning_rate, global_step, self.decay_step, end_learning_rate=self.end_lr, power=1.)

The learning rate decays gradually (linearly, since power=1) from 1e-3 to 1e-4 over 1.2e5 iterations. We then train the network with lr=1e-4 until 1.5e5 iterations, after which we set the learning rate manually as follows:

boundaries=[1.5e5, 1.7e5, 1.9e5]
values=[1e-4, 0.5e-4, 0.25e-4, 0.1e-4]
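To make the resulting schedule concrete, here is a pure-Python sketch that mirrors the formulas behind `tf.train.polynomial_decay` and `tf.train.piecewise_constant` without requiring TensorFlow. How the two phases are joined (switching at the first boundary, 1.5e5) is an assumption based on the description above, and the function names are illustrative only.

```python
from bisect import bisect_left

INITIAL_LR = 1e-3
END_LR = 1e-4
DECAY_STEPS = 1.2e5
BOUNDARIES = [1.5e5, 1.7e5, 1.9e5]
VALUES = [1e-4, 0.5e-4, 0.25e-4, 0.1e-4]


def polynomial_decay(step, initial_lr=INITIAL_LR, end_lr=END_LR,
                     decay_steps=DECAY_STEPS, power=1.0):
    """Decay from initial_lr to end_lr; stays at end_lr after decay_steps."""
    step = min(step, decay_steps)
    return (initial_lr - end_lr) * (1 - step / decay_steps) ** power + end_lr


def piecewise_constant(step, boundaries=BOUNDARIES, values=VALUES):
    """values[0] for step <= boundaries[0], values[-1] past the last boundary."""
    return values[bisect_left(boundaries, step)]


def learning_rate(step):
    """Assumed combination: linear decay first, manual steps after 1.5e5."""
    if step <= BOUNDARIES[0]:
        return polynomial_decay(step)
    return piecewise_constant(step)


for step in (0, 60000, 120000, 150000, 160000, 180000, 200000):
    print(step, learning_rate(step))
```

Under these assumptions the rate falls linearly to 1e-4 by step 1.2e5, holds there until 1.5e5, and then drops to 0.5e-4, 0.25e-4, and 0.1e-4 after each boundary.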

robotzheng commented 4 years ago

Thanks a lot.

Marshall-yao commented 4 years ago

@psychopa4 Hi, thanks for your wonderful work and its strong performance.

However, training stops several times because of the collapse, which is extremely troublesome. Could you optimize the code so that training proceeds normally until the predetermined number of epochs?

Thanks.

psychopa4 commented 4 years ago

You may as well lower the initial learning rate, e.g. 0.9e-4, 0.8e-4.

Marshall-yao commented 4 years ago

Dear psychopa4: When I decrease the initial learning rate from 1e-3 to 0.9e-4 or 0.8e-4, how should I set the end learning rate? In addition, apart from reducing the initial learning rate, I resume training whenever the model collapses, without changing the initial learning rate.

Also, how do I test on the Vid4 test set?

Best regards,

psychopa4 commented 4 years ago

Sorry, I wrote the number wrong; it should be 0.9e-3. To test, use the testvideos() function of the corresponding model.

Marshall-yao commented 4 years ago

Dear psychopa4:

1) Thanks for your sincere reply.

2) When I set the path of testvideos() to './data/test/vid4', I got the following error:

Traceback (most recent call last):
  File "main.py", line 15, in <module>
    model.testvideos('./data/test/vid4')
  File "/home/19PFNL/model/pfnl.py", line 335, in testvideos
    self.test_video_truth(datapath, name=name, reuse=reuse, part=1000)
  File "/home/19PFNL/model/pfnl.py", line 210, in test_video_truth
    if max_frame % part == 0 :
ZeroDivisionError: integer division or modulo by zero

Does testvideos() need to be modified?

3) It would also be great if you could provide a PyTorch version of the code.

Best regards,

psychopa4 / PFNL

Model collapsed, why? #2