Closed (Ataraxy closed 7 years ago)
First, check your code version. I remember that an older version hangs in the while loop after one epoch. Also note that the epoch index is zero-based: if you set the number of epochs to two, training ends at epoch index 1, not 2. Please also check 'iter' in the print statement. It is the 'global_step' value, and at the end of training it must equal (number of epochs)*(number of images in DB = 82776)/(batch size). With number of epochs = 2 and batch size = 8, training ends at global step 20694.
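The arithmetic above can be checked with a short sketch. The helper names `steps_per_epoch` and `expected_final_step` are hypothetical and not taken from the repository; only the dataset size (82776) and the epoch and batch-size values come from this thread.

```python
NUM_EXAMPLES = 82776  # number of images in the DB, as stated above


def steps_per_epoch(num_examples, batch_size):
    # Integer division: a trailing partial batch is not counted.
    return num_examples // batch_size


def expected_final_step(num_epochs, num_examples, batch_size):
    # Global step at which training should end.
    return num_epochs * steps_per_epoch(num_examples, batch_size)


# Values mentioned in this thread.
print(expected_final_step(2, NUM_EXAMPLES, 8))  # 20694
print(steps_per_epoch(NUM_EXAMPLES, 4))         # 20694
print(expected_final_step(2, NUM_EXAMPLES, 4))  # 41388
```

Note that with batch size 4, a single epoch is already 20694 steps, so reaching that step count means one epoch has finished, not two.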
The last two iterations before hanging were:
epoch : 0, iter : 20694, L_total : 7.95319e+06, L_content : 4.40477e+06, L_style : 3.34505e+06, L_tv : 203373
epoch : 0, iter : 20695, L_total : 8.73966e+06, L_content : 5.60493e+06, L_style : 2.92885e+06, L_tv : 205872
Though that's with epoch = 2 and batch size = 4
I did update everything since your last changes and did successfully train models to completion before that.
Sorry to be a bother.
Oops, I found a missing statement. I'm on my way to my office. If you want to fix the code yourself, please check that after the epoch index is updated at the end of the first while loop (line number 256), there is the statement iterations = iterations - epoch*(num_examples // self.batch_size)
Sorry..
The statement must be
iterations = step - epoch*(num_examples // self.batch_size)
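As a rough illustration of what this statement computes, here is a minimal sketch that maps a global step to an epoch index and a within-epoch iteration count. The function `split_step` is hypothetical and is not the repository's training loop; it assumes the epoch index is the number of completed epochs. The thread notes further down that the actual fix landed in a separate commit.

```python
def split_step(step, num_examples, batch_size):
    """Return (epoch, iterations) for a global step.

    Assumption: epoch is the number of full epochs already completed,
    and iterations is the step count within the current epoch.
    """
    per_epoch = num_examples // batch_size
    epoch = step // per_epoch
    # Same form as the statement above, without the `self.` prefix.
    iterations = step - epoch * per_epoch
    return epoch, iterations


# With 82776 images and batch size 4, one epoch is 20694 steps.
print(split_step(20693, 82776, 4))  # (0, 20693)
print(split_step(20694, 82776, 4))  # (1, 0)
```

Under these assumptions the within-epoch counter returns to 0 at step 20694, which is the point where the log above stops advancing the epoch index.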
Thanks, I'll try this!
The fix you made works, thanks!
Just for posterity, in case someone comes along: the above wasn't the solution, but this commit fixed it.
Hey there.
It appears that the training script now hangs after the first epoch completes and no longer continues on to the next.