Training stops running after first epoch

Ataraxy commented 7 years ago

Hey there.

It appears that the training script now hangs after the first epoch completes and no longer continues on to the next.

hwalsuklee commented 7 years ago

First, check your code version. I remember that an older version hangs in the while loop after one epoch. Otherwise, note that the epoch index is zero-based: if you set the number of epochs to two, training ends with epoch index 1, not 2. Also please check 'iter' in the printed line. It is the 'global_step' value and must reach (number of epochs)*(number of images in DB = 82776)/(batch size). With number of epochs = 2 and batch size = 8, training ends at global step 20694.
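
The expected final step can be checked with a few lines of Python. This is only a sketch of the arithmetic above, assuming each epoch runs (number of images) // (batch size) iterations; the function name `expected_final_step` is hypothetical and not part of the repository.

```python
def expected_final_step(num_epochs, num_examples, batch_size):
    """Global step at which training should end.

    Assumes each epoch runs num_examples // batch_size iterations.
    """
    return num_epochs * (num_examples // batch_size)


NUM_EXAMPLES = 82776  # number of images in the DB, as quoted above

# 2 epochs at batch size 8
print(expected_final_step(2, NUM_EXAMPLES, 8))  # 20694
```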

Ataraxy commented 7 years ago

The last two iterations before hanging were:

epoch : 0, iter : 20694, L_total : 7.95319e+06, L_content : 4.40477e+06, L_style : 3.34505e+06, L_tv : 203373
epoch : 0, iter : 20695, L_total : 8.73966e+06, L_content : 5.60493e+06, L_style : 2.92885e+06, L_tv : 205872

Though that's with epoch = 2 and batch size = 4.

I did update everything since your last changes and did successfully train models to completion before that.

Sorry to be a bother.

hwalsuklee commented 7 years ago

Oops, I found a missing statement. I'm on my way to the office. If you want to fix the code yourself: right after the epoch index is updated at the end of the first while loop (line number 256), there must be the statement iterations = iterations - epoch*(num_examples // self.batch_size)

hwalsuklee commented 7 years ago

Sorry..

The statement must be

iterations = step - epoch*(num_examples // self.batch_size)
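
To see what this statement computes, here is a minimal sketch that wraps it in a standalone function and feeds it the step values from the log above. The wrapper `iterations_in_epoch` is hypothetical; only the expression and the names step, epoch, num_examples and batch_size come from the statement. It illustrates the arithmetic only, not the surrounding training loop, which is not shown in this thread.

```python
def iterations_in_epoch(step, epoch, num_examples, batch_size):
    # Per-epoch counter: global step minus the steps used by completed epochs.
    return step - epoch * (num_examples // batch_size)


NUM_EXAMPLES = 82776  # number of images in the DB
BATCH_SIZE = 4        # batch size from the report above

# With batch size 4, one epoch is 82776 // 4 = 20694 steps, so once the
# epoch index has been updated to 1 the counter restarts from zero.
print(iterations_in_epoch(20694, 1, NUM_EXAMPLES, BATCH_SIZE))  # 0
print(iterations_in_epoch(20695, 1, NUM_EXAMPLES, BATCH_SIZE))  # 1
```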

Ataraxy commented 7 years ago

Thanks, I'll try this!

Ataraxy commented 7 years ago

The fix you made works, thanks!

Just for posterity in case someone comes along, the above wasn't the solution but this commit fixed it.

hwalsuklee / tensorflow-fast-style-transfer

Training stops running after first epoch #3