https://github.com/tgc1997/RMN/blob/14a9eff9a936030bcea104cb2c65f5378136cd87/train.py#L128
Hi, Ganchao. I found that the check above may miss some cases when running the project.
e.g., when `train_batch_size` is set to 2 or 3, the `train_loader` has 24390 steps (48779 / 2 = 24389.5) or 16260 steps (48779 / 3 = 16259.67), respectively, where 48779 is the total number of samples in the MSVD dataset. Since the division is not exact, the last step (the 24390th or 16260th) contains only 1 or 2 samples. That step does not meet the condition `bsz == opt.train_batch_size`, so `loss_count` gets divided by `i % 10`, which is 0 there. Oops! :(
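To make the failure mode concrete, here is a minimal, self-contained sketch of the step arithmetic only (the sample count and the 1-based step numbering are taken from the description above; the actual loop in train.py may count steps differently):

```python
import math

# Numbers from the description above; only the step arithmetic is simulated here.
num_samples = 48779            # MSVD training samples
train_batch_size = 2

num_steps = math.ceil(num_samples / train_batch_size)        # 24390 steps
last_bsz = num_samples - (num_steps - 1) * train_batch_size  # 1 sample in the last step

i = num_steps                  # assuming the step counter reaches 24390 at the last batch
divisor = i % 10               # 24390 % 10 == 0
print(num_steps, last_bsz, divisor)
# 24390 1 0  -> loss_count / divisor would raise ZeroDivisionError
```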
It could be refined as follows:
```python
if bsz == opt.train_batch_size:
    # full batch: average over the last 10 steps as before
    loss_count /= 10
elif bsz < opt.train_batch_size and i % 10 == 0:
    # last (incomplete) batch lands exactly on a multiple of 10,
    # so i % 10 would be 0 -- avoid dividing by zero
    loss_count /= 10
else:
    # last (incomplete) batch partway through a 10-step window
    loss_count /= i % 10
```
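As a quick sanity check, the following standalone snippet simulates only the step counting (the helper `safe_divisor` and the 1-based step counter are my own assumptions, not code from train.py) and confirms the refined branching never yields a zero divisor:

```python
def safe_divisor(i, bsz, train_batch_size):
    """Divisor the refined check would use at step i (hypothetical helper)."""
    if bsz == train_batch_size:
        return 10
    elif bsz < train_batch_size and i % 10 == 0:
        return 10
    else:
        return i % 10

num_samples = 48779                                   # MSVD training samples
for train_batch_size in (2, 3, 64, 128):
    num_steps = -(-num_samples // train_batch_size)   # ceiling division
    last_bsz = num_samples - (num_steps - 1) * train_batch_size
    d = safe_divisor(num_steps, last_bsz, train_batch_size)
    assert d != 0
    print(train_batch_size, num_steps, last_bsz, d)
```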
The project is now training again on my server. If it still runs well after one epoch, I will come back and report.