Open noble6emc2 opened 6 years ago
The learning rate for BIDAF is fixed at 2 because it uses AdaDelta. BTW, for tracking training progress I think the loss on the training data is a better indicator. The initial validation loss of 6.9981 after restore might have something to do with the model's exponential moving average (EMA), which is not saved in the checkpoint. Looking at the training loss and validation loss after a few epochs, resuming from the checkpoint seems to work as expected.
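The EMA effect described above can be illustrated with a minimal, framework-agnostic sketch (plain Python; the names and numbers are illustrative, not taken from the BIDAF script). The EMA copy used for validation lags the raw parameter, so a checkpoint that stores only the raw value cannot reproduce the EMA weights on restore:

```python
# Minimal sketch of a parameter EMA (illustrative values, not from the script).
# Validation typically uses the EMA weights; if the checkpoint stores only the
# raw parameter, the EMA state is lost and must be re-synchronized on restore.

def ema_update(ema, value, decay=0.999):
    """One EMA step: ema <- decay * ema + (1 - decay) * value."""
    return decay * ema + (1 - decay) * value

param = 1.0   # raw parameter value
ema = param   # EMA initialized to the parameter

# Simulate training steps where the raw parameter drifts.
for step in range(100):
    param += 0.1               # stand-in for an optimizer update
    ema = ema_update(ema, param)

# The EMA trails the raw value, so restoring only `param` and re-initializing
# the EMA yields different validation weights than before the restart.
print(param, ema)
```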
So is it better if I turn on the restore option? The loss jumps up to 9.8278, which is almost the same as the loss when I restart from scratch (10.5446), so I'm kind of wondering whether restoring from the checkpoint saves me any time...
Looking at the code again, it seems save_checkpoint saves the EMA model when the test loss is lower. For checkpointing to work better, I think the EMA model should be set to the model's values after restore_checkpoint. Could you try changing this part to have the EMA restored? Something like:
```python
for p in z.parameters:
    ema[p.uid].value = p.value
```
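A runnable sketch of that suggestion, using plain-Python stand-ins for the CNTK objects (the `Param`/`Slot` classes and the values here are hypothetical; in the real script `z.parameters` returns CNTK parameters and `ema` maps each parameter's uid to its moving-average copy):

```python
# Sketch of the suggested fix: after restore_from_checkpoint, copy each
# restored parameter value into the corresponding EMA slot so validation
# does not run with stale, unsynchronized EMA weights.

class Param:
    """Stand-in for a CNTK parameter (has a uid and a value)."""
    def __init__(self, uid, value):
        self.uid = uid
        self.value = value

class Slot:
    """Stand-in for the EMA copy of a parameter."""
    def __init__(self, value):
        self.value = value

# Pretend these values just came back from restore_from_checkpoint.
parameters = [Param("p0", 0.5), Param("p1", -1.25)]

# The EMA slots still hold whatever they were initialized to.
ema = {p.uid: Slot(0.0) for p in parameters}

# The fix: synchronize the EMA with the restored model values.
for p in parameters:
    ema[p.uid].value = p.value

print([ema[p.uid].value for p in parameters])
```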
Okay, I have no more questions. Thanks again!
Hi, I have an issue with the BIDAF example scripts in the nikosk/bidaf branch. When I turn on the restore function (--restart False), it is supposed to continue training from the existing model file rather than start from scratch. But according to the console output, there seems to be no difference between switching it on and off.
As you can see above, training is restored at the beginning and the model's loss starts at 6.9981. However, after the first epoch it suddenly jumps up to 9.8. Below is the output when I turn off the restore function (--restart True).
From the above it looks like there is no difference. The only explanation I can think of is that the restore_from_checkpoint function doesn't restore the learning rate (it begins at 2 again). Though the model is restored successfully, the learning rate is still 2, causing the loss to deteriorate after an epoch. I actually have no idea what has happened. Please help me! :(