microsoft / CNTK

Microsoft Cognitive Toolkit (CNTK), an open source deep-learning toolkit
https://docs.microsoft.com/cognitive-toolkit/

Cannot restore from checkpoint #2894

Open noble6emc2 opened 6 years ago

noble6emc2 commented 6 years ago

Hi, I'm having an issue with the Bidaf example scripts in the nikosk/bidaf branch. When I turn on the restore function (--restart False), it is supposed to continue training from the existing model file rather than start from scratch. But according to the console output, there seems to be no difference between switching it on and off.
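
For reference, this is roughly the restore behavior I expect. A minimal sketch around CNTK's Trainer.restore_from_checkpoint, where maybe_restore, model_file, and restart are placeholder names, not the actual Bidaf script:

import os

def maybe_restore(trainer, model_file, restart):
    # With --restart False, this branch should run and reload both the
    # model weights and the trainer state before training continues.
    if not restart and os.path.isfile(model_file):
        trainer.restore_from_checkpoint(model_file)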

Validated 5407 sequences, loss 6.9981, F1 0.1199, EM 0.0523, precision 0.130760, recall 0.127757 hasOverlap 0.150915, start_match 0.066026, end_match 0.104679
Validated 5407 sequences, loss 6.9981, F1 0.1199, EM 0.0523, precision 0.130760, recall 0.127757 hasOverlap 0.150915, start_match 0.066026, end_match 0.104679
Validated 5407 sequences, loss 6.9981, F1 0.1199, EM 0.0523, precision 0.130760, recall 0.127757 hasOverlap 0.150915, start_match 0.066026, end_match 0.104679
Validated 5407 sequences, loss 6.9981, F1 0.1199, EM 0.0523, precision 0.130760, recall 0.127757 hasOverlap 0.150915, start_match 0.066026, end_match 0.104679
NcclComm: initialized
NcclComm: initialized
NcclComm: initialized
NcclComm: initialized
 Minibatch[   1- 500]: loss = 7.146891 * 22750, metric = 0.00% * 22750;
 Minibatch[   1- 500]: loss = 7.146891 * 22750, metric = 0.00% * 22750;
 Minibatch[   1- 500]: loss = 7.146891 * 22750, metric = 0.00% * 22750;
 Minibatch[   1- 500]: loss = 7.146891 * 22750, metric = 0.00% * 22750;
 Minibatch[ 501-1000]: loss = 7.118168 * 22875, metric = 0.00% * 22875;
 Minibatch[ 501-1000]: loss = 7.118168 * 22875, metric = 0.00% * 22875;
 Minibatch[ 501-1000]: loss = 7.118168 * 22875, metric = 0.00% * 22875;
 Minibatch[ 501-1000]: loss = 7.118168 * 22875, metric = 0.00% * 22875;
 Minibatch[1001-1500]: loss = 7.108578 * 22672, metric = 0.00% * 22672;
 Minibatch[1001-1500]: loss = 7.108578 * 22672, metric = 0.00% * 22672;
 Minibatch[1001-1500]: loss = 7.108578 * 22672, metric = 0.00% * 22672;
 Minibatch[1001-1500]: loss = 7.108578 * 22672, metric = 0.00% * 22672;
Finished Epoch[1 of 300]: [Training] loss = 7.121562 * 82336, metric = 0.00% * 82336 1060.075s ( 77.7 samples/s);
Finished Epoch[1 of 300]: [Training] loss = 7.121562 * 82336, metric = 0.00% * 82336 1060.161s ( 77.7 samples/s);
Finished Epoch[1 of 300]: [Training] loss = 7.121562 * 82336, metric = 0.00% * 82336 1059.991s ( 77.7 samples/s);
Finished Epoch[1 of 300]: [Training] loss = 7.121562 * 82336, metric = 0.00% * 82336 1061.306s ( 77.6 samples/s);
Validated 5407 sequences, loss 9.8278, F1 0.0731, EM 0.0080, precision 0.058390, recall 0.272056 hasOverlap 0.284261, start_match 0.038284, end_match 0.045312
Validated 5407 sequences, loss 9.8278, F1 0.0731, EM 0.0080, precision 0.058390, recall 0.272056 hasOverlap 0.284261, start_match 0.038284, end_match 0.045312
Validated 5407 sequences, loss 9.8278, F1 0.0731, EM 0.0080, precision 0.058390, recall 0.272056 hasOverlap 0.284261, start_match 0.038284, end_match 0.045312
Validated 5407 sequences, loss 9.8278, F1 0.0731, EM 0.0080, precision 0.058390, recall 0.272056 hasOverlap 0.284261, start_match 0.038284, end_match 0.045312
 Minibatch[   1- 500]: loss = 7.556977 * 22861, metric = 0.00% * 22861;
 Minibatch[   1- 500]: loss = 7.556977 * 22861, metric = 0.00% * 22861;
 Minibatch[   1- 500]: loss = 7.556977 * 22861, metric = 0.00% * 22861;
 Minibatch[   1- 500]: loss = 7.556977 * 22861, metric = 0.00% * 22861;
 Minibatch[ 501-1000]: loss = 7.279349 * 22781, metric = 0.00% * 22781;
 Minibatch[ 501-1000]: loss = 7.279349 * 22781, metric = 0.00% * 22781;
 Minibatch[ 501-1000]: loss = 7.279349 * 22781, metric = 0.00% * 22781;
 Minibatch[ 501-1000]: loss = 7.279349 * 22781, metric = 0.00% * 22781;
 Minibatch[1001-1500]: loss = 7.213099 * 22784, metric = 0.00% * 22784;
 Minibatch[1001-1500]: loss = 7.213099 * 22784, metric = 0.00% * 22784;
 Minibatch[1001-1500]: loss = 7.213099 * 22784, metric = 0.00% * 22784;
 Minibatch[1001-1500]: loss = 7.213099 * 22784, metric = 0.00% * 22784;
Finished Epoch[2 of 300]: [Training] loss = 7.319536 * 82341, metric = 0.00% * 82341 1059.168s ( 77.7 samples/s);
Finished Epoch[2 of 300]: [Training] loss = 7.319536 * 82341, metric = 0.00% * 82341 1059.168s ( 77.7 samples/s);
Finished Epoch[2 of 300]: [Training] loss = 7.319536 * 82341, metric = 0.00% * 82341 1059.168s ( 77.7 samples/s);
Finished Epoch[2 of 300]: [Training] loss = 7.319536 * 82341, metric = 0.00% * 82341 1059.168s ( 77.7 samples/s);
Validated 5407 sequences, loss 7.1820, F1 0.1103, EM 0.0477, precision 0.119341, recall 0.121212 hasOverlap 0.142778, start_match 0.061217, end_match 0.095247
Validated 5407 sequences, loss 7.1820, F1 0.1103, EM 0.0477, precision 0.119341, recall 0.121212 hasOverlap 0.142778, start_match 0.061217, end_match 0.095247
Validated 5407 sequences, loss 7.1820, F1 0.1103, EM 0.0477, precision 0.119341, recall 0.121212 hasOverlap 0.142778, start_match 0.061217, end_match 0.095247
Validated 5407 sequences, loss 7.1820, F1 0.1103, EM 0.0477, precision 0.119341, recall 0.121212 hasOverlap 0.142778, start_match 0.061217, end_match 0.095247

As you can see above, the training state is restored at the beginning and the model's loss starts at 6.9981. However, after the first epoch it suddenly jumps up to 9.8278. Below is the output when I turn off the restore function (--restart True).

NcclComm: initialized
NcclComm: initialized
NcclComm: initialized
NcclComm: initialized
 Minibatch[   1- 500]: loss = 11.867335 * 22750, metric = 0.00% * 22750;
 Minibatch[   1- 500]: loss = 11.867335 * 22750, metric = 0.00% * 22750;
 Minibatch[   1- 500]: loss = 11.867335 * 22750, metric = 0.00% * 22750;
 Minibatch[   1- 500]: loss = 11.867335 * 22750, metric = 0.00% * 22750;
 Minibatch[ 501-1000]: loss = 9.361975 * 22875, metric = 0.00% * 22875;
 Minibatch[ 501-1000]: loss = 9.361975 * 22875, metric = 0.00% * 22875;
 Minibatch[ 501-1000]: loss = 9.361975 * 22875, metric = 0.00% * 22875;
 Minibatch[ 501-1000]: loss = 9.361975 * 22875, metric = 0.00% * 22875;
 Minibatch[1001-1500]: loss = 7.946887 * 22672, metric = 0.00% * 22672;
 Minibatch[1001-1500]: loss = 7.946887 * 22672, metric = 0.00% * 22672;
 Minibatch[1001-1500]: loss = 7.946887 * 22672, metric = 0.00% * 22672;
 Minibatch[1001-1500]: loss = 7.946887 * 22672, metric = 0.00% * 22672;
Finished Epoch[1 of 300]: [Training] loss = 9.378777 * 82336, metric = 0.00% * 82336 1174.005s ( 70.1 samples/s);
Finished Epoch[1 of 300]: [Training] loss = 9.378777 * 82336, metric = 0.00% * 82336 1174.851s ( 70.1 samples/s);
Finished Epoch[1 of 300]: [Training] loss = 9.378777 * 82336, metric = 0.00% * 82336 1169.157s ( 70.4 samples/s);
Finished Epoch[1 of 300]: [Training] loss = 9.378777 * 82336, metric = 0.00% * 82336 1168.329s ( 70.5 samples/s);
Validated 5407 sequences, loss 10.5446, F1 0.0777, EM 0.0083, precision 0.055601, recall 0.331254 hasOverlap 0.343444, start_match 0.042168, end_match 0.036804
Validated 5407 sequences, loss 10.5446, F1 0.0777, EM 0.0083, precision 0.055601, recall 0.331254 hasOverlap 0.343444, start_match 0.042168, end_match 0.036804
Validated 5407 sequences, loss 10.5446, F1 0.0777, EM 0.0083, precision 0.055601, recall 0.331254 hasOverlap 0.343444, start_match 0.042168, end_match 0.036804
Validated 5407 sequences, loss 10.5446, F1 0.0777, EM 0.0083, precision 0.055601, recall 0.331254 hasOverlap 0.343444, start_match 0.042168, end_match 0.036804
 Minibatch[   1- 500]: loss = 7.583289 * 22861, metric = 0.00% * 22861;
 Minibatch[   1- 500]: loss = 7.583289 * 22861, metric = 0.00% * 22861;
 Minibatch[   1- 500]: loss = 7.583289 * 22861, metric = 0.00% * 22861;
 Minibatch[   1- 500]: loss = 7.583289 * 22861, metric = 0.00% * 22861;
 Minibatch[ 501-1000]: loss = 7.527095 * 22781, metric = 0.00% * 22781;
 Minibatch[ 501-1000]: loss = 7.527095 * 22781, metric = 0.00% * 22781;
 Minibatch[ 501-1000]: loss = 7.527095 * 22781, metric = 0.00% * 22781;
 Minibatch[ 501-1000]: loss = 7.527095 * 22781, metric = 0.00% * 22781;
 Minibatch[1001-1500]: loss = 7.469418 * 22784, metric = 0.00% * 22784;
 Minibatch[1001-1500]: loss = 7.469418 * 22784, metric = 0.00% * 22784;
 Minibatch[1001-1500]: loss = 7.469418 * 22784, metric = 0.00% * 22784;
 Minibatch[1001-1500]: loss = 7.469418 * 22784, metric = 0.00% * 22784;
Finished Epoch[2 of 300]: [Training] loss = 7.507434 * 82341, metric = 0.00% * 82341 1399.446s ( 58.8 samples/s);
Finished Epoch[2 of 300]: [Training] loss = 7.507434 * 82341, metric = 0.00% * 82341 1399.444s ( 58.8 samples/s);
Finished Epoch[2 of 300]: [Training] loss = 7.507434 * 82341, metric = 0.00% * 82341 1399.447s ( 58.8 samples/s);
Finished Epoch[2 of 300]: [Training] loss = 7.507434 * 82341, metric = 0.00% * 82341 1399.447s ( 58.8 samples/s);

From the above it looks like there is no difference. The only explanation I can think of is that the restore_from_checkpoint function doesn't restore the learning rate (it begins at 2). Although the model is successfully restored, the learning rate is still 2, causing the loss to deteriorate after an epoch. I actually have no idea what has happened. Please help me! :(
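
In case it helps, this is the kind of check I had in mind; a hypothetical diagnostic using CNTK's public Trainer/Learner API, not code from the actual script:

def report_state_after_restore(trainer, model_file):
    # Restore the checkpoint, then print the first learner's rate to see
    # whether it is anything other than the fixed value of 2.
    trainer.restore_from_checkpoint(model_file)
    print('learning rate after restore:', trainer.parameter_learners[0].learning_rate())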

ke1337 commented 6 years ago

The learning rate for BIDAF is fixed at 2 because it uses AdaDelta. BTW, for tracking training progress I think the loss on the training data is a better indicator. The initial validation loss after restore being 6.9981 might have something to do with the model's exponential moving average (EMA), which is not saved in the checkpoint. Looking at the training loss and validation loss after a few epochs, resuming from the checkpoint seems to work as expected.
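
To make the EMA point concrete, here is a sketch of the kind of shadow-copy update involved; alpha and the layout of the ema dict are illustrative, not the exact Bidaf code:

def update_ema(z, ema, alpha=0.999):
    # ema maps each parameter's uid to a shadow parameter holding an
    # exponential moving average of the live weights; the averaged model
    # is what gets validated, but it is not saved in the checkpoint.
    for p in z.parameters:
        ema[p.uid].value = alpha * ema[p.uid].value + (1 - alpha) * p.value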

noble6emc2 commented 6 years ago

So would it be better to turn on the restore option? The loss jumps up to 9.8278, which is almost the same as the loss when I restart from scratch (10.5446). So I'm kind of wondering whether restoring from the checkpoint saves me any time...

ke1337 commented 6 years ago

I looked at the code again, and it seems save_checkpoint saves the EMA model when the test loss is lower. For the checkpoint to work better, I think the EMA model should be set to the model's values after restore_from_checkpoint. Could you try changing this part to have the EMA restored? Something like:

# copy the restored model weights into the EMA shadow copies
for p in z.parameters:
    ema[p.uid].value = p.value
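
For context, a sketch of where that loop would sit in the restore path; restore_training_state and checkpoint_path are illustrative names:

def restore_training_state(trainer, z, ema, checkpoint_path):
    # Reload the model weights and trainer state from the checkpoint...
    trainer.restore_from_checkpoint(checkpoint_path)
    # ...then re-seed the EMA shadow copies from the restored weights,
    # since the EMA values themselves are not in the checkpoint.
    for p in z.parameters:
        ema[p.uid].value = p.value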
noble6emc2 commented 6 years ago

Okay, I have no more questions. Thanks again!