Hi, I have recently been studying averaging methods in optimization. I read your paper 'Regularizing and Optimizing LSTM Language Models' and am trying to reproduce your experiment on PTB only. I have a few questions about the source code.
Q1. In your source code main.py, at line 276, you use the condition 「't0' not in optimizer.param_groups[0]」. I cannot understand this condition at all. What does it mean?
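For context, here is my current understanding as a minimal sketch (my own toy code, not taken from main.py): a plain torch.optim.SGD param group has no 't0' key, while torch.optim.ASGD adds one, so the check would distinguish "still on SGD" from "already switched to ASGD".

```python
import torch

params = [torch.nn.Parameter(torch.zeros(2))]

sgd = torch.optim.SGD(params, lr=30)
print('t0' in sgd.param_groups[0])   # False -> still plain SGD

# t0=0 and lambd=0 mirror the switch I see in main.py, if I read it right
asgd = torch.optim.ASGD(params, lr=30, t0=0, lambd=0.)
print('t0' in asgd.param_groups[0])  # True -> averaging is active
```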
Q2. On the same line, there is the condition 「len(best_val_loss) > args.nonmono and val_loss > min(best_val_loss[:-args.nonmono])」. Does this mean "more than args.nonmono validation checks (of logging interval L) have been recorded" and "the current validation loss is worse than the best validation loss recorded more than args.nonmono checks ago"?
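To make my reading concrete, here is a toy example (the values are made up; nonmono = 5 is just the default I believe the paper uses):

```python
nonmono = 5
best_val_loss = [5.0, 4.5, 4.6, 4.6, 4.6, 4.6, 4.6]  # logged val losses
val_loss = 4.6                                        # current check

# best loss from more than `nonmono` checks ago is min([5.0, 4.5]) = 4.5;
# the current 4.6 is worse, so there has been no recent improvement
if len(best_val_loss) > nonmono and val_loss > min(best_val_loss[:-nonmono]):
    print('no improvement for nonmono checks -> switch to ASGD')
```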
Q3. After switching from SGD to ASGD, how does the program keep updating the parameters? Does it keep updating them with SGD-style steps until the last epoch and only return the averaged parameters at the end, or does it overwrite the parameters with the running average every iteration, every epoch, or at some other interval?
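My current guess, as a sketch (assuming torch.optim.ASGD, which, as far as I can tell, keeps the averaged tensor in its per-parameter state under the key 'ax'): every step() still moves the raw parameter, while the average is maintained on the side.

```python
import torch

p = torch.nn.Parameter(torch.ones(3))
opt = torch.optim.ASGD([p], lr=0.1, t0=0, lambd=0.)

for _ in range(3):
    opt.zero_grad()
    loss = (p ** 2).sum()
    loss.backward()
    opt.step()

print(p.data)              # raw iterate, still updated every step
print(opt.state[p]['ax'])  # running average, maintained alongside
```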
Q4. This is related to Q3. After the program switches the optimizer to ASGD, the validation PPL/BPC stops changing, but the training PPL/BPC keeps changing. Why does this happen?
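If it helps pin down what I mean: I believe main.py temporarily swaps the averaged weights in for evaluation and restores the raw ones afterwards. This is my paraphrase from memory, written as a hypothetical helper, not an exact copy of the repo code:

```python
def eval_with_average(model, optimizer, evaluate_fn, data):
    """Evaluate with ASGD's averaged weights, then restore raw weights."""
    tmp = {}
    for prm in model.parameters():
        tmp[prm] = prm.data.clone()
        prm.data = optimizer.state[prm]['ax'].clone()  # use the averages
    val_loss = evaluate_fn(data)
    for prm in model.parameters():
        prm.data = tmp[prm].clone()                    # back to raw weights
    return val_loss
```

If that is right, training metrics reflect the raw weights while validation reflects the slowly moving average, which would explain the difference.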
Q5. Is there any stopping criterion for the averaging in this program? If so, what is it?
Q6. Is there any training stop criterion other than the maximum number of epochs?
Q7. Why did you choose 750 as the maximum number of epochs? Is it just because you thought it was large enough?