In ASGD, which parameters do we use: the averaged ones or the plain SGD ones?

salesforce / awd-lstm-lm

LSTM and QRNN Language Model Toolkit for PyTorch

BSD 3-Clause "New" or "Revised" License


In ASGD, which parameters do we use: the averaged ones or the plain SGD ones? #54

Closed. SongJeongHyun closed this issue 6 years ago

SongJeongHyun commented 6 years ago

I am confused about which parameters are used during training. Are they the ones averaged from time T onward, or just the plain SGD ones?

keskarnitish commented 6 years ago

In ASGD, we maintain a running average of the iterates, starting from the trigger point. However, this running average is not used for computing gradients; training continues on the plain SGD iterates. Only during inference (validation and testing) do we overwrite the parameters with the averaged ones.
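To make the distinction concrete, here is a minimal sketch on a toy one-dimensional problem. It is not the code from this repository: the function `asgd_sketch`, the objective, and the names `trigger`, `avg`, and `n_avg` are all assumptions chosen for illustration. It only shows the pattern described above: gradients are taken at the plain SGD iterate, and the average is kept on the side for evaluation.

```python
import random


def asgd_sketch(steps=200, trigger=100, lr=0.1, seed=0):
    """Toy averaged SGD on f(w) = 0.5 * (w - 3)^2 with noisy gradients.

    Returns (w, avg): the last plain SGD iterate and the running average
    of the iterates collected from the trigger step onward (None if the
    trigger was never reached).
    """
    rng = random.Random(seed)
    w = 0.0      # plain SGD iterate; gradients are always computed here
    avg = None   # running average; undefined before the trigger
    n_avg = 0    # number of iterates folded into the average so far

    for t in range(steps):
        # Noisy gradient evaluated at w, never at avg.
        grad = (w - 3.0) + rng.gauss(0.0, 1.0)
        w -= lr * grad  # ordinary SGD update

        if t >= trigger:
            n_avg += 1
            if avg is None:
                avg = w
            else:
                avg += (w - avg) / n_avg  # incremental mean of the iterates

    return w, avg


w, avg = asgd_sketch()
train_param = w                             # what training keeps updating
eval_param = avg if avg is not None else w  # what inference would use
print(train_param, eval_param)
```

In a real model the same idea applies per parameter tensor: swap the averaged values in before validation or testing, then restore the plain SGD values before training resumes.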

Closing this for now; feel free to reopen if necessary.