SongJeongHyun closed this 6 years ago
In ASGD, we maintain a running average of the iterates starting from the trigger point. However, this running average is not used for computing gradients; training continues on the ordinary SGD iterates. Only during inference (validation and testing) do we overwrite the parameters with the averaged ones.
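To make the distinction concrete, here is a minimal sketch on a toy one-dimensional problem. It is not the repository's code: the loss, the function names (`grad`, `train_asgd`), and the hyperparameter values are all made up for illustration. The point is only that the gradient is always evaluated at the raw iterate `w`, while `avg` is tracked on the side from the trigger step onward and would be swapped in only for evaluation.

```python
def grad(w):
    # Gradient of a toy loss f(w) = (w - 3)^2 (hypothetical, for illustration only)
    return 2.0 * (w - 3.0)


def train_asgd(steps=100, trigger=50, lr=0.1, w0=0.0):
    w = w0       # raw SGD iterate: the only value gradients are computed at
    avg = 0.0    # running average of the iterates, started at the trigger step
    n_avg = 0    # number of iterates included in the average so far
    for t in range(steps):
        w = w - lr * grad(w)            # the update uses w, never avg
        if t >= trigger:
            n_avg += 1
            avg += (w - avg) / n_avg    # incremental mean of post-trigger iterates
    return w, avg


w, avg = train_asgd()
# Training state is w; for validation or testing one would evaluate with avg instead.
```

In a real model the same bookkeeping happens per parameter tensor, and the raw parameters are restored after evaluation so that training resumes from `w`.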
Closing this now; feel free to reopen if necessary.
I am confused about which parameters we use during training. Is it the ones averaged from time T onward, or just the normal SGD ones?