Hi!
Thank you for the very insightful and useful paper.
I was testing it with a custom dataset and I found that using Adam without weight decay, the loss is much more stable than when using SGD with weight decay. However, I'm not sure whether this can affect the learning in any way, or whether it conflicts with any of the theoretical background of the paper.
Also, with SGD I was observing a much faster collapse than with Adam (though I guess this can also depend on the choice of hyperparameters).
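In case it helps, this is roughly what I compared (the model and all hyperparameter values below are placeholders from my own runs, not taken from the paper):

```python
import torch
import torch.nn as nn

# Placeholder model standing in for my actual network.
model = nn.Linear(128, 10)

# 1) SGD with weight decay (the setup from the paper), which collapsed faster for me.
opt_sgd = torch.optim.SGD(model.parameters(), lr=0.05,
                          momentum=0.9, weight_decay=1e-4)

# 2) Adam without weight decay, which gave me a much more stable loss.
opt_adam = torch.optim.Adam(model.parameters(), lr=1e-3,
                            weight_decay=0.0)
```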
I was wondering if you have done any ablation study on the choice of optimizer, or if there is a particular reasoning behind choosing SGD.
Thank you in advance!