Closed: zubairbaqai closed this issue 2 years ago
You can check the code related to https://arxiv.org/abs/2103.13413; they trained with Adam and give details in the paper. Adam is much more aggressive, so lower learning rates should be used for fine-tuning compared to SGD.
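For illustration, here is a minimal sketch of how the two optimizers might be set up for fine-tuning. The model and the exact learning-rate values are placeholders, not taken from the DPT code; the point is only the scale gap typically needed when moving from SGD to Adam.

```python
import torch

# Placeholder model standing in for the actual network being fine-tuned.
model = torch.nn.Linear(10, 1)

# SGD often tolerates a comparatively large learning rate for fine-tuning.
sgd_optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

# Adam adapts per-parameter step sizes and takes more aggressive steps, so a
# much smaller learning rate (assumed here, e.g. 1e-5 to 1e-4) is a safer
# starting point; reusing the SGD-scale rate can keep the loss from decreasing.
adam_optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
```

If switching to Adam with a rate one or two orders of magnitude below the SGD rate still does not help, it may be worth sweeping the learning rate over a few values rather than reusing the SGD setting directly.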
I have experimented with this code in many ways, and I have also introduced custom schedulers, but what I am not able to understand is why SGD works perfectly fine while the Adam optimizer doesn't. I tried several different learning rates, but none of them even start decreasing the loss. I used both SGD and Adam from torch.optim. Any suggestions or help would be appreciated.
Thanks