glample / tagger

Named Entity Recognition Tool
Apache License 2.0

SGD x Adam #79

Closed pvcastro closed 6 years ago

pvcastro commented 6 years ago

Hi there @glample

Do you have any theories as to why, in your implementation, SGD performs better than the Adam optimizer (or any other optimizer, for that matter)? Do you think it's related to not having batch processing implemented?

Thanks!

glample commented 6 years ago

Hi,

In my experience (and I know that many people have made similar observations), SGD is what works best with batch size 1. Batch size 1 is also what works best in general, but people use bigger batch sizes (like 32 or 128) for training speed. With bigger batch sizes, Adam usually gives better results than SGD. This also depends a bit on the task, but for NER I have always observed that SGD was significantly the best.
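For readers who want to see the two setups side by side, here is a minimal NumPy sketch of the update rules being compared. It is not the tagger's own code: the function names (`sgd_step`, `adam_step`, `batch_gradient`, `train`), the learning rates, and the toy linear-regression objective are all assumptions made for illustration. The Adam hyperparameters are the commonly used defaults.

```python
import numpy as np

def sgd_step(w, grad, lr=0.01):
    """Plain SGD: step against the gradient with a fixed learning rate."""
    return w - lr * grad

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: scale the step by running estimates of the gradient moments."""
    m = beta1 * m + (1 - beta1) * grad           # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def batch_gradient(w, X, y):
    """Gradient of the mean squared error of a linear model on one batch."""
    residual = X @ w - y
    return 2.0 * X.T @ residual / len(y)

def train(X, y, batch_size, optimizer, epochs=20, seed=0):
    """Toy training loop; batch_size=1 gives one update per example."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    m, v, t = np.zeros_like(w), np.zeros_like(w), 0
    for _ in range(epochs):
        order = rng.permutation(len(y))          # reshuffle every epoch
        for start in range(0, len(y), batch_size):
            idx = order[start:start + batch_size]
            grad = batch_gradient(w, X[idx], y[idx])
            if optimizer == "sgd":
                w = sgd_step(w, grad)
            else:
                t += 1
                w, m, v = adam_step(w, grad, m, v, t)
    return w
```

Calling `train(X, y, batch_size=1, optimizer="sgd")` and `train(X, y, batch_size=32, optimizer="adam")` reproduces the two configurations discussed above. Note that with the same number of epochs, batch size 1 performs 32 times as many parameter updates as batch size 32. A toy regression like this says nothing about which optimizer is better for NER; it only makes the mechanics concrete.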

pvcastro commented 6 years ago

Ok, thanks! I'm presenting a paper based on your LSTM-CRF architecture at a conference on Portuguese NLP in September ("Portuguese Named Entity Recognition using LSTM-CRF" - http://www.inf.ufrgs.br/propor-2018/accepted-papers/), so I'm getting ready for it. If you have any tips, they would be most welcome! Thanks!