MekkCyber opened this pull request 4 months ago
Hello. Thanks for the PR. One question: the difference in loss here seems very high. In the paper it should be ~0.1, but here the difference is more than 0.5.
I think it has to do with the batch size. In our latest experiment, we trained the 1.58-bit model on 100B tokens, and we reached a loss of 2.8 after 25B tokens with a batch size of 1024:
Implementation of a 1.58-bit LLM with Llama, following the paper & handbook released by Microsoft:
https://github.com/microsoft/unilm/blob/master/bitnet/The-Era-of-1-bit-LLMs__Training_Tips_Code_FAQ.pdf
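For context, the core change is swapping `nn.Linear` for a `BitLinear` layer that fake-quantizes weights to ternary values ({-1, 0, 1}) via absmean scaling and activations to 8 bits via per-token absmax scaling, while the straight-through estimator keeps gradients flowing to the full-precision latent weights. Below is a minimal sketch adapted from the snippet in the linked handbook; note the handbook also applies an RMSNorm to the input before activation quantization, which is omitted here for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def weight_quant(w: torch.Tensor) -> torch.Tensor:
    # Absmean quantization: scale by the mean absolute value,
    # round to the nearest integer, clamp to {-1, 0, 1}, then rescale.
    scale = 1.0 / w.abs().mean().clamp(min=1e-5)
    return (w * scale).round().clamp(-1, 1) / scale

def activation_quant(x: torch.Tensor) -> torch.Tensor:
    # Per-token absmax quantization of activations to 8 bits.
    scale = 127.0 / x.abs().max(dim=-1, keepdim=True).values.clamp(min=1e-5)
    return (x * scale).round().clamp(-128, 127) / scale

class BitLinear(nn.Linear):
    """Drop-in replacement for nn.Linear with simulated 1.58-bit weights."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        # Straight-through estimator: the quantization error is detached,
        # so gradients propagate to the full-precision latent weights.
        x_q = x + (activation_quant(x) - x).detach()
        w_q = w + (weight_quant(w) - w).detach()
        return F.linear(x_q, w_q, self.bias)
```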
Here are the training results on 25B tokens:
cc @NouamaneTazi @xrsrke @thomwolf