huggingface / nanotron

Minimalistic large language model 3D-parallelism training

FEAT: Adding 1.58bit LLMs training architecture in nanotron #180

Open MekkCyber opened 4 months ago

MekkCyber commented 4 months ago

Implementation of 1.58-bit LLM training with Llama, following the paper and handbook released by Microsoft:

https://github.com/microsoft/unilm/blob/master/bitnet/The-Era-of-1-bit-LLMs__Training_Tips_Code_FAQ.pdf
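
For context, here is a minimal sketch of the BitLinear-style layer described in that handbook: weights are quantized to {-1, 0, +1} with an absmean scale, activations to 8 bits with a per-token absmax scale, and a straight-through estimator keeps gradients flowing to the latent full-precision weights. Names and the epsilon value are illustrative; this is not the exact code in this PR.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def weight_quant(w: torch.Tensor) -> torch.Tensor:
    # Absmean quantization to ternary {-1, 0, +1}, rescaled back so the
    # forward pass stays in floating point.
    scale = 1.0 / w.abs().mean().clamp(min=1e-5)
    return (w * scale).round().clamp(-1, 1) / scale


def activation_quant(x: torch.Tensor) -> torch.Tensor:
    # Per-token absmax quantization to 8 bits ([-128, 127]), rescaled back.
    scale = 127.0 / x.abs().max(dim=-1, keepdim=True).values.clamp(min=1e-5)
    return (x * scale).round().clamp(-128, 127) / scale


class BitLinear(nn.Linear):
    """Drop-in replacement for nn.Linear that quantizes weights and
    activations on the fly; the straight-through estimator (STE) lets the
    latent full-precision weights receive gradients."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        x_q = x + (activation_quant(x) - x).detach()  # STE on activations
        w_q = w + (weight_quant(w) - w).detach()      # STE on weights
        return F.linear(x_q, w_q, self.bias)
```

In a Llama-style model, layers like this would replace the nn.Linear projections inside the attention and MLP blocks.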

Here are the training results on 25B tokens:

(figure: loss curve)

cc @NouamaneTazi @xrsrke @thomwolf

xrsrke commented 4 months ago

Hello. Thanks for the PR. One question: the difference in loss here seems very high. In the paper it should be ~0.1, but here the difference is more than 0.5.

MekkCyber commented 4 months ago

I think it has to do with the batch size. In our latest experiment, we trained the 1.58-bit model on 100B tokens, and we reached a loss of 2.8 after 25B tokens with a batch size of 1024:

(figure: lr)
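
As a back-of-the-envelope check on what a batch size of 1024 means in tokens per step (the 2048 sequence length below is an assumption for illustration, not a value stated in this thread):

```python
# Rough tokens-per-step arithmetic behind the batch-size comment above.
batch_size = 1024   # sequences per optimization step
seq_len = 2048      # assumed sequence length, not taken from the PR config
tokens_per_step = batch_size * seq_len            # ~2.1M tokens per step
steps_for_25b = 25_000_000_000 // tokens_per_step  # ~12k steps to see 25B tokens
print(f"{tokens_per_step:,} tokens/step -> ~{steps_for_25b:,} steps for 25B tokens")
```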