kyegomez / BitNet

Implementation of "BitNet: Scaling 1-bit Transformers for Large Language Models" in PyTorch

BitNet model performs worse than Base Transformer #55

Open johanssontan opened 4 months ago

johanssontan commented 4 months ago

I used train.py to train both the BitNet model and the base Transformer model and compared them. I found that BitNet consumes more time and memory while achieving a lower loss than the base model, which is not consistent with what the BitNet paper claims. What could be the reason for this?
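For reference, here is a minimal sketch of how the time/memory side of such a comparison could be instrumented. `BitNetTransformer`, `Transformer`, the input batch, and the placeholder loss are all stand-ins, not the actual code in train.py:

```python
import time
import torch

def benchmark(model, batch, steps=10):
    """Measure average step time and peak GPU memory over a few training steps."""
    model = model.cuda().train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(steps):
        optimizer.zero_grad()
        loss = model(batch.cuda()).mean()  # placeholder loss for illustration
        loss.backward()
        optimizer.step()
    torch.cuda.synchronize()
    elapsed = time.time() - start
    peak_mb = torch.cuda.max_memory_allocated() / 2**20
    return elapsed / steps, peak_mb

# Example usage (hypothetical models and batch):
# t_bit, m_bit = benchmark(BitNetTransformer(), tokens)
# t_base, m_base = benchmark(Transformer(), tokens)
```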


NewJerseyStyle commented 2 months ago

Not sure if it is a bug...

In my understanding of BitNet, training will cost more memory because both the int1 weights and the fp16 latent weights (if you use mixed-precision training, otherwise fp32) are kept around. It should only become slim and fast at inference. 🤔
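Roughly, a 1-bit layer during training looks like the sketch below: the full-precision latent weights stay in memory for the optimizer, and a binarized copy is materialized on every forward pass, so both tensors (plus their gradients) coexist during training. The class and names here are illustrative, not the repo's actual BitLinear:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinearSketch(nn.Module):
    """Toy 1-bit linear layer: full-precision latent weights are kept for the
    optimizer, and a binarized copy is built on each forward pass."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)

    def forward(self, x):
        w = self.weight
        # Binarize around the mean; scale by the mean absolute deviation.
        alpha = w.mean()
        beta = (w - alpha).abs().mean()
        w_bin = torch.sign(w - alpha) * beta
        # Straight-through estimator: forward uses the binarized weights,
        # backward passes gradients to the full-precision latent weights.
        w_q = w + (w_bin - w).detach()
        return F.linear(x, w_q)
```

At inference time only the binarized weights (and per-tensor scales) need to be stored, which is where the size and speed savings show up.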

github-actions[bot] commented 6 days ago

Stale issue message