That seems like a general technique and not particular to an architecture.
Yes, I just thought it might be relevant to Mamba's implementation, and given all the excitement, maybe something worth playing with in the mamba reference code. :shrug: Still, if it's not relevant or interesting, I'll close this (or feel free to do so!)
Closing to keep the issue queue down. Congrats on the amazing work!
/r/locallama, Hacker News, etc. are all buzzing today about this BitNet b1.58 paper, which claims extraordinary gains in model size efficiency, energy usage, and inference speed with results similar to full-precision 16-bit weights... by using ternary values {-1, 0, 1} during training.
Didn't see anything mentioning BitNet b1.58 here yet, so... might Mamba-based LLMs also benefit from anything proposed in the paper, assuming it works as advertised?
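For context, the core trick in the paper (as I understand it) is absmean quantization of the linear-layer weights to {-1, 0, 1}, with a straight-through estimator so training still updates full-precision weights. Here's a minimal PyTorch sketch of that idea; the names (`absmean_ternary_quantize`, `BitLinear158`) are just mine, and it skips the paper's 8-bit activation quantization:

```python
import torch

def absmean_ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    """Scale by the mean absolute value, then round and clip to {-1, 0, 1},
    roughly following the paper's absmean weight quantization."""
    scale = w.abs().mean().clamp(min=eps)
    w_q = (w / scale).round().clamp(-1, 1)
    return w_q, scale

class BitLinear158(torch.nn.Linear):
    """Hypothetical drop-in for nn.Linear: the forward pass uses ternary
    weights, while gradients flow to the full-precision weights via a
    straight-through estimator."""
    def forward(self, x):
        w_q, scale = absmean_ternary_quantize(self.weight)
        # straight-through estimator: forward sees quantized weights,
        # backward sees the full-precision weights
        w = self.weight + (w_q * scale - self.weight).detach()
        return torch.nn.functional.linear(x, w, self.bias)
```

If it works as advertised, one could imagine swapping something like this in for the projection layers in the Mamba block and seeing whether quality holds up, but that's pure speculation on my part.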