Open · rkinas opened this issue 5 months ago
Can't speak for the author, but the general idea is that ternary is better since what's being built is really a directed acyclic knowledge graph. Ternary weights can be represented in about 1.58 bits each, since a symbol drawn from {-1, 0, 1} carries log2(3) ≈ 1.585 bits of information.
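For what it's worth, the "1.58" in the name is just the information content of a uniform three-symbol alphabet, not a claim about any particular storage format:

$$
\log_2 3 \;=\; \frac{\ln 3}{\ln 2} \;\approx\; 1.585 \ \text{bits per ternary weight}
$$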
(I could be wrong, but anyway.)
According to https://github.com/microsoft/unilm/blob/master/bitnet/The-Era-of-1-bit-LLMs__Training_Tips_Code_FAQ.pdf:
During training, x and w are scaled, quantized, and then rescaled back to the original scale (probably because of how the straight-through estimator is used).
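A minimal PyTorch sketch of that training-time path, following the recipe described in the linked FAQ PDF (class and function names here are illustrative, not from any particular repo; RMSNorm and other details are omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def activation_quant(x):
    # Per-token absmax quantization of activations to 8-bit levels,
    # then rescaling back so the rest of the network still sees floats.
    scale = 127.0 / x.abs().max(dim=-1, keepdim=True).values.clamp(min=1e-5)
    return (x * scale).round().clamp(-128, 127) / scale

def weight_quant(w):
    # Absmean scaling, rounding to {-1, 0, 1}, then rescaling back
    # to the original scale.
    scale = 1.0 / w.abs().mean().clamp(min=1e-5)
    return (w * scale).round().clamp(-1, 1) / scale

class BitLinear158(nn.Linear):
    def forward(self, x):
        w = self.weight
        # Straight-through estimator: the forward pass uses the quantized
        # values, while gradients flow to the latent full-precision x and w.
        x_q = x + (activation_quant(x) - x).detach()
        w_q = w + (weight_quant(w) - w).detach()
        return F.linear(x_q, w_q, self.bias)
```

Note that the parameter tensor kept by the optimizer is the latent full-precision weight; the ternary values only exist transiently inside the forward pass.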
During inference, rescaling is applied after the linear operation (p. 6, fig. 4). Weights are saved in the form {-1, 0, 1} plus a separate scaling coefficient.
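Correspondingly, a rough sketch of that inference-time view (function names are mine; a real kernel would also quantize the activations and fold in their scale, which I'm leaving out here):

```python
import torch

@torch.no_grad()
def export_ternary(w: torch.Tensor):
    # Offline step: store the weights as {-1, 0, 1} int8 plus one scale factor.
    scale = w.abs().mean().clamp(min=1e-5)
    w_ternary = (w / scale).round().clamp(-1, 1).to(torch.int8)
    return w_ternary, scale

@torch.no_grad()
def ternary_linear(x: torch.Tensor, w_ternary: torch.Tensor, scale: torch.Tensor):
    # The matmul only ever sees -1/0/1 values; the scaling coefficient is
    # applied once, after the linear operation, as described above.
    return (x @ w_ternary.t().to(x.dtype)) * scale
```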
Hi, thank you for providing a 1.58-bit implementation. Nice work! I looked through many bitnet1.58 implementations and noticed that they all use the method suggested in "The Era of 1-bit LLMs: Training Tips, Code and FAQ". The weights of the models currently trained according to this recipe are not numbers from the set {-1, 0, 1} but values in the interval (0, 1). Is this the way it should be?
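In case it is useful for checking, here is a minimal way to inspect what a trained checkpoint actually stores versus what the forward pass computes with (the file name and parameter key are placeholders, not from any specific repo):

```python
import torch

# Placeholder path and key; point these at an actual BitNet-style checkpoint.
state = torch.load("checkpoint.pt", map_location="cpu")
w = state["model.layers.0.self_attn.q_proj.weight"]

# The stored (latent) weights are ordinary floats...
print(torch.unique(w).numel(), "distinct stored values")

# ...and only become {-1, 0, 1} after the absmean quantization step.
scale = 1.0 / w.abs().mean().clamp(min=1e-5)
print(torch.unique((w * scale).round().clamp(-1, 1)))
```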