microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI

1.58bitnet - is it {-1,0,1}? #1502

Open rkinas opened 5 months ago

rkinas commented 5 months ago

Hi, thank you for providing the 1.58-bit implementation. Nice work! I looked through many BitNet b1.58 implementations and noticed that they all use the method suggested in "The Era of 1-bit LLMs: Training Tips, Code and FAQ". The weights of models currently trained according to this recipe are not numbers from the set {-1, 0, 1} but values in the interval (0, 1). Is this the way it should be?

  1. The formula describing the quantization of weights ("The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits").
  2. Implementation proposal ("The Era of 1-bit LLMs: Training Tips, Code and FAQ").
  3. Weights quantization test.
  4. Model during training.

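For context, here is a minimal sketch of the absmean weight quantization that formula (point 1) describes, in plain PyTorch. The function name `weight_quant` and the `eps` handling are illustrative, not taken from the repository:

```python
import torch

def weight_quant(w: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Absmean quantization: scale by the mean absolute value, round and
    clip each entry to {-1, 0, 1}, then rescale back to the original range."""
    scale = 1.0 / w.abs().mean().clamp(min=eps)
    return (w * scale).round().clamp(-1, 1) / scale
```

Note that in this form the latent weight `w` itself is never overwritten; the ternary values exist only in the function's return value.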

sdmorrey commented 4 months ago

Can't speak for the author, but the general idea is that ternary is better since what's being built is really a directed acyclic knowledge graph. Ternary can be represented in 1.58 bits since you only need one bit for 1 and 0, plus an additional sign bit on the rare occasion you have a -1.
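As a side note on the 1.58 figure itself: it is usually read as the information content of a three-valued symbol, log2(3) ≈ 1.58 bits per weight, which is easy to check:

```python
import math

# Each weight takes one of three values {-1, 0, 1}, so the information
# content per weight is log2(3) bits.
print(math.log2(3))  # ~1.585
```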

CC0000000 commented 2 months ago

(I could be wrong, but anyway)

According to https://github.com/microsoft/unilm/blob/master/bitnet/The-Era-of-1-bit-LLMs__Training_Tips_Code_FAQ.pdf

During training, x and w are scaled, quantized, and then rescaled back to the original scale (probably because of how the straight-through estimator is used).
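A rough sketch of what that training-time path could look like in PyTorch, loosely following the style of the snippet in the Training Tips PDF (the helper names are paraphrased and the RMSNorm step before activation quantization is omitted for brevity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def activation_quant(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # Per-token absmax quantization of activations to 8 bits, then rescale back.
    scale = 127.0 / x.abs().max(dim=-1, keepdim=True).values.clamp(min=eps)
    return (x * scale).round().clamp(-128, 127) / scale

def weight_quant(w: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # Absmean quantization of weights to {-1, 0, 1}, then rescale back.
    scale = 1.0 / w.abs().mean().clamp(min=eps)
    return (w * scale).round().clamp(-1, 1) / scale

class BitLinear(nn.Linear):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        # Straight-through estimator: the forward pass uses the quantized
        # values, while the backward pass sees the identity, so gradients
        # flow into the full-precision latent x and w.
        x_q = x + (activation_quant(x) - x).detach()
        w_q = w + (weight_quant(w) - w).detach()
        return F.linear(x_q, w_q, self.bias)
```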

During inference, rescaling is applied after the linear operation (p. 6, Fig. 4). Weights are saved in the form of {-1, 0, 1} plus a separate scaling coefficient.
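And a corresponding sketch of the inference-time form that description suggests: ternary int8 values plus one separate scale, with the rescaling applied after the matmul. `pack_ternary` and `bitlinear_infer` are illustrative names, and activation quantization is left out:

```python
import torch
import torch.nn.functional as F

def pack_ternary(w: torch.Tensor, eps: float = 1e-5):
    """Offline step: turn a trained full-precision weight into int8 values
    in {-1, 0, 1} plus a single per-tensor scaling coefficient."""
    scale = w.abs().mean().clamp(min=eps)
    w_ternary = (w / scale).round().clamp(-1, 1).to(torch.int8)
    return w_ternary, scale

def bitlinear_infer(x: torch.Tensor, w_ternary: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # The matmul runs on the {-1, 0, 1} values; the scaling coefficient is
    # applied once, after the linear operation.
    return F.linear(x, w_ternary.to(x.dtype)) * scale
```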