hahnyuan / PB-LLM

PB-LLM: Partially Binarized Large Language Models
MIT License

Compression Ratio #1

Open NicoNico6 opened 10 months ago

NicoNico6 commented 10 months ago

Really solid work!

May I ask what the actual compressed model size is, considering that this is a partial binarization approach and some 8-bit parameters remain inside each weight matrix? Can the model be compressed using techniques like bit packing?

hahnyuan commented 7 months ago

Apologies for the delayed response.

Regarding your question: for the 1-bit weights, we indeed use a packed format. As for the 8-bit parameters within each weight matrix, they are sparse with a low density, so conventional formats like CSR may not be the most suitable. We are currently exploring a modified run-length encoding (RLE) to achieve an efficient compression ratio for this 8-bit sparse data.
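(For illustration only, here is a minimal NumPy sketch of the bit-packing idea for the binarized part; the helper names are hypothetical and this is not the packing code used in the repo.)

```python
import numpy as np

# Hypothetical helpers illustrating a packed 1-bit format:
# 8 binarized weights are stored per uint8 byte.
def pack_binary_weights(w_sign: np.ndarray) -> np.ndarray:
    bits = (w_sign > 0).astype(np.uint8)   # map {-1, +1} -> {0, 1}
    return np.packbits(bits.flatten())     # 8 weights per byte

def unpack_binary_weights(packed: np.ndarray, shape) -> np.ndarray:
    n = int(np.prod(shape))
    bits = np.unpackbits(packed)[:n].reshape(shape)
    return bits.astype(np.int8) * 2 - 1    # map {0, 1} back to {-1, +1}

w = np.where(np.random.randn(4, 8) >= 0, 1, -1).astype(np.int8)
packed = pack_binary_weights(w)            # 32 weights -> 4 bytes
assert np.array_equal(unpack_binary_weights(packed, w.shape), w)
```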

In our modified RLE, each non-zero 8-bit value is represented by a pair: the 8-bit value itself and the count of consecutive zeros preceding it. For example, the original sequence 0 0 0 0 0 0 5 0 0 1 becomes the RLE representation (5, 6) (1, 2).

In terms of storage cost, each (value, count) pair requires about 12 bits: 8 bits for the value and 4 bits for the count.
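As a concrete illustration of this modified RLE, here is a small Python sketch; the helper names are hypothetical, and it assumes every zero run fits in the 4-bit count field (longer runs would need an escape or split mechanism, which is omitted here).

```python
from typing import List, Tuple

def rle_encode(seq: List[int]) -> List[Tuple[int, int]]:
    """Store each non-zero value together with the number of zeros
    preceding it; sketch only, assumes each zero run fits in 4 bits."""
    pairs, zeros = [], 0
    for v in seq:
        if v == 0:
            zeros += 1
        else:
            assert zeros < 16, "4-bit count field would overflow here"
            pairs.append((v, zeros))          # (8-bit value, zero-run count)
            zeros = 0
    return pairs

def rle_decode(pairs: List[Tuple[int, int]], length: int) -> List[int]:
    out = []
    for value, zeros in pairs:
        out.extend([0] * zeros)
        out.append(value)
    out.extend([0] * (length - len(out)))     # trailing zeros need no pair
    return out

seq = [0, 0, 0, 0, 0, 0, 5, 0, 0, 1]
pairs = rle_encode(seq)                       # [(5, 6), (1, 2)]
assert rle_decode(pairs, len(seq)) == seq
print(f"{12 * len(pairs)} bits for the sparse part of {len(seq)} weights")
```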

For a 10% outlier ratio, if we quantize the outlier weights to 8 bits, the average cost per weight is 1 + (8 + 4) × 0.1 = 2.2 bits (compression ratio = 1 - 2.2/16 ≈ 86.3%). If we quantize the outliers to 4 bits, the average cost drops to 1 + (4 + 4) × 0.1 = 1.8 bits (compression ratio = 1 - 1.8/16 ≈ 88.8%).
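As a quick sanity check of that arithmetic (assuming an FP16 baseline, 1 bit per weight for the packed binary part, and a 4-bit count per RLE pair), here is a small helper; the function names are mine, not the repo's:

```python
def avg_bits_per_weight(outlier_ratio: float, outlier_bits: int,
                        count_bits: int = 4, binary_bits: int = 1) -> float:
    # Every weight costs 1 bit in the packed binary matrix; each outlier
    # additionally costs (outlier_bits + count_bits) bits in the RLE stream.
    return binary_bits + (outlier_bits + count_bits) * outlier_ratio

def compression_ratio(avg_bits: float, baseline_bits: int = 16) -> float:
    return 1 - avg_bits / baseline_bits

for b in (8, 4):
    avg = avg_bits_per_weight(0.1, b)
    print(f"{b}-bit outliers: {avg:.1f} bits/weight, "
          f"compression ratio {compression_ratio(avg):.2%}")
# 8-bit outliers: 2.2 bits/weight, compression ratio 86.25%
# 4-bit outliers: 1.8 bits/weight, compression ratio 88.75%
```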