For the record, here is how this PR improves `IQ1_BN`/`IQ2_BN` performance for PP (prompt processing):
model | backend | threads | test | t/s (main) | t/s (PR) | Speedup |
---|---|---|---|---|---|---|
bitnet 3B IQ2_BN | Zen4 | 16 | pp512 | 515.59 ± 2.05 | 606.56 ± 6.29 | 1.176 |
bitnet 3B IQ1_BN | Zen4 | 16 | pp512 | 411.92 ± 0.30 | 571.68 ± 2.42 | 1.388 |
bitnet 3B IQ2_BN | AVX2 | 32 | pp512 | 637.75 ± 0.92 | 772.61 ± 1.27 | 1.211 |
bitnet 3B IQ1_BN | AVX2 | 32 | pp512 | 517.17 ± 0.54 | 650.72 ± 6.02 | 1.258 |
bitnet 3B IQ2_BN | NEON | 8 | pp512 | 242.97 ± 0.60 | 247.82 ± 0.68 | 1.020 |
bitnet 3B IQ1_BN | NEON | 8 | pp512 | 207.05 ± 0.48 | 211.21 ± 0.65 | 1.020 |
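The numbers above are in `llama-bench` format; a command along these lines (model file name is hypothetical) runs the pp512 test used here:

```sh
# -p 512 runs the pp512 prompt-processing test, -n 0 skips token generation,
# -t sets the thread count (16 for the Zen4 rows above)
./llama-bench -m bitnet-3b-iq2_bn.gguf -p 512 -n 0 -t 16
```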
For the Bitnet-1.58b ternary models I had added `IQ1_BN` (1.625 bpw) and `IQ2_BN` (2.0 bpw) quants, but for TriLM I had only added `IQ2_TN` (2.0625 bpw). This PR fills the gap by adding the corresponding 1.6875 bpw quantization type `IQ1_TN`.
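For reference, my reading of the bpw bookkeeping, assuming the extra 0.0625 bpw over the `*_BN` types comes from an fp16 scale amortized over 256 weights:

$$1.625 + \tfrac{16}{256} = 1.6875 \ \text{bpw}, \qquad 2.0 + \tfrac{16}{256} = 2.0625 \ \text{bpw}$$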
The matrix multiplication implementation simply reuses the existing `IQ1_BN` implementation. We just need to add the multiplication with the row scale at the end of the vector dot product between a row in the left matrix and a column in the right matrix (in `IQ1_BN` there are no scales in the quantized data, and the scale is applied separately via a `ggml_scale` operation).
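To make the idea concrete, here is a toy scalar sketch of the two dot-product flavours. This is my illustration, not the actual SIMD kernels in this repo: the function names are made up, and the ternary values are shown as a plain `int8_t` array instead of the real 1.625 bpw packing.

```cpp
#include <cstdint>

// IQ1_BN-style dot product: the quantized data carries no scale, so the raw
// sum is returned and the tensor-level scale is applied later via ggml_scale.
static float vec_dot_ternary(int n, const int8_t * x, const float * y) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) sum += x[i] * y[i];   // x[i] in {-1, 0, +1}
    return sum;
}

// IQ1_TN-style dot product: identical inner loop, but the row scale stored
// with the quantized row is multiplied in once at the end, so no separate
// ggml_scale pass is needed.
static float vec_dot_ternary_scaled(int n, const int8_t * x, float row_scale, const float * y) {
    return row_scale * vec_dot_ternary(n, x, y);
}
```

Since the scale is applied once per row-column pair, it adds negligible cost compared to the O(n) inner loop.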
While adding `IQ1_TN` to the `IQ1_BN` implementation, I noticed an optimization opportunity. As a result, this PR also improves `IQ1_BN` and `IQ2_BN` performance.
As PR-8151 has now been merged in mainline `llama.cpp`, I was curious to compare `IQ1_TN` with the corresponding `TQ1_0`, and `IQ2_TN` with the corresponding `TQ2_0`, in `llama.cpp`. The CPUs used in the comparisons below are Ryzen-7950X (Zen4), Ryzen-5975WX (AVX2), and M2-Max (NEON).
**`IQ1_TN` vs `TQ1_0`, 4B TriLM model**

**`IQ2_TN` vs `TQ2_0`, 4B TriLM model**
As `IQ2_BN` PP performance is better than `IQ1_BN`, these tables indicate that my `IQ2_TN` implementation on Zen4/AVX2 is likely not optimal. There also seems to be a bottleneck somewhere for TG with more than 8 threads that I need to look into.