This is great!
> ARM_NEON: Here @compilade's implementation does not do very well

Yeah, I did not particularly optimize the ARM_NEON implementation for recent ARM CPUs (yet), especially since I did not use `vdotq_s32` (although I was planning to), because the Cortex-A72 and the Cortex-A53 in the CPUs of my test machines do not support it and were faster with `vmlal_s8` than with `ggml_vdotq_s32`.
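For context, the two NEON accumulation paths being compared look roughly like this; a minimal sketch with illustrative names, ignoring the 2-bit unpacking that the real kernels interleave with it.

```c
#include <arm_neon.h>

// With the ARMv8.2 dot-product extension (what ggml_vdotq_s32 maps to when it is available):
// one instruction accumulates 16 int8*int8 products into four int32 lanes.
static inline int32x4_t dot_path(int32x4_t acc, int8x16_t vx, int8x16_t vy) {
    return vdotq_s32(acc, vx, vy);
}

// Without it (e.g. Cortex-A72 / Cortex-A53): widening multiply-accumulate into int16 lanes,
// which must be widened to int32 periodically to avoid overflow.
static inline int16x8_t mlal_path(int16x8_t acc, int8x16_t vx, int8x16_t vy) {
    acc = vmlal_s8(acc, vget_low_s8(vx),  vget_low_s8(vy));
    acc = vmlal_s8(acc, vget_high_s8(vx), vget_high_s8(vy));
    return acc;
}
```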
I see `IQ2_TN` mostly has the same format as `TQ2_0`, except that the float16 scale is before the packed weights instead of after. But if I understand it correctly, both store the packed values in the same order and pack them in the same way (same offset). Does that mean the Metal and CUDA implementations for `IQ2_TN` would also work for `TQ2_0`?
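For reference, a rough sketch of the two block layouts as described above; field names are illustrative and may not match the actual code.

```c
#include <stdint.h>

#define QK_K 256
typedef uint16_t ggml_half; // fp16 bit pattern, as in ggml-common.h

// TQ2_0: packed weights first, float16 scale after.
typedef struct {
    uint8_t   qs[QK_K/4]; // 2 bits per weight -> 64 bytes per block of 256
    ggml_half d;          // float16 block scale
} block_tq2_0;

// IQ2_TN: float16 scale first, then the packed weights (same packing).
typedef struct {
    ggml_half d;          // float16 block scale
    uint8_t   qs[QK_K/4]; // 2 bits per weight, same order/offsets as above
} block_iq2_tn;
```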
Do you have plans for `IQ2_TN` to replace `TQ2_0`, or is this something done in parallel to see how fast it can get with better matrix multiplication than lots of dot products?
Either way, I really appreciate your work on this. This was a pleasant surprise to see in my notifications.
> Does that mean the Metal and CUDA implementations for `IQ2_TN` would also work for `TQ2_0`?

> Do you have plans for `IQ2_TN` to replace `TQ2_0`, or is this something done in parallel to see how fast it can get with better matrix multiplication than lots of dot products?
I'm not planning on contributing any of this to the official `llama.cpp` repository. Just hacking for fun.
The Metal implementation should be just a copy/paste operation (replacing `IQ2_TN` with `TQ2_0`, etc.) to add it to the Metal backend in `llama.cpp`.

For CUDA I did factor out the dot products of the new quants into a separate file, to avoid having to have a coffee each time I touch something there and `mmq.cu` / `mmvq.cu` need to be rebuilt. There are only so many coffees I can have in a day. Hence, you will need to rearrange things a bit. But other than that, yes, it should just work.
They have abandoned the `Q1_3` and `Q2_2` quants in PR-8151 in `llama.cpp`, and have moved on to `TQ1_0` and `TQ2_0`. Like k-quants, these use blocks of 256 weights and utilize `Q8_K` for quantized dot products on the CPU. This removes support for Bitnet b1.58 (unless one adds padding to a multiple of 256), so they are now focussing on the TriLM models. Unlike the previous `Q1_3` and `Q2_2`, where the quantized data only holds the ternary -1/0/+1 values and the tensor scale is added via a separate `ggml_scale` operation, the new `TQ1_0` and `TQ2_0` include a scale in each block of 256. This basically wastes 0.0625 bpw, but has the advantage that one can simply reuse the standard `llama.cpp` computation graphs.
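For reference, the 0.0625 bpw overhead is just the per-block scale: a float16 scale is 16 bits per block of 256 weights, i.e. 16/256 = 0.0625 bpw, which together with the 2 bits per weight gives the 2.0625 bpw figure used below.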
Based on the PP-512 and TG-128 figures posted in PR-8151, `TQ2_0` performance is much better than the earlier `Q2_2` attempt, so I became curious to see how @compilade's implementation compares to what we can do with `iqk_mul_mat` in this repo, and here we are.
The PR adds `IQ2_TN` (`TN` as `TriNet`). Implementation for `Zen4`, `AVX2`, `ARM_NEON`, `CUDA` and `Metal` is provided.

Let's look at some performance comparisons. We will focus on the largest TriLM model, which has ~4B parameters. Quantized with 2.0625 bpw the model size is 1.08 GiB.
AVX2

`AVX2` was tested on a 32-core Ryzen-5975WX CPU. Not everybody has a 32-core CPU handy, so I have added performance values for fewer threads.

I would say @compilade has done remarkably well here, coming to within ~14% for PP performance. Although, for fewer than 32 threads, the performance gap increases to about ~23%. My guess is that the 23% is the more realistic value for the performance difference, and as the number of threads increases we see more of the effect of `ggml` inefficiencies (thread synchronization, operations that do not scale with the number of threads, etc.), which then narrows the gap. Nevertheless, even 23% is remarkable considering the performance differences for other quants (see main page). For TG the performance is the same for 1 thread (not much one can do there; the bit arrangement is so simple that there aren't many different ways to implement it effectively with `AVX2`). The implementation in this PR then becomes faster, I guess due to better cache utilization. But this better per-thread performance leads to too much memory bandwidth contention above 8 threads, so `TQ2_0` is able to arrive at a slightly better performance at 16 threads.
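To illustrate why there is little room for creativity in the 1-thread TG case: unpacking 2-bit values with AVX2 essentially comes down to shifts and masks. A minimal sketch, assuming four values per byte; the actual packing order used by `TQ2_0`/`IQ2_TN` may differ.

```c
#include <stdint.h>
#include <immintrin.h>

// Unpack 32 bytes holding 128 two-bit values into four vectors of values in 0..3.
// Subtracting 1 afterwards gives the ternary -1/0/+1 weights.
static inline void unpack_2bit_avx2(const uint8_t *qs, __m256i v[4]) {
    const __m256i q  = _mm256_loadu_si256((const __m256i *)qs);
    const __m256i m3 = _mm256_set1_epi8(3);
    v[0] = _mm256_and_si256(q, m3);
    v[1] = _mm256_and_si256(_mm256_srli_epi16(q, 2), m3);
    v[2] = _mm256_and_si256(_mm256_srli_epi16(q, 4), m3);
    v[3] = _mm256_and_si256(_mm256_srli_epi16(q, 6), m3);
}
```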
Zen4

I have also tested on a `Zen4` CPU (16-core Ryzen-7950X). `Zen4` implements some of the `AVX512` instruction set, and there is a dedicated implementation for that for `IQ2_TN`. The `TQ2_0` quants are implemented in pure `AVX2`, so one might think the performance comparison is unfair. But, at least as far as I know, the `Zen4` core implements 512-bit instructions as two separate 256-bit instructions in hardware, so one does not gain much by operating on 512-bit wide vectors. The main advantage comes from having more vector registers (32 vs 16 on `AVX2`), but given the way matrix multiplications are done in `ggml` (a series of vector x vector dot products), one cannot really take advantage of that. Anyway, here is the performance comparison on the Ryzen-7950X CPU.

Here the PP performance gap is more significant at around 66%, reducing to 55% at 16 threads. If we look at TG performance for 1 thread, the ~7% performance difference comes from using `_mm512_dpbusd_epi32`, which is a fused multiply-add operation, whereas on `AVX2` one needs to use `_mm256_maddubs_epi16` followed by `_mm256_add_epi16` to accumulate the result. The TG performance gap then widens due to better cache utilization, and then decreases towards zero with increasing numbers of threads as the memory bandwidth is saturated. The 66% PP performance gap is hence the combination of the ~7% due to the use of a fused multiply-add, and ~60% due to better utilization of vector registers while performing a multiplication of a row in the left matrix with several columns in the right matrix, where the unpacked quants for a block are held in vector registers.
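Roughly, the two accumulation paths mentioned above differ as follows; a minimal sketch with illustrative names, ignoring unpacking and sign handling.

```c
#include <immintrin.h>

// Zen4 / AVX512-VNNI: one fused instruction accumulating u8*i8 dot products into int32 lanes.
static inline __m512i dot_vnni(__m512i acc, __m512i x_u8, __m512i y_i8) {
    return _mm512_dpbusd_epi32(acc, x_u8, y_i8);
}

// Plain AVX2: two instructions, accumulating in int16 lanes
// (which need periodic widening to int32 to avoid overflow).
static inline __m256i dot_avx2(__m256i acc16, __m256i x_u8, __m256i y_i8) {
    const __m256i prod16 = _mm256_maddubs_epi16(x_u8, y_i8); // u8*i8, pairwise sums -> i16
    return _mm256_add_epi16(acc16, prod16);
}
```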
ARM_NEON

Here @compilade's implementation does not do very well, at least not on the M2-Max laptop where I have tested. But perhaps this is just due to the fact that @compilade used a Cortex-A72 CPU in their development, and that CPU may well behave very differently from the M2-Max.
CUDA and Metal
There is no GPU implementation in PR-8151, so here are just the performance values for this PR. `CUDA` is tested on an RTX-4080, `Metal` on a 30-core M2-Max GPU.

I have not bothered implementing the MMQ stuff, so CUDA PP performance is via dequantize and cuBLAS gemm.