ikawrakow closed 2 months ago
With this change we get `PP512 = 494 t/s` (using flash attention), up from `468 t/s` (~5% improvement), running on a Ryzen-7950X CPU.

Compared to the initial `IQ2_TN` PR #13, the cumulative improvement is 15%.

Compared to `TQ2_0` in `llama.cpp`, which has now been merged, we are now 80% faster.
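As a quick sanity check of the percentages, the sketch below recomputes the gain of this change from the two quoted throughputs and back-calculates the baselines implied by the 15% and 80% figures. The helper names are hypothetical, and the "implied" baselines are derived from rounded percentages, so they are rough estimates rather than measured numbers.

```python
def speedup_percent(new_tps: float, old_tps: float) -> float:
    """Relative throughput gain of new over old, in percent."""
    return (new_tps / old_tps - 1.0) * 100.0


def implied_baseline(new_tps: float, gain_percent: float) -> float:
    """Throughput a baseline would have had if new_tps is gain_percent faster.

    Only approximate when gain_percent is itself a rounded figure.
    """
    return new_tps / (1.0 + gain_percent / 100.0)


pp512_new = 494.0  # t/s, this change (flash attention, Ryzen-7950X)
pp512_old = 468.0  # t/s, before this change

print(f"this change: {speedup_percent(pp512_new, pp512_old):.1f}%")
# Back-calculated from the rounded 15% and 80% claims; not measurements.
print(f"implied initial IQ2_TN (PR #13): ~{implied_baseline(pp512_new, 15):.0f} t/s")
print(f"implied TQ2_0: ~{implied_baseline(pp512_new, 80):.0f} t/s")
```

The exact ratio 494 / 468 works out to about 5.6%, consistent with the "~5%" quoted above.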