ikawrakow closed this 3 weeks ago
Merged into my fork of Kobold CPP. K `q6_0` + V `q5_0` works like a charm. I also activated 16/6, 6/iq4_nl, as well as 8/6 and 6/6; I'll test them tonight or tomorrow. Edit: all the activated modes work and produce coherent generations.
Thank you very, very much, and congratulations on this, IK. I'm delighted to have these options, and thus the best inference quality I can get right now. I'll soon release an updated version of my fork, with proper credits of course, so that everyone interested (and not too scared of downloading my patchwork) can enjoy the fruits of your labor on these KV quants, just as some already enjoyed a bit more CPU speed thanks to the commits of yours I was able to merge a few months ago!
As with `IQ4_NL`, this is only for a head size of 128 for now. Without `GGML_CUDA_FA_ALL_QUANTS` set, only `Q6_0 + Q5_0` and `Q8_0 + Q6_0` are included. With this, the VRAM-poor have better options for selecting the best possible (as allowed by VRAM, model size, and context length) quantized KV cache from the available type combinations.
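For reference, a minimal sketch of how this would typically be built and used. The `-ctk`/`-ctv` flags and the `llama-cli` binary name follow the upstream llama.cpp CLI conventions, and the `q6_0`/`q5_0` cache-type strings are assumptions based on this PR, not commands quoted from the thread:

```bash
# Build with CUDA. By default only the Q6_0+Q5_0 and Q8_0+Q6_0
# FlashAttention KV-cache combinations are compiled in.
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

# To compile all supported K/V quantization combinations instead
# (larger binary, longer build time):
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON
cmake --build build --config Release

# Run with FlashAttention and the new quantized KV cache
# (model head size must be 128 for these types, per the PR):
./build/bin/llama-cli -m model.gguf -fa \
  -ctk q6_0 -ctv q5_0 \
  -c 16384 -p "Hello"
```

The default pair `q6_0`/`q5_0` keeps the compiled kernel count (and build time) small while still covering the quality/VRAM trade-off this PR targets; the higher-quality `q8_0`/`q6_0` combination is likewise always available.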