GabeAl opened 2 months ago
I had to switch to ollama to run a 70B model at 128K context. Please add support for Q4/Q8 KV cache quantization.
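(For reference: ollama exposes this through the OLLAMA_KV_CACHE_TYPE environment variable, which, if I recall correctly, accepts f16, q8_0, and q4_0 and requires OLLAMA_FLASH_ATTENTION=1 to take effect.)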
@dmatora @GabeAl Is the specific ask here a way to set the KV cache quantization level?
yes
Yes, absolutely. Many users, including on Discord, have commented about this; it is a major bottleneck holding users back.
I depend on large contexts, and large contexts need KV cache quantization (e.g. Q4) to be feasible on commodity hardware.
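For concreteness, here is a back-of-the-envelope sketch of why. The model shape below (80 layers, 8 KV heads via GQA, head dim 128) is my assumption of a Llama-3-70B-class model, and the per-value bit costs for q8_0/q4_0 are approximate GGML figures including block-scale overhead:

```python
# Rough KV cache sizing for a 70B-class model (assumed shape: 80 layers,
# 8 KV heads via GQA, head dim 128). Numbers are illustrative, not exact.
def kv_cache_gib(n_ctx, bits_per_value, n_layers=80, n_kv_heads=8, head_dim=128):
    # The K and V caches together hold 2 * n_layers * n_ctx * n_kv_heads * head_dim values
    values = 2 * n_layers * n_ctx * n_kv_heads * head_dim
    return values * bits_per_value / 8 / 2**30

# q8_0 / q4_0 carry roughly 0.5 bits/value of block-scale overhead in GGML
for name, bits in [("fp16", 16), ("q8_0", 8.5), ("q4_0", 4.5)]:
    print(f"{name}: {kv_cache_gib(131072, bits):.1f} GiB of KV cache at 128K context")
```

Under those assumptions that works out to roughly 40 GiB of KV cache at fp16 versus about 11 GiB at q4_0, which is the difference between impossible and feasible on a single consumer GPU.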
I accidentally posted a bug in the CLI version of the bug tracker:
"Bug fix: Flash Attention - KV cache quantization is stuck at FP16 with no way to revert to Q4_0"
The gist of it: no way to set flash attention quants = no way to fit large contexts on the GPU = a regression.
This cascades into a whole series of usability regressions.
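For what it's worth, the underlying llama.cpp engine already has this knob, so the ask is essentially to surface it. A minimal sketch of the setting via llama-cpp-python (this is the bindings' API as I understand it, not LM Studio's; the model path and context size are placeholders):

```python
# Sketch: requesting a quantized KV cache through llama-cpp-python.
# type_k / type_v select the cache dtypes; quantizing the V cache requires
# flash attention in llama.cpp (hence flash_attn=True).
import llama_cpp
from llama_cpp import Llama

llm = Llama(
    model_path="./models/70b-instruct.Q4_K_M.gguf",  # placeholder path
    n_ctx=131072,                                    # 128K context
    flash_attn=True,
    type_k=llama_cpp.GGML_TYPE_Q4_0,  # quantize the K cache to Q4_0
    type_v=llama_cpp.GGML_TYPE_Q4_0,  # quantize the V cache to Q4_0
)
```

An equivalent per-model setting (or even a global default) in LM Studio would close the gap.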
Feel free to close the other one I made earlier: https://github.com/lmstudio-ai/lms/issues/70