GabeAl opened 2 months ago
I had to switch to ollama to run a 70B model at 128K context. Please add support for Q4/Q8 KV cache quantization.
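(For reference: ollama exposes this through the OLLAMA_KV_CACHE_TYPE environment variable, which, if I recall correctly, accepts f16, q8_0, and q4_0 and requires OLLAMA_FLASH_ATTENTION=1 to take effect.)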
@dmatora @GabeAl Is the specific ask here a way to set the KV cache quantization level?
yes
Yes, absolutely. Many users, including on Discord, have commented about this; it is a major bottleneck holding users back.
I depend on large contexts, and large contexts need KV cache quantization (e.g. Q4) to be feasible on commodity hardware.
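For concreteness, here is a back-of-the-envelope sketch of why. The model shape below (80 layers, 8 KV heads via GQA, head dim 128) is my assumption of a Llama-3-70B-class model, and the per-value bit costs for q8_0/q4_0 are approximate GGML figures including block-scale overhead:

```python
# Rough KV cache sizing for a 70B-class model (assumed shape: 80 layers,
# 8 KV heads via GQA, head dim 128). Numbers are illustrative, not exact.
def kv_cache_gib(n_ctx, bits_per_value, n_layers=80, n_kv_heads=8, head_dim=128):
    # The K and V caches together hold 2 * n_layers * n_ctx * n_kv_heads * head_dim values
    values = 2 * n_layers * n_ctx * n_kv_heads * head_dim
    return values * bits_per_value / 8 / 2**30

# q8_0 / q4_0 carry roughly 0.5 bits/value of block-scale overhead in GGML
for name, bits in [("fp16", 16), ("q8_0", 8.5), ("q4_0", 4.5)]:
    print(f"{name}: {kv_cache_gib(131072, bits):.1f} GiB of KV cache at 128K context")
```

Under those assumptions that works out to roughly 40 GiB of KV cache at fp16 versus about 11 GiB at q4_0, which is the difference between impossible and feasible on a single consumer GPU.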
I accidentally posted a bug in the CLI version of the bug tracker:
"Bug fix: Flash Attention - KV cache quantization is stuck at FP16 with no way to revert to Q4_0"
The gist of it: no way to set flash attention quants = no way to fit large contexts on the GPU = a regression.
This cascades into a whole series of usability regressions.
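For what it's worth, the underlying llama.cpp engine already has this knob, so the ask is essentially to surface it. A minimal sketch of the setting via llama-cpp-python (this is the bindings' API as I understand it, not LM Studio's; the model path and context size are placeholders):

```python
# Sketch: requesting a quantized KV cache through llama-cpp-python.
# type_k / type_v select the cache dtypes; quantizing the V cache requires
# flash attention in llama.cpp (hence flash_attn=True).
import llama_cpp
from llama_cpp import Llama

llm = Llama(
    model_path="./models/70b-instruct.Q4_K_M.gguf",  # placeholder path
    n_ctx=131072,                                    # 128K context
    flash_attn=True,
    type_k=llama_cpp.GGML_TYPE_Q4_0,  # quantize the K cache to Q4_0
    type_v=llama_cpp.GGML_TYPE_Q4_0,  # quantize the V cache to Q4_0
)
```

An equivalent per-model setting (or even a global default) in LM Studio would close the gap.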
Feel free to close the other one I made earlier: https://github.com/lmstudio-ai/lms/issues/70