lmstudio-ai / lmstudio-bug-tracker

Bug tracking for the LM Studio desktop application

Bug: No way to set the quantisation used for the k/v context cache #186

Open sammcj opened 2 weeks ago

sammcj commented 2 weeks ago

I was trying to find where to set the quantisation used for the K/V context cache, and it seems you can't in LM Studio.

K/V cache quantisation is needed to run models with long contexts efficiently: it reduces VRAM usage, allowing larger context sizes than keeping the K/V cache at full fp16, which gives little to no quality benefit over Q8_0.

vLLM, Aphrodite Engine, llama.cpp, ExLlamaV2, mistral.rs, MLX, etc. all support this.

When running llama.cpp directly, these are the --cache-type-k and --cache-type-v settings, which you'd usually set to q8_0 unless there's a good reason to do otherwise. In ExLlamaV2 the equivalent options are cache_4bit and cache_8bit.
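For reference, here's a minimal sketch of what that looks like when launching llama.cpp's server yourself (the model path and context size are placeholders, not anything specific to LM Studio; note that llama.cpp requires flash attention to be enabled in order to quantise the V cache):

```sh
# Hypothetical invocation: quantise both the K and V halves of the context cache to q8_0.
# -fa (flash attention) is required by llama.cpp for a quantised V cache.
./llama-server \
  -m ./models/some-model.gguf \
  -c 32768 \
  -fa \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```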

Setting cache-type-k and cache-type-v to Q8_0 halves the (V)RAM used by the context, with no measurable impact on quality.
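As a back-of-the-envelope illustration (assuming roughly Qwen2.5-32B's published shape of 64 layers, 8 KV heads, and head dimension 128; these are not figures measured in LM Studio): at 32k context the K/V cache holds 2 × 64 × 32768 × 8 × 128 elements, which comes to about 8 GiB at fp16 (2 bytes per element) but only about 4.25 GiB at Q8_0 (8.5 bits per element).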

(As per https://discord.com/channels/1110598183144399058/1302371015292227616)

I'd submit a PR to pass this functionality down to llama.cpp and MLX, as I have done for Ollama, but LM Studio's source is closed.

phazei commented 1 week ago

I was just looking for a setting for this. With the new Qwen 32B Coder, not being able to run a lower context cache quant is really disappointing.