I was trying to find where to set the quantisation used for the K/V context cache, and it seems you can't in LM Studio.
K/V cache quantisation is needed to run model context efficiently: it reduces the vRAM used by the context, which allows running larger context sizes than keeping the K/V cache at full fp16, and fp16 gives little to no quality benefit over Q8_0.
vLLM, Aphrodite-Engine, llama.cpp, ExLlamaV2, Mistral-RS, MLX, etc. all support this.
When running llama.cpp directly, these are the --cache-type-k and --cache-type-v settings, which you'd usually set to Q8_0 unless there was a good reason to do otherwise; in ExLlamaV2 this is cache_4bit or cache_8bit.
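For reference, a minimal llama.cpp invocation with a quantised K/V cache might look like the sketch below (the model path and context size are just placeholders, and the binary name may be llama-server or server depending on the build; on recent builds quantising the V cache may also require enabling flash attention):

```
# Quantise both the K and V caches to Q8_0 (model path and context size are placeholders)
./llama-server -m ./models/model.gguf -c 16384 \
  --cache-type-k q8_0 --cache-type-v q8_0
```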
Setting cache-type-k and cache-type-v to Q8_0 roughly halves the (v)RAM used by the context, with no measurable impact on quality.
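As a rough illustration with assumed numbers for an 8B-class model (32 layers, 8 KV heads, head dim 128): an 8192-token context needs about 2 × 32 × 8192 × 8 × 128 × 2 bytes ≈ 1 GiB of K/V cache at fp16, and roughly half that at Q8_0, since Q8_0 stores about 8.5 bits per value instead of 16.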
(As per https://discord.com/channels/1110598183144399058/1302371015292227616)
I'd submit a PR to add the ability to pass this functionality down to llama.cpp and MLX, as I have done for Ollama, but LM Studio's source is closed.