openppl-public / ppl.llm.serving


How to disable kv cache 8-bit quantization? #19

Closed: sleepwalker2017 closed this 9 months ago

sleepwalker2017 commented 10 months ago

I tried to disable this feature by setting `--quantized_cache 0` when exporting the model, but loading the model then core dumps.

I see this check in the code:

```cpp
if (model_config_.cache_quant_bit != 8 && model_config_.cache_quant_group != 8) {
    LOG(ERROR) << "only support cache_quant_bit == 8 and cache_quant_group == 8";
    return RC_INVALID_VALUE;
}
```
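One possible reading of the crash, sketched below as a minimal standalone program. It assumes that `--quantized_cache 0` sets `cache_quant_bit` to 0 while `cache_quant_group` stays at its default of 8 (an assumption, not confirmed by the source): because the guard uses `&&`, such a config is not rejected here, so the failure would surface later in kernels that only handle the 8-bit path.

```cpp
#include <iostream>

// Hypothetical stand-in for the real model config; only the two fields
// from the quoted guard are modeled.
struct ModelConfig {
    int cache_quant_bit;
    int cache_quant_group;
};

bool RejectedByGuard(const ModelConfig& c) {
    // Same logic as the snippet above: the error branch fires only when
    // BOTH fields differ from 8.
    return c.cache_quant_bit != 8 && c.cache_quant_group != 8;
}

int main() {
    // Assumed effect of --quantized_cache 0: bit = 0, group still 8.
    ModelConfig unquantized{/*cache_quant_bit=*/0, /*cache_quant_group=*/8};
    // Prints "rejected: 0": the config passes the check even though the
    // kernels cannot handle it, so loading proceeds and crashes later.
    std::cout << "rejected: " << RejectedByGuard(unquantized) << "\n";
}
```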
Alcanderian commented 10 months ago

We are currently focusing on performance optimization, and the CUDA kernels only support the 8-bit KV cache, to avoid maintaining multiple copies of the code. FP16 KV cache will be supported once the kernels stabilize.
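For context, below is a minimal sketch of what group-wise 8-bit KV-cache quantization typically looks like, under the assumption that `cache_quant_group == 8` means each group of 8 values shares one scale. This illustrates the general technique only; the actual layout and kernels in ppl.llm.serving may differ.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Quantize a KV-cache row to int8 with one float scale per group of 8
// values (group size matching the assumed meaning of cache_quant_group).
// Assumes src.size() is a multiple of the group size.
void QuantizeGroups(const std::vector<float>& src,
                    std::vector<int8_t>& dst,
                    std::vector<float>& scales) {
    const size_t group = 8;
    dst.resize(src.size());
    scales.resize(src.size() / group);
    for (size_t g = 0; g < scales.size(); ++g) {
        // Symmetric quantization: scale each group by its max magnitude.
        float max_abs = 0.f;
        for (size_t i = 0; i < group; ++i)
            max_abs = std::max(max_abs, std::fabs(src[g * group + i]));
        const float scale = max_abs > 0.f ? max_abs / 127.f : 1.f;
        scales[g] = scale;
        for (size_t i = 0; i < group; ++i)
            dst[g * group + i] =
                static_cast<int8_t>(std::lround(src[g * group + i] / scale));
    }
}
```

Supporting an FP16 cache alongside this would mean a second code path through every attention kernel, which matches the maintainers' stated reason for deferring it.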