Closed: sleepwalker2017 closed this 9 months ago
We are currently focusing on performance optimization, and the CUDA kernels only support the 8-bit KV cache, to avoid maintaining multiple copies of the code. FP16 KV cache will be supported once the kernels are stable.
I tried to disable this feature by setting
--quantized_cache 0
when exporting the model, but it coredumps when loading the model. I see this: