kvcache-ai / ktransformers

A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations
Apache License 2.0

How to infer quantized models on CPU&GPU #103

Closed · shuzhang-pku closed this issue 1 month ago

shuzhang-pku commented 1 month ago

I found that ktransformers first performs a dequantization step when loading the weights. Due to DRAM limitations, I want to run inference directly with the quantized weights on CPU and GPU. How can I implement this?

Azure-Tang commented 1 month ago

The dequantized weights are re-quantized into the Marlin format so the (fast) Marlin op can be used. So if you are using the optimize rules we provide, you are already running inference directly on quantized weights.
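
For intuition, here is a minimal, self-contained PyTorch sketch of that flow: the full-precision weight is only a transient intermediate, and what stays resident is a compact set of integer codes plus per-group scales that a fast quantized kernel (Marlin, in ktransformers' case) consumes directly. The helper names, the group size, and the int8 storage below are illustrative assumptions for the sketch, not the actual ktransformers/Marlin implementation (Marlin packs 4-bit codes and runs fused GPU kernels).

```python
import torch

GROUP = 128  # per-group quantization granularity (128 is a common group size)

def quantize_4bit(w: torch.Tensor):
    """Symmetric 4-bit group quantization: returns integer codes plus per-group scales."""
    out_f, in_f = w.shape
    grouped = w.reshape(out_f, in_f // GROUP, GROUP)
    scales = grouped.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / 7.0
    codes = torch.clamp(torch.round(grouped / scales), -8, 7).to(torch.int8)
    return codes.reshape(out_f, in_f), scales.squeeze(-1)

def dequant_matmul(x: torch.Tensor, codes: torch.Tensor, scales: torch.Tensor):
    """Reference matmul that expands the quantized codes group-by-group on the fly."""
    out_f, in_f = codes.shape
    w = (codes.reshape(out_f, in_f // GROUP, GROUP).float()
         * scales.unsqueeze(-1)).reshape(out_f, in_f)
    return x @ w.t()

if __name__ == "__main__":
    torch.manual_seed(0)
    w = torch.randn(256, 512)          # stands in for the transiently dequantized weight
    codes, scales = quantize_4bit(w)   # only codes + scales need to stay resident
    x = torch.randn(4, 512)
    err = (dequant_matmul(x, codes, scales) - x @ w.t()).abs().max()
    print(f"max abs error vs. full-precision matmul: {err:.4f}")
```

So the dequantize-on-load step does not mean the model is held in full precision: the peak memory cost is paid briefly per tensor during conversion, after which only the re-quantized representation is kept.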