I found that ktransformers first performs a dequantize operation when loading the weights. Due to DRAM limitations, I want to directly infer the model with quantized weights on CPU&GPU. How can I implement this?
The dequantized weights are quantized again into the Marlin format so that the (fast) Marlin op can be used. So if you load the model with the optimization rules we provide, you are already running inference directly on quantized weights.
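For reference, the rule files shipped with the repo replace `torch.nn.Linear` modules with a Marlin-backed operator on GPU. Below is a minimal sketch of such a rule; the exact class paths, regex, and kwargs are illustrative and may differ between ktransformers versions, so check the YAML files under the project's optimize-rules directory for your model.

```yaml
# Sketch of an optimize rule: match Linear layers inside the decoder blocks
# and replace them with a KTransformersLinear that keeps weights quantized
# (Marlin kernel on GPU). Field names follow the shipped rule files but may
# vary by version.
- match:
    name: "^model\\.layers\\..*$"      # regex over module names
    class: torch.nn.Linear             # only replace Linear modules
  replace:
    class: ktransformers.operators.linear.KTransformersLinear
    kwargs:
      generate_device: "cuda"
      generate_op: "KLinearMarlin"     # quantized Marlin kernel for decoding
      prefill_device: "cuda"
      prefill_op: "KLinearTorch"       # plain torch op for prefill
```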