kvcache-ai / ktransformers

A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations
Apache License 2.0

confused about the cpu memory. #19

Closed Eutenacity closed 3 months ago

Eutenacity commented 3 months ago

It seems that all expert tensors loaded from GGUF are dequantized into float32. Won't that consume a large amount of CPU memory? Is it possible to reduce the CPU memory usage?

"ktransformers/ktransformers/util/custom_gguf.py" "line 274 def load_gguf_tensor ..."