kvcache-ai / ktransformers

A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations

If I want to run a linear layer with q4_k_m on CPU using llamafile, how do I do it with your implementation? #20

Closed by Eutenacity 3 months ago

Eutenacity commented 3 months ago

I found bench_linear.py, but the weight there is FP32, not quantized uint8. Can you give me an example? Thanks.

chenht2022 commented 3 months ago

You are correct. For testing convenience, the bench program randomly generates some FP32 weights and lets llamafile_sgemm treat them as quantized weights. The output obtained this way is meaningless; we use the bench program solely to measure performance. In actual use, the weights are mmap-ed from the GGUF file rather than randomly generated, so the computation results are meaningful.
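To make the distinction concrete, here is a minimal sketch (an illustration, not ktransformers code) of the two ways the same kernel can be fed: random bytes reinterpreted as Q4_K blocks for benchmarking, versus bytes mmap-ed from a real GGUF file for meaningful output. The `data_offset` parameter is a stand-in for values you would get from parsing the GGUF tensor metadata; the actual kernel call lives in ktransformers' C++ extension.

```python
import numpy as np

# GGML Q4_K layout: 256 weights per super-block, 144 bytes per block
# (2-byte d, 2-byte dmin, 12 scale bytes, 128 nibble-packed quants).
QK_K = 256
Q4_K_BLOCK_BYTES = 144

def random_q4k_bytes(rows: int, cols: int) -> np.ndarray:
    """Benchmark path: random bytes reinterpreted as Q4_K blocks.
    The kernel runs at full speed, but the numeric output is garbage."""
    n_blocks = (rows * cols) // QK_K
    return np.random.randint(0, 256, size=n_blocks * Q4_K_BLOCK_BYTES,
                             dtype=np.uint8)

def mmap_q4k_bytes(gguf_path: str, data_offset: int, n_bytes: int) -> np.ndarray:
    """Real path: the same byte layout, but mmap-ed from a GGUF file,
    so the blocks are genuine quantized weights and the matmul result
    is meaningful. data_offset comes from the GGUF tensor metadata."""
    return np.memmap(gguf_path, dtype=np.uint8, mode="r",
                     offset=data_offset, shape=(n_bytes,))
```

Either buffer can be handed to the same CPU kernel; only the provenance of the bytes differs, which is why the benchmark numbers are valid even though the outputs are not.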

Injection support for linear layers is coming soon; you will then be able to run a linear layer on the CPU using llamafile by modifying a YAML configuration file, as sketched below.
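For reference, here is a hedged sketch of what such a YAML rule might look like, following the match/replace rule structure ktransformers uses for injection. Since the feature had not shipped at the time of this comment, the regex, class path, and kwargs below are illustrative assumptions rather than the released configuration.

```yaml
# Hypothetical rule: replace every torch.nn.Linear whose module name
# matches the regex with a llamafile-backed CPU linear. The class path
# and op name are assumptions for illustration only.
- match:
    name: "^model\\.layers\\..*$"
    class: torch.nn.Linear
  replace:
    class: ktransformers.operators.linear.KTransformersLinear
    kwargs:
      generate_device: "cpu"
      generate_op: "KLinearCPUInfer"
```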