kvcache-ai / ktransformers

A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations

If I want to run a linear layer with q4_k_m on CPU using llamafile, how do I do it with your implementation? #20

Closed by Eutenacity 3 months ago

Eutenacity commented 3 months ago

I found bench_linear.py, but the weight there is FP32, not quantized uint8. Can you give me an example? Thanks.

chenht2022 commented 3 months ago

You are correct. For testing convenience, the bench program randomly generates some FP32 weights and lets llamafile_sgemm treat them as quantized weights. The output obtained this way is meaningless; we use the bench program solely to measure performance. In actual use, the weights are mmap-ed from the GGUF file rather than randomly generated, so the computation results are meaningful.
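To make the distinction concrete, here is a minimal sketch (an illustration, not ktransformers code) of the two ways the same kernel can be fed: random bytes reinterpreted as Q4_K blocks for benchmarking, versus bytes mmap-ed from a real GGUF file for meaningful output. The `data_offset` parameter is a stand-in for values you would get from parsing the GGUF tensor metadata; the actual kernel call lives in ktransformers' C++ extension.

```python
import numpy as np

# GGML Q4_K layout: 256 weights per super-block, 144 bytes per block
# (2-byte d, 2-byte dmin, 12 scale bytes, 128 nibble-packed quants).
QK_K = 256
Q4_K_BLOCK_BYTES = 144

def random_q4k_bytes(rows: int, cols: int) -> np.ndarray:
    """Benchmark path: random bytes reinterpreted as Q4_K blocks.
    The kernel runs at full speed, but the numeric output is garbage."""
    n_blocks = (rows * cols) // QK_K
    return np.random.randint(0, 256, size=n_blocks * Q4_K_BLOCK_BYTES,
                             dtype=np.uint8)

def mmap_q4k_bytes(gguf_path: str, data_offset: int, n_bytes: int) -> np.ndarray:
    """Real path: the same byte layout, but mmap-ed from a GGUF file,
    so the blocks are genuine quantized weights and the matmul result
    is meaningful. data_offset comes from the GGUF tensor metadata."""
    return np.memmap(gguf_path, dtype=np.uint8, mode="r",
                     offset=data_offset, shape=(n_bytes,))
```

Either buffer can be handed to the same CPU kernel; only the provenance of the bytes differs, which is why the benchmark numbers are valid even though the outputs are not.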

Injection support for linear layers is coming soon; you will then be able to run a linear layer on the CPU using llamafile by modifying a YAML configuration file, as sketched below.
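For reference, here is a hedged sketch of what such a YAML rule might look like, following the match/replace rule structure ktransformers uses for injection. Since the feature had not shipped at the time of this comment, the regex, class path, and kwargs below are illustrative assumptions rather than the released configuration.

```yaml
# Hypothetical rule: replace every torch.nn.Linear whose module name
# matches the regex with a llamafile-backed CPU linear. The class path
# and op name are assumptions for illustration only.
- match:
    name: "^model\\.layers\\..*$"
    class: torch.nn.Linear
  replace:
    class: ktransformers.operators.linear.KTransformersLinear
    kwargs:
      generate_device: "cpu"
      generate_op: "KLinearCPUInfer"
```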