You are correct. For testing convenience, our bench program randomly generates some FP32 weights and lets llamafile_sgemm treat them as quantized weights. The output obtained this way is meaningless; we use the bench program solely to measure performance. In actual use, the weights are mmap-ed from the GGUF file rather than randomly generated, so the computation results are meaningful.
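To make the distinction concrete, below is a minimal sketch (not the project's actual code) of how FP32 weights could be block-quantized to int8 before being handed to a quantized kernel, instead of reinterpreting raw FP32 bytes as quantized blocks. The Q8_0-style layout (one per-block scale plus 32 int8 values) and the block size are assumptions made purely for illustration.

```python
# Hypothetical illustration: block-quantize FP32 weights to int8 (Q8_0-style).
# Block size and layout are assumptions; they do not describe llamafile_sgemm's
# internal format.
import numpy as np

BLOCK_SIZE = 32  # assumed block size


def quantize_q8_0(weights_fp32: np.ndarray):
    """Quantize a flat FP32 array into (scale, int8[BLOCK_SIZE]) blocks."""
    flat = weights_fp32.astype(np.float32).ravel()
    assert flat.size % BLOCK_SIZE == 0, "pad weights to a multiple of the block size"
    blocks = []
    for i in range(0, flat.size, BLOCK_SIZE):
        block = flat[i:i + BLOCK_SIZE]
        amax = np.abs(block).max()
        scale = amax / 127.0 if amax > 0 else 1.0
        q = np.clip(np.round(block / scale), -127, 127).astype(np.int8)
        blocks.append((np.float16(scale), q))
    return blocks


def dequantize_q8_0(blocks) -> np.ndarray:
    """Reverse the quantization to check the reconstruction error."""
    return np.concatenate([q.astype(np.float32) * np.float32(s) for s, q in blocks])


if __name__ == "__main__":
    w = np.random.randn(4096).astype(np.float32)   # stand-in for real weights
    blocks = quantize_q8_0(w)
    err = np.abs(dequantize_q8_0(blocks) - w).max()
    print(f"max reconstruction error: {err:.4f}")  # small, so results stay meaningful
```

With properly quantized (or mmap-ed) weights the kernel's numerical output is meaningful; with random FP32 buffers reinterpreted as quantized blocks, only the timing is.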
Injection of the linear layer is coming soon; you will then be able to run a linear layer on the CPU with llamafile by modifying a YAML configuration file.
I found bench_linear.py, but the weights there are FP32, not quantized uint8. Can you give me an example? Thanks.