Closed qiuqiu10 closed 3 weeks ago
T-MAC is only a kernel library that accelerates mixed-precision GEMM, while llama.cpp is an inference framework. The current end-to-end inference is implemented by integrating the T-MAC kernels into llama.cpp.
OK, I got it. So can I assume that I will get the results shown in the README when running run_pipeline.py? For example, running `python tools/run_pipeline.py -nt 12 -o /path/to/BitNet-3B-b1.58 -m hf-bitnet-3b` on my Jetson AGX Orin and getting more than 20 tokens/s?
`main` is only used to demo correct output. You should use `llama-bench` to benchmark inference throughput.
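For reference, a benchmark invocation might look like the sketch below. The binary location, model filename, and flag values are assumptions based on a typical llama.cpp build inside the T-MAC tree; adjust them to wherever run_pipeline.py placed your build and converted model.

```shell
# Hypothetical paths: adjust to your T-MAC checkout and converted .gguf file.
# -t: number of threads (match the -nt value you passed to run_pipeline.py)
# -p: prompt-processing tokens per run, -n: generated tokens per run
./3rdparty/llama.cpp/build/bin/llama-bench \
    -m /path/to/BitNet-3B-b1.58/ggml-model.int_n.gguf \
    -t 12 -p 512 -n 128
```

`llama-bench` reports prompt-processing and token-generation throughput (tokens/s) separately; the README's tokens/s figures correspond to the generation numbers, so compare against those.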
Hi, I'm currently using a Jetson AGX Orin to run T-MAC and I'm confused about this: the README instructed me to run run_pipeline.py, which ends up invoking llama.cpp/build/main. I'm not sure about the relationship between T-MAC and llama.cpp, and I'm not sure whether this is actually running the T-MAC optimizations or simply testing llama.cpp. From inspecting the code, T-MAC is integrated into llama.cpp, so I suppose llama.cpp is used to convert the GGUF file and then T-MAC is applied, but I'm not sure if this is correct.
I've already run run_pipeline.py, but for the above reason I don't know how to proceed.