microsoft / T-MAC

Low-bit LLM inference on CPU with lookup table
MIT License
420 stars 32 forks

Question about running run_pipeline.py #17

Closed qiuqiu10 closed 3 weeks ago

qiuqiu10 commented 4 weeks ago

Hi, I'm currently using a Jetson AGX Orin to run T-MAC and I'm confused about one thing: the README instructs me to run run_pipeline.py, which ends up calling llama.cpp/build/main. I'm not sure about the relation between T-MAC and llama.cpp, and I'm not sure whether this is actually running the T-MAC optimizations or simply testing llama.cpp. From inspecting the code, T-MAC appears to be integrated into llama.cpp, so I suppose llama.cpp is used to convert the GGUF file and then T-MAC is applied. Is that correct?

I've already run run_pipeline.py, but for the reason above I don't know how to proceed.

kaleid-liner commented 4 weeks ago

T-MAC is only a kernel library that accelerates mixed-precision GEMM. llama.cpp is a framework for inference. The current end-to-end inference is implemented by integrating T-MAC kernels into llama.cpp.
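To make the division of labor concrete, here is a rough sketch of the flow, based on what is described in this thread. The step descriptions are illustrative assumptions, not the script's literal internals; the command itself is the one from the README/this thread:

```bash
# Rough sketch of what the pipeline does end to end (illustrative, not exact):
# 1. Compile the T-MAC lookup-table kernels for the target CPU.
# 2. Convert/quantize the model into a GGUF file that llama.cpp can load.
# 3. Run llama.cpp, built with T-MAC integrated, so the low-bit GEMM calls are
#    dispatched to the T-MAC kernels instead of llama.cpp's stock kernels.
python tools/run_pipeline.py -nt 12 -o /path/to/BitNet-3B-b1.58 -m hf-bitnet-3b
```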

qiuqiu10 commented 4 weeks ago

OK, I got it. So can I assume I'll get the results shown in the README when running run_pipeline.py? For example, running python tools/run_pipeline.py -nt 12 -o /path/to/BitNet-3B-b1.58 -m hf-bitnet-3b on my Jetson AGX Orin and getting more than 20 tokens/s?

kaleid-liner commented 4 weeks ago

main is used to demo correct output. You should use llama-bench to benchmark inference throughput.
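For reference, a minimal llama-bench invocation might look like the sketch below. The binary location, GGUF path, and thread count are placeholders that depend on how llama.cpp was built in your checkout:

```bash
# Hypothetical example (assumes llama.cpp was built under 3rdparty/llama.cpp;
# adjust the binary path, model path, and thread count to your setup):
#   -m : path to the converted GGUF model (placeholder below)
#   -t : number of CPU threads
#   -p : prompt tokens to process (0 skips the prompt-processing benchmark)
#   -n : tokens to generate for the decode-throughput measurement
./3rdparty/llama.cpp/build/bin/llama-bench \
    -m /path/to/BitNet-3B-b1.58/ggml-model.gguf \
    -t 12 -p 0 -n 128
```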