pytorch-labs / gpt-fast

Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.

The Actual Throughput of int8 Quantization is Significantly Lower than Baseline on A100 #207

Open crhcrhcrhcrh opened 2 days ago

crhcrhcrhcrh commented 2 days ago

When I run inference with the int8-quantized Llama 7B model, I only get around 42 tokens/s, far below the 155 tokens/s stated in the documentation. My execution environment is below:

Python version: 3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0]
PyTorch version: 2.3.1+cu121
CUDA version: 12.1
[screenshot of the run attached]
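For reference, a minimal sketch (not part of the original report) of how the same environment details plus the GPU identity can be captured in one place, using only standard PyTorch APIs:

```python
import torch

# Quick environment summary for bug reports like this one.
print("PyTorch:", torch.__version__)                    # e.g. 2.3.1+cu121
print("CUDA:   ", torch.version.cuda)                   # e.g. 12.1
print("GPU:    ", torch.cuda.get_device_name(0))
print("SM:     ", torch.cuda.get_device_capability(0))  # A100 -> (8, 0)
print("bf16:   ", torch.cuda.is_bf16_supported())
```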

crhcrhcrhcrh commented 2 days ago

The device I am using is the A100.
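One way to sanity-check whether the int8 matmul path itself is the bottleneck: if I read gpt-fast's int8 mode correctly, it dequantizes weights on the fly, which is only fast when `torch.compile` fuses the dequantization into the matmul; run eagerly, it materializes a full bf16 copy of each weight matrix. Below is a minimal micro-benchmark sketch (shapes, names, and scale layout are illustrative assumptions, not gpt-fast code) that shows the effect:

```python
import time
import torch

torch.manual_seed(0)
device = "cuda"
n, k = 4096, 4096  # roughly a Llama-7B projection size

x = torch.randn(1, k, device=device, dtype=torch.bfloat16)
w_bf16 = torch.randn(n, k, device=device, dtype=torch.bfloat16)
# int8 weight-only: int8 weights plus a per-output-channel bf16 scale
w_int8 = torch.randint(-128, 127, (n, k), device=device, dtype=torch.int8)
scales = torch.rand(n, device=device, dtype=torch.bfloat16)

def bench(fn, iters=200):
    for _ in range(10):  # warmup
        fn()
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters * 1e6  # microseconds per call

print("bf16 matmul:   ", bench(lambda: x @ w_bf16.t()), "us")
# Eager dequant-then-matmul writes out a full bf16 weight copy every call,
# so it is expected to be slower than the plain bf16 baseline.
print("int8 eager:    ", bench(lambda: x @ (w_int8.to(x.dtype) * scales[:, None]).t()), "us")

compiled = torch.compile(lambda x: x @ (w_int8.to(x.dtype) * scales[:, None]).t())
print("int8 compiled: ", bench(lambda: compiled(x)), "us")
```

If the eager int8 case is much slower than bf16 while the compiled case is not, a missing `--compile` flag in the generation run would be a plausible explanation for the gap between 42 tokens/s and the documented 155 tokens/s.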