Hi Llama3 team,
Could you help me figure out how to speed up inference for the 70B model? A single prompt takes more than 50s to generate, and I have tried TensorRT but did not see a noticeable speedup.
Hi, you could try using torch.compile(mode='reduce-overhead') to speed up inference with CUDA graphs. We have some examples using vLLM here: https://github.com/meta-llama/llama-recipes
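A minimal sketch of the torch.compile suggestion, assuming a Hugging Face Transformers checkpoint; the model id, prompt, and generation settings below are placeholders, not from this thread:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model id; substitute whichever 70B checkpoint you are serving.
model_id = "meta-llama/Meta-Llama-3-70B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision to fit the 70B weights
    device_map="auto",           # shard across available GPUs (requires accelerate)
)

# mode='reduce-overhead' captures CUDA graphs to cut per-step kernel launch
# overhead; the first few generations are slow while compilation warms up.
model = torch.compile(model, mode="reduce-overhead")

inputs = tokenizer("Explain KV caching briefly.", return_tensors="pt").to(model.device)
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

For serving many requests, the vLLM examples in the linked llama-recipes repo add continuous batching and paged KV caching on top of this, which usually matters more for throughput than compilation alone.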