Hi Llama3 team,
Could you help me figure out how to speed up inference for the 70B model? A single prompt takes more than 50s to generate, and I have tried TensorRT but did not see a noticeable speedup.
Hi, you could try using torch.compile(mode='reduce-overhead') to speed up inference with CUDA graphs. We have some examples using vLLM here: https://github.com/meta-llama/llama-recipes
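A minimal sketch of the torch.compile suggestion, assuming a Hugging Face Transformers checkpoint; the model id, prompt, and generation settings below are placeholders, not from this thread:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model id; substitute whichever 70B checkpoint you are serving.
model_id = "meta-llama/Meta-Llama-3-70B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision to fit the 70B weights
    device_map="auto",           # shard across available GPUs (requires accelerate)
)

# mode='reduce-overhead' captures CUDA graphs to cut per-step kernel launch
# overhead; the first few generations are slow while compilation warms up.
model = torch.compile(model, mode="reduce-overhead")

inputs = tokenizer("Explain KV caching briefly.", return_tensors="pt").to(model.device)
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

For serving many requests, the vLLM examples in the linked llama-recipes repo add continuous batching and paged KV caching on top of this, which usually matters more for throughput than compilation alone.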