hao-ai-lab / LookaheadDecoding

[ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
https://arxiv.org/abs/2402.02057
Apache License 2.0

How about batching throughput and energy consumption #10

Open Light-of-Hers opened 10 months ago

Light-of-Hers commented 10 months ago

The experiments are conducted on an A100 GPU, but for server GPUs, batched serving throughput is also an important metric to evaluate. Additionally, when focusing on personal GPUs, such as the RTX 30x0 or 40x0 series, or even mobile platforms, energy consumption may be a significant factor to consider. How does lade fare on these metrics?

Viol2000 commented 10 months ago

Thanks for your interest!

Regarding your first question, our focus is on reducing latency, and lade might not be the best option for optimizing throughput. I suggest considering vLLM for throughput-oriented workloads.

For your second question, there is indeed potential for speedup on the 30x0 series. For instance, on an RTX 3090 GPU running an FP16 7B model on MT-Bench, setting window=10 or 7 and level=5 can yield about a 1.4x speedup. This is achieved by trading extra FLOPs for fewer decoding steps, so the gain may be limited on less powerful GPUs.
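For reference, here is a minimal sketch of how those settings could be applied with this repo's `lade` package, assuming the `augment_all`/`config_lade` interface described in the README; the exact parameter names, the `GUESS_SET_SIZE` value, and the Hugging Face model ID below are illustrative assumptions, not a verified RTX 3090 setup:

```python
# Minimal sketch, assuming the lade package's documented interface
# (lade.augment_all / lade.config_lade). Parameter names and the model ID
# below are illustrative assumptions.
import lade
lade.augment_all()                  # patch transformers' decoding loop for lookahead decoding
lade.config_lade(LEVEL=5,           # n-gram level (matches the "level=5" setting above)
                 WINDOW_SIZE=7,     # lookahead window (7 or 10 as mentioned above)
                 GUESS_SET_SIZE=7,  # verification branch size (assumed value)
                 DEBUG=0)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # example 7B FP16 model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="cuda")

prompt = "Explain lookahead decoding in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```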

We're actively working on further optimizations to enhance performance.

Light-of-Hers commented 10 months ago

Thanks for your reply.

I'm pleased to hear that it can reduce LLM inference latency on PC GPUs. Unlike server GPUs such as the A100, PC GPUs like the 30x0 and 40x0 series face much lower demand for online inference throughput than for inference latency, so that is where lade will likely play a more significant role.