vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Performance]: MLP speculator #7748

Closed hustxiayang closed 2 weeks ago

hustxiayang commented 3 weeks ago

Proposal to improve performance

No response

Report of performance regression

Hi, sorry to trouble you! I compared the performance of llama3-8b-instruct with an MLP speculator (llama3-8b-accelerator) against the non-speculative llama3-8b-instruct through the offline engine, `LLM().generate()`, but I did not see any performance improvement.

Details: for speculative decoding, I use llama3-8b-instruct as the main model and llama3-8b-accelerator (https://huggingface.co/ibm-fms/llama3-8b-accelerator) as the speculative model. I set both `speculative_draft_tensor_parallel_size` and `tensor_parallel_size` to 1, set `use_v2_block_manager` to True, and leave the other parameters at their defaults. For sampling, I set `temperature` to 0.0 and `max_tokens` to 100, with the other parameters at their defaults. The GPU is an NVIDIA H100 80GB HBM3, the vLLM version is 0.5.4, and the CUDA version is 12. The prompt is the first question from the mt_bench dataset, so the batch size is 1.
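For reference, here is a minimal sketch of the setup I described (the main-model HF id and the prompt string below are placeholders rather than my exact values):

```python
from vllm import LLM, SamplingParams

# Main model plus the IBM MLP speculator as draft model, matching the
# settings described above (other parameters left at their defaults).
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",   # placeholder id for llama3-8b-instruct
    speculative_model="ibm-fms/llama3-8b-accelerator",
    tensor_parallel_size=1,
    speculative_draft_tensor_parallel_size=1,
    use_v2_block_manager=True,
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=100)

# A single prompt (first mt_bench question), so the batch size is 1.
outputs = llm.generate(["<first mt_bench question>"], sampling_params)
print(outputs[0].outputs[0].text)
```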

Any insights on why no speedup is observed?

A side question: it seems that if I initialize an `LLM()` within a function, the GPU memory and some other resources are not freed automatically when the function returns, and I have to free the GPU memory manually. Is this expected? If so, is there a recommended procedure for fully releasing these resources to avoid leaks?
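For context, the manual cleanup I currently resort to looks roughly like this (the `destroy_model_parallel` import is something I found suggested elsewhere and may not be the complete or recommended procedure):

```python
import gc

import torch
from vllm import LLM, SamplingParams
# Assumed cleanup hook; the exact import path may differ between vLLM versions.
from vllm.distributed.parallel_state import destroy_model_parallel


def run_once(prompt: str) -> str:
    llm = LLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",   # placeholder id
        speculative_model="ibm-fms/llama3-8b-accelerator",
        use_v2_block_manager=True,
    )
    out = llm.generate([prompt], SamplingParams(temperature=0.0, max_tokens=100))
    text = out[0].outputs[0].text

    # Manual cleanup: without this, the GPU memory held by the engine does not
    # appear to be released when llm goes out of scope.
    destroy_model_parallel()
    del llm
    gc.collect()
    torch.cuda.empty_cache()
    return text
```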

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

The output of `python collect_env.py`
njhill commented 3 weeks ago

@hustxiayang how long are the prompts you're using? There was an issue where speculative decoding got disabled when the number of tokens exceeded 2k. This is now fixed and will be in the upcoming 0.5.5 release.

hustxiayang commented 3 weeks ago

Hi @njhill, thanks for the response. The prompt was just 22 tokens.