daniyal214 opened this issue 8 months ago
RelayAttention is designed for batched LLM serving. It is useful when there are many concurrent requests (e.g., in a cloud serving scenario: imagine you are OpenAI or Microsoft, hosting LLMs for many users).
See Section 5 (Limitations and Future Works) in the paper.
To test it, just repeat your prompts many times (e.g., `prompts = prompts * 32`), and then you should see the difference.
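For example, something like this (just a minimal sketch with a placeholder model name and prompts; `enable_relay_attention` and `sys_prompt` are the arguments exposed by this fork's `LLM` class, as in the code later in this thread):

```python
from vllm import LLM, SamplingParams

# Repeat one user prompt to simulate many concurrent requests
# that all share the same system prompt.
prompts = ["Summarize the following document in one sentence."] * 32
sampling_params = SamplingParams(temperature=0.8, max_tokens=128)

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf",     # placeholder model
          enable_relay_attention=True,               # set to False for the baseline
          sys_prompt="You are a helpful assistant.") # the shared system prompt

outputs = llm.generate(prompts, sampling_params)
```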
By the way, the example (Figure 2) in my paper is just for demonstration purposes; labels like `<DOC>` and `</DOC>` may not be recognizable by the models you are using.
BTW, I'm also working on optimizations for device-side LLM inference; please stay tuned.
Got it, thank you @rayleizhu for the response. I'll test it with many prompts. However, one concern arises: when I execute, let's say, `prompts = prompts * 32` and then pass it to `outputs = llm.generate(prompts, sampling_params)`, will this result in concurrent requests? Or are these requests processed sequentially?
vLLM will schedule the requests to maximize concurrency. If you have enough GPU memory, it will run inference with a batch size as large as possible.
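So a single `llm.generate(...)` call over the whole prompt list is enough; you don't need to issue the requests one by one. If you want to control how aggressively it batches, the usual vLLM engine arguments should apply (a rough sketch; I'm using the standard argument names here, which may differ slightly across versions):

```python
from vllm import LLM, SamplingParams

prompts = ["What does RelayAttention do?"] * 32   # 32 "concurrent" requests
sampling_params = SamplingParams(max_tokens=64)

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf",
          gpu_memory_utilization=0.9,  # fraction of GPU memory the engine may use
          max_num_seqs=256)            # upper bound on sequences scheduled per step

# One call: the scheduler batches as many sequences per step as memory allows,
# rather than processing the 32 requests sequentially.
outputs = llm.generate(prompts, sampling_params)
```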
Thanks!
Hi, @rayleizhu
I tried the speed calculation both with `enable_relay_attention = True` and `enable_relay_attention = False`, and I can see the difference. But I want to understand: is this the correct way to calculate tokens/sec? Because, as said, vLLM just concatenates all incoming requests together into a batched request at the token they're at and sends them through, so it is like parallel request handling.
I'm calculating like this:
```python
import time

from vllm import LLM

# prompts, sampling_params, system_prompt, sys_schema, sys_prompt_file,
# and sys_schema_file are defined elsewhere in my script.
model = 'meta-llama/Llama-2-7b-chat-hf'
llm = LLM(model=model, quantization=None, enforce_eager=False,
          tensor_parallel_size=2,
          enable_relay_attention=True,
          sys_prompt=system_prompt,
          sys_schema=sys_schema,
          sys_prompt_file=sys_prompt_file,
          sys_schema_file=sys_schema_file)

print("STARTED>>>")
start_time = time.time()
outputs = llm.generate(prompts, sampling_params)
end_time = time.time()
inference_time = end_time - start_time
print(f"Time Taken: {inference_time}sec")

output_tokens_lst = []
# Print the outputs and collect the number of generated tokens per request.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}\n\nGenerated text: {generated_text!r}")
    output_token = len(output.outputs[0].token_ids)
    output_tokens_lst.append(output_token)

# Throughput = total generated tokens / wall-clock time of the whole batch.
output_tokens = sum(output_tokens_lst)
token_speed = output_tokens / inference_time
print(f"Token Speed: {token_speed} tokens/sec")
```
With `enable_relay_attention = True` (sending 250 prompts), it gives a rate of 1156.6691 tokens/sec, which appears to be an exceptionally high number. I'm uncertain whether this figure is accurate. Without `enable_relay_attention`, the rate drops to 627.076 tokens/sec.
I'm unsure about the validity of these numbers because when I read comparisons of models elsewhere, they typically report rates below 50 tokens/sec, especially with an increase in concurrent requests. Hence, my numbers seem excessively high and may lack significance.
So, what is the correct method for calculating tokens per second, and is the approach I used above correct? If so, what is the appropriate interpretation of the results?
Thanks!!
> With `enable_relay_attention = True` (sending 250 prompts), it gives a rate of 1156.6691 tokens/sec, which appears to be an exceptionally high number. I'm uncertain whether this figure is accurate. Without `enable_relay_attention`, the rate drops to 627.076 tokens/sec.
The speedup of RelayAttention over the baseline looks reasonable.
> I'm unsure about the validity of these numbers because when I read comparisons of models elsewhere, they typically report rates below 50 tokens/sec, especially with an increase in concurrent requests. Hence, my numbers seem excessively high and may lack significance.
>
> So, what is the correct method for calculating tokens per second, and is the approach I used above correct? If so, what is the appropriate interpretation of the results?
I guess the comparisons you mentioned use a different way to calculate the throughput. Typically, the more concurrent requests there are, the higher the throughput should be, due to better utilization of the GPUs. So the throughputs reported there may have been normalized by the batch size (i.e., they are per-request average throughputs). Besides, the model size (e.g., 7B vs. 13B) and the GPU model (e.g., A100 vs. RTX 3090) also affect throughput.
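To make the difference concrete, here is a back-of-the-envelope sketch using the numbers from your run (this is just my guess at how those reports normalize; I haven't checked their exact methodology):

```python
# Aggregate (batch) throughput: total generated tokens / wall-clock time.
# This is what your script measures.
aggregate_tps = 1156.67   # tokens/sec from your 250-prompt run with RelayAttention

# Per-request average throughput: normalize by the number of concurrent requests.
num_requests = 250
per_request_tps = aggregate_tps / num_requests   # ~4.6 tokens/sec per request

print(f"per-request average: {per_request_tps:.2f} tokens/sec")
```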
@rayleizhu I couldn't discern any difference in speed between using `enable_relay_attention = True` and `enable_relay_attention = False`. I am using the same inference code (inference.py) as mentioned in the repo. Both produce output at the same speed (around 50 tokens/sec) in my case, with `tensor_parallel_size=2`.
Could you kindly advise me on what steps I might take to address this issue? Additionally, I would appreciate any guidance on potential implementation details that I might be overlooking.
My code