rayleizhu / vllm-ra

[ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts
https://arxiv.org/abs/2402.14808
Apache License 2.0

No Observable Speed Difference Found #1

Open daniyal214 opened 8 months ago

daniyal214 commented 8 months ago

@rayleizhu I couldn't discern any difference in speed between using enable_relay_attention = True and enable_relay_attention = False.

I am using the same inference code (inference.py) as provided in the repo. Both settings produce output at the same speed (around 50 tokens/sec) in my case with tensor_parallel_size=2.

Could you kindly advise me on what steps I might take to address this issue? Additionally, I would appreciate any guidance on potential implementation details that I might be overlooking.

My code

from vllm import LLM, SamplingParams
import time

system_prompt = """You are a knowledgeable and friendly travel advisor, dedicated to assisting customers in planning their dream vacations. For customer inquiries, provide recommendations faithfully according to the documents provided here. When suggesting travel destinations or accommodations, ensure to include relevant details and attach links to the recommended options.

Available travel destinations and accommodations are listed below:
<DOC>
Destination, Type, Price per night
Paris, Hotel, $200
Maui, Resort, $350
Tokyo, Airbnb, $150
Rome, Bed and Breakfast, $180
Bali, Villa, $250
</DOC>

Here are some real examples of successful recommendations:
<DOC>
A couple celebrating their anniversary wanted a romantic getaway. You suggested a boutique hotel in Florence, Italy, known for its charming ambiance and exquisite cuisine. The couple was delighted with the recommendation and booked their stay immediately.
A group of friends planning a budget-friendly trip sought your advice. You recommended a cozy cottage in the Scottish Highlands, offering picturesque views and affordable rates. The friends were thrilled with the suggestion and booked their trip for the following spring.
A solo traveler was looking for an adventurous experience. You suggested a jungle lodge in Costa Rica, offering eco-friendly accommodations and exciting excursions. The traveler appreciated the recommendation and booked their trip for the upcoming winter.
A family of five with young children wanted a beach vacation. You recommended a spacious beachfront condo in Destin, Florida, equipped with amenities for families and close to kid-friendly attractions. The family was thrilled with the suggestion and booked their stay for the summer holidays.
A group of colleagues planning a corporate retreat approached you for recommendations. You suggested a luxury villa in Tuscany, Italy, offering privacy, stunning views, and space for team-building activities. The colleagues were impressed with the suggestion and booked their retreat for the following fall.
</DOC>"""

prompts = [
    "I'm planning a family vacation for the summer. We're a family of four with two kids aged 8 and 10. Our budget is around $300 per night. Can you suggest some suitable destinations?",
    # "My partner and I are looking for a romantic getaway for our anniversary. We want somewhere with beautiful scenery and luxury accommodations. What do you recommend?",
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=1024)
mode = 'relay'

model = '/home/jovyan/model_llama_7B/models--meta-llama--Llama-2-7b-chat-hf/snapshots/Llama-2-7b-chat-hf'
quant = None
enforce_eager = False

B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>", "<</SYS>>"

sys_schema = "[INST] <<SYS>>\n{__SYS_PROMPT}\n<</SYS>>\n\n{__USR_PROMPT} [/INST]"
sys_schema_file = None
sys_prompt_file = None

if mode == 'concat':
    # Create an LLM.
    print("CONCAT>>>>>")
    llm = LLM(model=model, quantization=quant, enforce_eager=enforce_eager,
              tensor_parallel_size=2,
              enable_relay_attention=False,
              sys_prompt=system_prompt,
              sys_schema=sys_schema,
              sys_prompt_file=sys_prompt_file,
              sys_schema_file=sys_schema_file)
elif mode == 'relay':
    print("RELAy>>>>>")
    llm = LLM(model=model, quantization=quant, enforce_eager=enforce_eager,
              tensor_parallel_size=2,
              enable_relay_attention=True,
              sys_prompt=system_prompt,
              sys_schema=sys_schema,
              sys_prompt_file=sys_prompt_file,
              sys_schema_file=sys_schema_file)
else:
    raise ValueError(f'unknown mode {mode}')

print("STARTED>>>")
start_time = time.time()
outputs = llm.generate(prompts, sampling_params)
end_time = time.time()
inference_time = end_time - start_time
print(f"Time Taken: {inference_time}sec")

output_tokens_lst = []
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}\n\nGenerated text: {generated_text!r}")
    output_token = len(output.outputs[0].token_ids)
    output_tokens_lst.append(output_token)

output_tokens = sum(output_tokens_lst)
token_speed = output_tokens/inference_time
print(f"Token Speed: {token_speed} tokens/sec")
print("---------------------------------------------------------")
print()
print()
rayleizhu commented 8 months ago

RelayAttention is designed for batched LLM serving. It is useful when there are many concurrent requests (e.g., in a cloud serving scenario, imagine you are OpenAI or Microsoft, hosting LLMs for many users).

See Section 5 (Limitations and Future Works) in the paper.

To test it, just repeat your prompts many times (e.g., prompts = prompts * 32), and then you should see the difference.

By the way, the example (Figure 2) in my paper is just for demonstration purposes; labels like <DOC> and </DOC> may not be recognizable by the models you are using.
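
A minimal sketch of such a batched comparison, reusing the constructor arguments from the script above (not an official benchmark from this repo; in practice it is safer to run each mode in a separate process rather than hold two engines in GPU memory at once):

# Sketch: repeat the prompts to simulate many concurrent requests sharing one
# long system prompt, then time both modes. Assumes `model`, `quant`,
# `enforce_eager`, `system_prompt`, `sys_schema`, `sys_prompt_file`,
# `sys_schema_file`, `prompts`, and `sampling_params` are defined as in the
# script earlier in this thread.
import time
from vllm import LLM

batched_prompts = prompts * 32  # 32 concurrent requests with a shared system prompt

for relay in (False, True):
    llm = LLM(model=model, quantization=quant, enforce_eager=enforce_eager,
              tensor_parallel_size=2,
              enable_relay_attention=relay,
              sys_prompt=system_prompt,
              sys_schema=sys_schema,
              sys_prompt_file=sys_prompt_file,
              sys_schema_file=sys_schema_file)
    start = time.time()
    outputs = llm.generate(batched_prompts, sampling_params)
    elapsed = time.time() - start
    total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"enable_relay_attention={relay}: "
          f"{total_tokens / elapsed:.1f} tokens/sec (aggregate)")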

rayleizhu commented 8 months ago

BTW, I'm also working on optimizations for device-side LLM inference; please stay tuned.

daniyal214 commented 8 months ago

Got it, thank you @rayleizhu for the response. I'll test it with many prompts. However, one concern arises: when I execute, say, prompts = prompts * 32 and then pass it to outputs = llm.generate(prompts, sampling_params), will this result in concurrent requests, or are the requests processed sequentially?

rayleizhu commented 8 months ago

> Got it, thank you @rayleizhu for the response. I'll test it with many prompts. However, one concern arises: when I execute, say, prompts = prompts * 32 and then pass it to outputs = llm.generate(prompts, sampling_params), will this result in concurrent requests, or are the requests processed sequentially?

vLLM will schedule the requests to maximize concurrency. If you have enough GPU memory, it will run inference with a batch size as large as possible.
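
In other words (an illustrative sketch, not taken from the repo): pass the whole list to a single generate() call and let the scheduler batch the requests, rather than looping over them one at a time.

# Concurrent: vLLM's scheduler batches as many requests as fit in GPU memory.
outputs = llm.generate(prompts, sampling_params)

# Sequential (much slower for many prompts): each call waits for the previous
# one to finish, so requests are processed one at a time.
# outputs = [llm.generate([p], sampling_params)[0] for p in prompts]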

daniyal214 commented 8 months ago

> > Got it, thank you @rayleizhu for the response. I'll test it with many prompts. However, one concern arises: when I execute, say, prompts = prompts * 32 and then pass it to outputs = llm.generate(prompts, sampling_params), will this result in concurrent requests, or are the requests processed sequentially?
>
> vLLM will schedule the requests to maximize concurrency. If you have enough GPU memory, it will run inference with a batch size as large as possible.

Thanks!

daniyal214 commented 8 months ago

Hi @rayleizhu, I measured the speed with both enable_relay_attention = True and enable_relay_attention = False.

I can see the difference. But I want to understand whether this is the correct way to calculate tokens/sec, because, as you said, vLLM concatenates all incoming requests into one batched request at whatever token each is at and sends them through together, so it is effectively parallel request handling.

I'm calculating like this:

model = 'meta-llama/Llama-2-7b-chat-hf'

llm = LLM(model=model, quantization=None, enforce_eager=False,
          tensor_parallel_size=2,
          enable_relay_attention=True,
          sys_prompt=system_prompt,
          sys_schema=sys_schema,
          sys_prompt_file=sys_prompt_file,
          sys_schema_file=sys_schema_file)

print("STARTED>>>")
start_time = time.time()
outputs = llm.generate(prompts, sampling_params)
end_time = time.time()
inference_time = end_time - start_time
print(f"Time Taken: {inference_time}sec")

output_tokens_lst = []
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}\n\nGenerated text: {generated_text!r}")
    output_token = len(output.outputs[0].token_ids)
    output_tokens_lst.append(output_token)

output_tokens = sum(output_tokens_lst)
token_speed = output_tokens/inference_time
print(f"Token Speed: {token_speed} tokens/sec")

With enable_relay_attention = True (sending 250 prompts), it gives me a rate of 1156.6691 tokens/sec, which appears to be exceptionally high. I'm uncertain whether this figure is accurate. Without relay attention, the rate drops to 627.076 tokens per second.

I'm unsure about the validity of these numbers because the model comparisons I have read elsewhere typically report rates below 50 tokens per second, especially as the number of concurrent requests increases. Hence, my numbers seem excessively high and may not be meaningful.

So, what is the correct method for calculating tokens per second, and is the approach I used above correct? If so, what is the appropriate interpretation of the results?

Thanks!!

rayleizhu commented 7 months ago

> With enable_relay_attention = True (sending 250 prompts), it gives me a rate of 1156.6691 tokens/sec, which appears to be exceptionally high. I'm uncertain whether this figure is accurate. Without relay attention, the rate drops to 627.076 tokens per second.

The speedup of RelayAttention over the baseline looks reasonable.

> I'm unsure about the validity of these numbers because the model comparisons I have read elsewhere typically report rates below 50 tokens per second, especially as the number of concurrent requests increases. Hence, my numbers seem excessively high and may not be meaningful.

> So, what is the correct method for calculating tokens per second, and is the approach I used above correct? If so, what is the appropriate interpretation of the results?

I guess the blog you mentioned uses a different way to calculate throughput. Typically, the more concurrent requests there are, the higher the throughput, due to better utilization of the GPUs. Therefore, the throughputs reported in the blog may have been normalized by the batch size (i.e., they report per-request average throughput). Besides, the model size (e.g., 7B or 13B) and the GPU model (e.g., A100 or RTX 3090) also affect throughput.
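
As a rough back-of-the-envelope check using the numbers reported earlier in this thread (assuming the figures you saw elsewhere are normalized per request):

# Aggregate throughput reported above, in tokens/sec.
aggregate_relay = 1156.6691    # enable_relay_attention = True
aggregate_baseline = 627.076   # enable_relay_attention = False
num_requests = 250             # prompts sent in one batch

# Per-request average throughput, i.e., aggregate normalized by batch size.
print(f"relay:    {aggregate_relay / num_requests:.2f} tokens/sec per request")    # ~4.63
print(f"baseline: {aggregate_baseline / num_requests:.2f} tokens/sec per request") # ~2.51

Both per-request figures land well below 50 tokens/sec, which would be consistent with the normalized numbers reported elsewhere.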