I am learning FastAPI and vLLM and am trying to build my own LLM API server. But when I compare my server against `vllm serve` using the test below, `vllm serve` shows much higher inference throughput.
My API server and test code are shown below. Which part of my FastAPI code should I improve?
Or how can I see how requests are handled inside `vllm serve`?
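The test script reads prompts from `./dataset.jsonl`, one JSON object per line; each line is shaped like `{"messages": [{"role": "user", "content": "..."}]}` (an illustrative structure inferred from the loader below, not my real data).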
test.py

```python
import asyncio
import json
import random
import time

import httpx

# Address of the server under test (my FastAPI server or vllm serve).
host = "localhost"
port = 8000


def load_queries_from_file(file_path):
    """Read one user prompt per line from a JSONL file."""
    queries = []
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            data = json.loads(line)
            user_message = data.get("messages", [{}])[0].get("content")
            if user_message:
                queries.append(user_message)
    return queries


queries = load_queries_from_file('./dataset.jsonl')


async def send_query(query, semaphore):
    """Send one chat-completion request, limited by the semaphore."""
    async with semaphore:
        async with httpx.AsyncClient() as client:
            try:
                response = await client.post(
                    f"http://{host}:{port}/v1/chat/completions",
                    json={
                        "model": "model",
                        "messages": [{"role": "user", "content": query}],
                        "temperature": 0.7,
                    },
                    timeout=120,
                )
            except Exception as e:
                print(f"Error: {e}")
            else:
                print(f"Query: {query}, Response: {response.text}")


async def test_concurrent_users(num_users, max_concurrent=100):
    """Fire num_users requests with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)
    tasks = []
    for _ in range(num_users):
        query = random.choice(queries)
        tasks.append(asyncio.create_task(send_query(query, semaphore)))

    test_start_time = time.perf_counter()
    await asyncio.gather(*tasks)
    test_end_time = time.perf_counter()

    total_time = test_end_time - test_start_time
    # Throughput: completed requests per second over the whole run.
    requests_per_sec = num_users / total_time if total_time > 0 else float('inf')
    print(f"\n--- total_time: {total_time:.2f}s ---")
    print(f"--- throughput: {requests_per_sec:.2f} req/s ---")


async def main():
    user_counts = [100]
    for count in user_counts:
        print(f"\n--- test {count} users ---")
        await test_concurrent_users(num_users=count, max_concurrent=100)


if __name__ == "__main__":
    asyncio.run(main())
```
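For reference, this is my rough understanding of why `vllm serve` is so much faster: every request is handed to a single async engine that does continuous batching, so many in-flight requests share the same forward passes instead of each request blocking a worker. Below is a minimal sketch of that pattern, not my actual api_server.py; it assumes vLLM's `AsyncLLMEngine`, `AsyncEngineArgs`, and `SamplingParams` APIs, and the model path and route name are placeholders.

```python
# sketch_server.py -- NOT my real api_server.py; just the async-engine pattern
# I believe vllm serve is built on (continuous batching across requests).
import uuid

from fastapi import FastAPI
from pydantic import BaseModel
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

app = FastAPI()
# Placeholder model path -- replace with the model actually being served.
engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(model="model"))


class GenerateRequest(BaseModel):
    prompt: str
    temperature: float = 0.7
    max_tokens: int = 256


@app.post("/generate")
async def generate(req: GenerateRequest):
    params = SamplingParams(temperature=req.temperature, max_tokens=req.max_tokens)
    request_id = str(uuid.uuid4())
    final_output = None
    # engine.generate() is an async generator; awaiting it lets the engine
    # batch this request together with every other in-flight request.
    async for output in engine.generate(req.prompt, params, request_id):
        final_output = output
    return {"text": final_output.outputs[0].text}
```

I would run something like this with `uvicorn sketch_server:app --port 8000` and point the same test.py at it. Is this roughly what the built-in OpenAI-compatible server does, and is this the direction I should refactor my FastAPI server toward?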
How would you like to use vllm
No response
Before submitting a new issue...
[X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.