vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Can vLLM handle concurrent requests with FastAPI? #3248

Closed. Strongorange closed this issue 8 months ago.

Strongorange commented 8 months ago

OS: Ubuntu 20.04 (Google Colab)
GPU: Nvidia T4 15GB, A100 40GB (Google Colab)

import nest_asyncio
from pyngrok import ngrok, conf
import uvicorn
from typing import Union
from fastapi import FastAPI
from langchain_community.llms import VLLM
from vllm import LLM, SamplingParams

app = FastAPI()

# Create a vLLM engine (the offline LLM class, which is synchronous).
llm_v = LLM(model="OrionStarAI/Orion-14B-Chat-Int4", trust_remote_code=True, quantization="AWQ", dtype="half", gpu_memory_utilization=0.8)

@app.get("/chat/a")
def chat():
  # generate_prompt() is a helper defined elsewhere in the notebook; the prompt
  # asks (in Korean) "Do you prefer kimchi dumplings or meat dumplings?"
  prompt = generate_prompt('넌 김치만두가 좋아 고기 만두가 좋아?')
  sampling_params = SamplingParams(temperature=0.4, top_p=0.5, max_tokens=256)
  result = llm_v.generate(prompt, sampling_params)
  print(result[0].outputs[0].text)
  return {"Hello": result[0].outputs[0].text}

ngrok_tunnel = ngrok.connect(8000)
print('Public URL:', ngrok_tunnel.public_url)
nest_asyncio.apply()
uvicorn.run(app, port=8000)

I am testing the OrionStarAI/Orion-14B-Chat-Int4 quantized model with vLLM in a FastAPI environment.

Requests are handled fine one at a time, but if another request arrives before the previous answer has finished generating, the following error occurs.

ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 419, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 758, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 778, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 299, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 79, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 74, in app
    response = await func(request)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 193, in run_endpoint_function
    return await run_in_threadpool(dependant.call, **values)
  File "/usr/local/lib/python3.10/dist-packages/starlette/concurrency.py", line 42, in run_in_threadpool
    return await anyio.to_thread.run_sync(func, *args)
  File "/usr/local/lib/python3.10/dist-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/usr/lib/python3.10/asyncio/futures.py", line 285, in __await__
    yield self  # This tells Task to wait for completion.
  File "/usr/lib/python3.10/asyncio/tasks.py", line 304, in __wakeup
    future.result()
  File "/usr/lib/python3.10/asyncio/futures.py", line 201, in result
    raise self._exception.with_traceback(self._exception_tb)
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "<ipython-input-1-a02ecb59944a>", line 67, in chat
    result = llm_v.generate(prompt, sampling_params)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py", line 182, in generate
    return self._run_engine(use_tqdm)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py", line 208, in _run_engine
    step_outputs = self.llm_engine.step()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 853, in step
    return self._process_model_outputs(output, scheduler_outputs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 756, in _process_model_outputs
    self._process_sequence_group_outputs(seq_group, outputs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 594, in _process_sequence_group_outputs
    parent_child_dict[sample.parent_seq_id].append(sample)
KeyError: 7
ywang96 commented 8 months ago

vLLM provides an OpenAI-compatible API server that you can deploy easily with Docker.

If you would really like to build your own API server to serve concurrent requests, you should be using AsyncLLMEngine, and I would suggest you look at the implementations in https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py and https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/api_server.py to see how to do so.
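
For reference, here is a minimal sketch of what an AsyncLLMEngine-backed FastAPI endpoint could look like, loosely following vllm/entrypoints/api_server.py. The exact AsyncLLMEngine.generate() signature differs between vLLM versions, so treat this as illustrative rather than copy-paste ready; the model name and sampling settings are simply taken from the original post.

from fastapi import FastAPI
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.utils import random_uuid

app = FastAPI()

# Build the async engine once at startup; it interleaves many in-flight
# requests through continuous batching instead of blocking per request.
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(
        model="OrionStarAI/Orion-14B-Chat-Int4",
        trust_remote_code=True,
        quantization="awq",
        dtype="half",
        gpu_memory_utilization=0.8,
    )
)

@app.get("/chat/a")
async def chat(prompt: str):
    sampling_params = SamplingParams(temperature=0.4, top_p=0.5, max_tokens=256)
    request_id = random_uuid()
    # generate() returns an async generator of partial RequestOutputs;
    # iterate to completion and keep the final one.
    final_output = None
    async for request_output in engine.generate(prompt, sampling_params, request_id):
        final_output = request_output
    return {"text": final_output.outputs[0].text}

Because the endpoint is async and the engine schedules requests itself, two overlapping HTTP requests no longer race on the same synchronous LLM object.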

hmellor commented 8 months ago

Closing as @ywang96's answer is correct

shrijayan commented 5 months ago

vLLM provides an OpenAI-compatible API server that you can deploy easily with Docker.

If you would really like to build your own API server to serve concurrent requests, you should be using AsyncLLMEngine, and I would suggest you look at the implementations in https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py and https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/api_server.py to see how to do so.

Should I run this script (openai/api_server.py) to get concurrency and parallel processing of a hundred requests at a time?

hmellor commented 5 months ago

Yes
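
To give a rough idea, this is one way to drive the OpenAI-compatible server with a large batch of concurrent requests. It assumes the server was started separately (for example with python -m vllm.entrypoints.openai.api_server --model OrionStarAI/Orion-14B-Chat-Int4 --trust-remote-code --quantization awq --dtype half); the model name and flags come from the original post, so adjust them to your setup.

import asyncio

from openai import AsyncOpenAI

# vLLM's OpenAI-compatible server listens on port 8000 by default and does not
# require an API key unless one is configured, so any placeholder value works.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def one_request(i: int) -> str:
    completion = await client.completions.create(
        model="OrionStarAI/Orion-14B-Chat-Int4",
        prompt=f"Request {i}: say hello.",
        max_tokens=32,
    )
    return completion.choices[0].text

async def main() -> None:
    # Fire 100 requests at once; the server batches them internally,
    # so no client-side queueing or locking is needed.
    results = await asyncio.gather(*(one_request(i) for i in range(100)))
    print(len(results), "responses received")

asyncio.run(main())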

FaDavid98 commented 4 months ago

How can I implement it in a RAG system?

xKwan commented 4 months ago

I would like to know as well. I am using LlamaIndex Vllm: https://docs.llamaindex.ai/en/stable/api_reference/llms/vllm/

S-Kathirvel commented 3 months ago

I would like to know as well. I am using LlamaIndex Vllm: https://docs.llamaindex.ai/en/stable/api_reference/llms/vllm/

Me too. Did you find any solution?