vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

(Async) Batch request, OpenAI API server #1636

Closed AIApprentice101 closed 7 months ago

AIApprentice101 commented 8 months ago

The LangChain website states that VLLMOpenAI supports both batching and async batching, but I can't get it working. Can you help? Thank you.

A minimal example:

from langchain.llms import VLLMOpenAI

llm = VLLMOpenAI(
    openai_api_key="EMPTY",
    openai_api_base="http://localhost:8000/v1",
    model_name="TheBloke/Llama-2-7B-Chat-AWQ",
)
# run inside an async event loop (e.g. a notebook, or via asyncio.run)
results = await llm.abatch(["Write me three sentences about AI."] * 10)

What I get is "prompts in a list are not supported".

juliaabrams commented 8 months ago

The error specifically states that "prompts in a list are not supported." It implies that passing a list of prompts, as you're doing with ["Write me three sentences about AI."] * 10, might not be allowed.

To use batching with LangChain's VLLMOpenAI, you might need to provide multiple prompts individually rather than as a single list. Here's an example of how you might modify your code:

import asyncio

from langchain.llms import VLLMOpenAI

async def main():
    llm = VLLMOpenAI(
        openai_api_key="YOUR_OPENAI_API_KEY",
        openai_api_base="http://localhost:8000/v1",
        model_name="TheBloke/Llama-2-7B-Chat-AWQ",
    )

    prompts = ["Write me three sentences about AI."] * 10
    results = []

    # Send each prompt individually instead of passing the whole list at once.
    for prompt in prompts:
        result = await llm.abatch([prompt])
        results.append(result)

    print(results)

# Make sure to run main() in an async event loop.
loop = asyncio.get_event_loop()
loop.run_until_complete(main())

This code snippet keeps the same prompt repetition but sends each prompt individually to the abatch() method. This approach might help avoid the error about unsupported prompt lists.

Remember to replace "YOUR_OPENAI_API_KEY" with your actual OpenAI API key for this code to work correctly.

AIApprentice101 commented 8 months ago

@juliaabrams thank you for the workaround. But my question is more on offline batching.

simon-mo commented 8 months ago

It looks like LangChain is using the OpenAI API to perform a batched request. Do you know whether this is standard OpenAI behavior? I'm having trouble finding it documented.

Currently, vLLM's internals support batching, and we expose it in two ways: the offline interface (LLM.generate), which takes a list of prompts, or the online OpenAI-compatible server, which batches multiple concurrent individual requests.
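
For reference, a minimal sketch of the offline path (the sampling parameters here are placeholders):

from vllm import LLM, SamplingParams

llm = LLM(model="TheBloke/Llama-2-7B-Chat-AWQ", quantization="awq")
sampling_params = SamplingParams(temperature=0.8, max_tokens=128)

# LLM.generate takes the whole prompt list and batches it internally.
prompts = ["Write me three sentences about AI."] * 10
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)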

jpeig commented 8 months ago

@simon-mo

Using multiple individual requests via the API by calling llm.generate() is significantly slower than adding the requests from a prompt list.

This is how the engine itself (https://github.com/vllm-project/vllm/blob/main/examples/llm_engine_example.py) handles batching, and it's more efficient:

from typing import List

from vllm import LLM, RequestOutput, SamplingParams

def batch_generate(llm: LLM, prompts: List[str], sampling_params: SamplingParams) -> List[str]:
    engine = llm.llm_engine
    request_id = 0
    results = []

    # Feed prompts into the engine; it batches all in-flight requests each step.
    while prompts or engine.has_unfinished_requests():
        if prompts:
            prompt = prompts.pop(0)
            engine.add_request(str(request_id), prompt, sampling_params)
            request_id += 1

        request_outputs: List[RequestOutput] = engine.step()

        for request_output in request_outputs:
            if request_output.finished:
                results.append(request_output.outputs[0].text)

    return results
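
For illustration, a hypothetical call to the helper above (model name and sampling parameters are placeholders):

llm = LLM(model="TheBloke/Llama-2-7B-Chat-AWQ", quantization="awq")
sampling_params = SamplingParams(temperature=0.8, max_tokens=128)
texts = batch_generate(llm, ["Write me three sentences about AI."] * 10, sampling_params)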
simon-mo commented 8 months ago

I think we are talking about different APIs here?

If you send multiple concurrent requests, you can observe that they are not executed sequentially, rather, the decode happens together.

Here's a demo showing batching of concurrent requests:

https://github.com/vllm-project/vllm/assets/21118851/77254716-4a11-450f-b583-c79ae294fcad
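
As a minimal sketch of this pattern (using the openai>=1.0 async client, which the thread does not itself use; model name and max_tokens are placeholders), concurrent individual requests let the server batch the decode steps:

import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

async def main():
    prompts = ["Write me three sentences about AI."] * 10
    # Individual requests sent concurrently; vLLM batches them server-side.
    completions = await asyncio.gather(*[
        client.completions.create(
            model="TheBloke/Llama-2-7B-Chat-AWQ",
            prompt=p,
            max_tokens=128,
        )
        for p in prompts
    ])
    for c in completions:
        print(c.choices[0].text)

asyncio.run(main())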

jpeig commented 8 months ago

Awesome - I may need to alter my tests in that case / the way I approach the API.

simon-mo commented 7 months ago

I'm closing this as a duplicate of #1707. Please refer to that issue for a deeper explanation, but the TL;DR is that AWQ is not yet optimized. It still works for low-throughput use cases, delivering lower latency and memory savings.

You should also see this warning in the output; what you are observing is its effect:

WARNING 12-01 08:25:34 config.py:140] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
nlpkiddo-2001 commented 6 months ago

Hi, I am new to vLLM. I need to make batch calls in vLLM, e.g. prompts = ["Once upon a time .."] * 10

Does vLLM have native support for this? If so, is it as good an approach as sending individual requests concurrently? What would be the tradeoff here?

Thanks in advance.

rbgo404 commented 3 months ago

Hey @simon-mo, please bear with this question. Are you saying that if I send multiple concurrent requests to the async OpenAI client, vLLM will batch those requests together? Also, I want to see how the batch size impacts the TTFT and TPS; for that I am using the benchmark_serving.py script.

There I can limit the request rate, and the request rate controls how many concurrent requests I send to the OpenAI client together. Does the request rate mean batch size here?

roger-creus commented 3 months ago

I am having the same experience with the following code. Using the OpenAI API like that serves the requests sequentially, one at a time, very slowly (even using 4 GPUs with a small model like Llama 7B). Is there any way currently to benefit from batched prompts? @simon-mo

import asyncio
import openai
from aiohttp import ClientSession

openai.api_base = '<my_local_host>'
openai.api_key = '<my_token>'

model_id = openai.Model.list()['data'][0]['id']
print(f'Model ID: {model_id}')

test_prompts = [
    "Describe the lifecycle of a butterfly.",
    "How do magnets work?",
    "Explain the theory of relativity in simple terms.",
    "What's the significance of the Mona Lisa?",
    "How does photosynthesis work?",
    "Compare and contrast Shakespeare's sonnets.",
    "Why is the sky blue during daytime?",
    "Discuss the importance of the Gutenberg press.",
    "What's the difference between mitosis and meiosis?",
    "How did the pyramids of Egypt get built?",
    "Summarize the plot of 'Moby Dick'.",
    "Describe the key principles of Renaissance art.",
    "How do volcanoes form?",
    "Explain the water cycle.",
    "Why do we dream?",
    "How does the internet work?",
    "Discuss the impact of the industrial revolution.",
    "What causes tides?",
    "Describe the events leading up to World War I.",
    "How does a microwave oven work?",
    "Why do we have leap years?",
    "Explain the concept of black holes.",
    "How are rainbows formed?",
    "Discuss the cultural impact of The Beatles.",
    "What's the role of mitochondria in a cell?",
    "Why is biodiversity important?",
    "Describe the process of fermentation.",
    "How did the Roman Empire fall?",
    "Explain the basics of quantum mechanics.",
    "What are the benefits of reading literature?",
    "How do planes fly?",
    "Why is gold valuable?",
    "Discuss the main tenets of Buddhism.",
    "How does the human eye work?",
    "Explain the process of nuclear fusion.",
    "What are the main causes of global warming?",
    "Describe the plot of 'Pride and Prejudice'.",
    "What's the difference between acids and bases?",
    "How are pearls formed?",
    "Discuss the impact of social media on society.",
    "Why do cats purr?",
    "Explain the concept of supply and demand.",
    "How does a compass work?",
    "What's the significance of the Eiffel Tower?",
    "Describe the history of the English language.",
    "Why do apples turn brown when cut?",
    "How were the Grand Canyons formed?",
    "Explain the principles of democracy.",
    "What are the pros and cons of nuclear energy?",
    "How does a bicycle stay upright?",
    "Discuss the themes in 'To Kill a Mockingbird'.",
    "What's the function of the heart in the human body?",
    "How does electricity get generated?",
    "Describe the moon's effect on Earth.",
    "What is the importance of the Amazon rainforest?",
    "Why do we get goosebumps?",
    "Explain the significance of the Magna Carta.",
    "How do clouds form?",
    "Discuss the legacy of Martin Luther King Jr.",
    "What's the difference between prokaryotic and eukaryotic cells?",
    "Why is the Dead Sea so salty?",
    "Describe the process of digestion.",
    "How did chocolate become popular worldwide?",
    "Explain the basics of artificial intelligence.",
    "Why do we have different time zones?",
    "What's the role of bees in an ecosystem?",
    "How does photosynthesis benefit animals?",
    "Discuss the cultural impact of jazz music.",
    "Explain the concept of gravity.",
    "Why do seasons change?",
    "Describe the symbolism in 'The Great Gatsby'.",
    "How do tsunamis occur?",
    "What are the key principles of communism?",
    "What's the importance of vaccination?",
    "Why does ice float on water?",
    "Explain how the printing press works.",
    "Discuss the history of tea and its global impact.",
    "How does the respiratory system function?",
    "Describe the origins of the Olympic Games.",
    "Why is recycling important?",
    "Explain the phenomenon of aurora borealis.",
    "What are the causes and effects of ozone depletion?",
    "Discuss the themes in 'Romeo and Juliet'.",
    "How do satellites orbit the Earth?",
    "Describe the formation of fossils.",
    "What's the role of the judiciary in a democracy?",
    "Why is the Great Wall of China significant?",
    "Explain the process of osmosis.",
    "What's the impact of the Silk Road in history?",
    "How does a computer's CPU work?",
    "Why do birds migrate?",
    "Discuss the impact of the French Revolution.",
    "What's the significance of Newton's three laws?",
    "Describe the history of the piano.",
    "How do we perceive colors?",
    "Explain the workings of a thermos.",
    "Why do dogs wag their tails?"
]

async def create_chat_completion(prompt: str, index=[1]):
    while 1:
        response = await openai.ChatCompletion.acreate(
            model=model_id,
            temperature=0.5,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}]
        )

        output = response.choices[0].message.content
        if output and output.rstrip():
            break

        await asyncio.sleep(1.0)
        print('RETRY')

    print(f'Q #{index[0]}: {prompt}')
    print(f'A #{index[0]}: {output.lstrip()}')
    print()

    index[0] += 1
    return output

async def main():
    session = ClientSession()
    openai.aiosession.set(session)

    tasks = [create_chat_completion(prompt) for prompt in test_prompts]
    pool = set()
    results = []
    while tasks or pool:
        if len(pool) < 32 and tasks:
            task = tasks.pop()
            pool.add(asyncio.create_task(task))
        done, pool = await asyncio.wait(pool, return_when=asyncio.FIRST_COMPLETED)
        for task in done:
            results.append(task.result())

    await openai.aiosession.get().close()

if __name__ == '__main__':
    asyncio.run(main())
simhallq commented 2 months ago

> I am having the same experience with the following code. Using the OpenAI API like that serves the requests sequentially, one at a time, very slowly (even using 4 GPUs with a small model like Llama 7B). Is there any way currently to benefit from batched prompts? @simon-mo

Same for me.

Even running batched requests using the OpenAI client.completions.create results in a 2-3x slowdown compared to running offline inference.

from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

prompt = """
        ### Instructions:
        My system prompt
        ### Input:
        {input}

        ### Response:
        """.strip()

data = ["This is a test string.", "This is a second test string."]
# data holds raw strings, so format them in directly.
inputs = [prompt.format(input=s) for s in data]
inputs = inputs * 2000

def run_request(inputs):
    # The completions endpoint accepts a list of prompts in a single request.
    return client.completions.create(
        model="my-model",
        prompt=inputs,
    )

run_request(inputs)
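
For comparison, a minimal sketch of the offline baseline being measured against ("my-model" is a placeholder as above, and the sampling parameters are assumptions):

from vllm import LLM, SamplingParams

llm = LLM(model="my-model")  # placeholder model name
sampling_params = SamplingParams(max_tokens=256)

prompts = ["This is a test string.", "This is a second test string."] * 2000

# One generate() call; vLLM batches the whole list internally.
outputs = llm.generate(prompts, sampling_params)
print(outputs[0].outputs[0].text)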

Any updates on this issue? @simon-mo

Thanks.