vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: prefix-caching: inconsistent completions #5543

Open hibukipanim opened 1 month ago

hibukipanim commented 1 month ago

Your current environment

vLLM version 0.5.0.post1

šŸ› Describe the bug

Hi,

It seems there is a dirty-cache issue with --enable-prefix-caching. We noticed it when internal eval scores degraded significantly while running with --enable-prefix-caching; here I'll show how to reproduce it with a short snippet.

Running 2 vLLM servers with:

without prefix caching:

python -m vllm.entrypoints.openai.api_server --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --port 8001

and another with prefix caching:

python -m vllm.entrypoints.openai.api_server --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --port 8002 --enable-prefix-caching

Then running this snippet:

import string 
import random

import openai

vllms = {
    "no-prefix-caching": "http://localhost:8001/v1",
    "with-prefix-caching": "http://localhost:8002/v1",
}

random.seed(0)
prompts = []
for i in range(16):
    prompts.append(''.join(random.choices(string.ascii_lowercase + string.digits, k=512)))

runs = []
for run in range(2):
    print(f"\nšŸƒ run #{run+1}")

    completions = {k: [] for k in vllms.keys()}
    runs.append(completions)
    for name, endpoint in vllms.items():
        print(f"vLLM {name=}, {endpoint=}")
        client = openai.OpenAI(
            base_url=endpoint,
            api_key="foo"
        )

        for prompt in prompts:
            response = client.completions.create(
                    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
                    prompt=prompt,
                    temperature=0,
                    max_tokens=4,
            )
            completion = response.choices[0].text
            completions[name].append(completion)

        print(f"completions: {completions[name]}")

        if run > 0 and runs[run][name] != runs[run-1][name]:
            print(f"āŒ completions for vLLM {name=} differs from previous run!")

    if completions["with-prefix-caching"] != completions["no-prefix-caching"]:
        print("šŸ›‘ completions differ between with & without prefix")

prints:

šŸƒ run #1
vLLM name='no-prefix-caching', endpoint='http://localhost:8001/v1'
completions: ['6x2w', 'zwg9v', 'xjuwf', 'hu5qw', 'jg0m', '1tzkb', '4w0q', '5zx5', 'zxqj', '7v16', '0ty57', 'vk0j', 'jjnj', 'xw95', 'vxjj', 't6x7']
vLLM name='with-prefix-caching', endpoint='http://localhost:8002/v1'
completions: ['6x2w', 'zwg9v', 'xjuwf', 'hu5qw', 'jg0m', '1tzkb', '4w0q', '5zx5', 'zxqj', '7v16', '0ty57', 'vk0j', 'jjnj', 'xw95', 'vxjj', 't6x7']

šŸƒ run #2
vLLM name='no-prefix-caching', endpoint='http://localhost:8001/v1'
completions: ['6x2w', 'zwg9v', 'xjuwf', 'hu5qw', 'jg0m', '1tzkb', '4w0q', '5zx5', 'zxqj', '7v16', '0ty57', 'vk0j', 'jjnj', 'xw95', 'vxjj', 't6x7']
vLLM name='with-prefix-caching', endpoint='http://localhost:8002/v1'
completions: ['6x2w', 'zwma71', '37wk', 'hu5qw', 'jg0m', '1tzkb', '4h7a', '5zq7', 'zxqj', '7k4n', '0ty57', 'vk0j', 'jjnj', 'xw95', 'vxjj', 't6x7']
āŒ completions for vLLM name='with-prefix-caching' differs from previous run!
šŸ›‘ completions differ between with & without prefix

This also happens with 0.4.3. With 0.4.2, this snippet crashes the server when prefix caching is enabled.

Hopefully one of these PRs resolves the issue 🤞 :

cadedaniel commented 1 month ago

We have an improved block manager with better test coverage for prefix caching, including tests that compare the outputs with prefix caching against those without it -- so this case shouldn't happen, and if it does, we can diagnose the failure more easily. Note the v2 block manager is not yet optimized for performance.

Can you see if it occurs with --use-v2-block-manager?
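
For the setup above, that would mean launching the prefix-caching server along these lines (same model and port as in the original repro):

python -m vllm.entrypoints.openai.api_server --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --port 8002 --enable-prefix-caching --use-v2-block-manager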

hibukipanim commented 1 month ago

Thanks for the reply @cadedaniel. I just tried with --use-v2-block-manager (version 0.5.0.post1) and unfortunately it still happens.

Edit: I also tried building the current main branch (commit e2b85cf86a522e734a38b1d0314cfe9625003ef9), where https://github.com/vllm-project/vllm/pull/5364 is already merged, and the issue still happens (also with --use-v2-block-manager).

hibukipanim commented 1 month ago

I also built the branch from https://github.com/vllm-project/vllm/pull/5188 and it doesn't resolve the issue either.

colefranks commented 3 weeks ago

possible workaround https://github.com/vllm-project/vllm/issues/5376#issuecomment-2179257676
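
Judging from the follow-up below, the linked workaround amounts to forcing the xFormers attention backend via an environment variable; applied to the prefix-caching server from the snippet above, it would look roughly like:

VLLM_ATTENTION_BACKEND=XFORMERS python -m vllm.entrypoints.openai.api_server --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --port 8002 --enable-prefix-caching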

hibukipanim commented 3 weeks ago

Thanks @colefranks. I tried it, and the workaround doesn't seem to help, though it does change the behavior. I tried several combinations (all with version 0.5.0.post1).

On the first iteration, the outputs already differ between running with VLLM_ATTENTION_BACKEND=XFORMERS and without it. Even setting that aside, when --enable-prefix-caching is used, the second iteration still differs from the first one.

kuangdao commented 6 days ago

Has this issue been solved? I'm hitting the same problem: inconsistent completions.

SaltFish11 commented 5 days ago

The same thing happened when I switched the model to OPT-125m and ran offline inference. However, when I inserted torch.manual_seed() (not random.seed) before generate, the results were correct.
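
For illustration, here is a rough offline sketch of that experiment; it is a guess at the setup rather than the commenter's actual script (the model name, the enable_prefix_caching flag, and where the seed goes are all assumptions based on the comment above):

import torch
from vllm import LLM, SamplingParams

# Offline engine with prefix caching enabled (assumed setup, not the commenter's script).
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)
params = SamplingParams(temperature=0, max_tokens=4)
prompts = ["x" * 512]  # placeholder; the report above used 512-char random prompts

torch.manual_seed(0)  # seeding torch (not random) before generate, per the comment
out1 = [o.outputs[0].text for o in llm.generate(prompts, params)]

torch.manual_seed(0)
out2 = [o.outputs[0].text for o in llm.generate(prompts, params)]

print(out1 == out2)  # the comment reports this comes out consistent once torch is seeded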

bsll commented 5 days ago

@hibukipanim @kuangdao @SaltFish11 I solved the problem by changing the Triton code. In the file ../triton/common/build.py, find the two compiler invocations (the ones containing "-shared", "-fPIC" and, for the CUDA path, "-lcuda") and add "-std=c99" after those flags, like this:

if is_hip():
    ret = subprocess.check_call([
        cc, src, f"-I{hip_include_dir}", f"-I{py_include_dir}", f"-I{srcdir}",
        "-shared", "-fPIC", "-std=c99",
        f"-L{hip_lib_dir}", "-lamdhip64", "-o", so
    ])
else:
    cc_cmd = [
        cc, src, "-O3", f"-I{cu_include_dir}", f"-I{py_include_dir}", f"-I{srcdir}",
        "-shared", "-fPIC", "-lcuda", "-std=c99",
        "-o", so
    ]
    cc_cmd += [f"-L{dir}" for dir in cuda_lib_dirs]
    ret = subprocess.check_call(cc_cmd)

hibukipanim commented 3 days ago

Thanks @bsll, but I'm struggling to understand which Triton you mean. There is no such folder in vLLM. Do you mean https://github.com/triton-lang/triton or https://github.com/triton-inference-server/server? I don't see a common/build.py in either.

LLouice commented 2 days ago


Thanks @bsll for the workaround. @hibukipanim, the location is something like /path/to/miniconda3/envs/vllm/lib/python3.9/site-packages/triton/common/build.py
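
If you're not sure where that file lives in your environment, printing the module path should work on Triton versions that still ship triton/common/build.py:

python -c "import triton.common.build as b; print(b.__file__)"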