hibukipanim opened this issue 1 month ago
We have an improved block manager with better test coverage for prefix caching, including tests that compare outputs with prefix caching against outputs without it -- so this case shouldn't happen, and if it does, it should be easier to diagnose the failure. Note that the v2 block manager is not yet optimized for performance.
Can you check whether it still occurs with --use-v2-block-manager?
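For anyone who wants to run a similar check locally, here is a minimal sketch using the offline LLM API, assuming the enable_prefix_caching and use_v2_block_manager engine arguments; the model, prompts, and memory-cleanup steps are placeholders rather than the actual vLLM test code:

```python
import gc
import torch
from vllm import LLM, SamplingParams

# Placeholder prompts with a long shared prefix so prefix caching kicks in.
prompts = ["Common shared prefix. " * 50 + f"Question {i}?" for i in range(4)]
params = SamplingParams(temperature=0.0, max_tokens=64)  # greedy, so outputs are comparable

def run(**engine_kwargs):
    llm = LLM(model="facebook/opt-125m", **engine_kwargs)
    texts = [out.outputs[0].text for out in llm.generate(prompts, params)]
    # Free the engine so the next configuration can allocate GPU memory.
    del llm
    gc.collect()
    torch.cuda.empty_cache()
    return texts

baseline = run()
cached = run(enable_prefix_caching=True, use_v2_block_manager=True)
assert baseline == cached, "prefix caching changed greedy outputs"
```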
Thanks for the reply @cadedaniel.
I tried now with --use-v2-block-manager (version 0.5.0.post1) and it still happens, unfortunately.
Edit: I also tried building the current main branch (commit e2b85cf86a522e734a38b1d0314cfe9625003ef9), where https://github.com/vllm-project/vllm/pull/5364 is already merged, and the issue still happens (also with --use-v2-block-manager).
I also built the branch from https://github.com/vllm-project/vllm/pull/5188, and it doesn't resolve the issue either.
Possible workaround: https://github.com/vllm-project/vllm/issues/5376#issuecomment-2179257676
Thanks @colefranks. I tried it, and the workaround doesn't seem to help, but it does change the behavior. I tried several combinations (all with version 0.5.0.post1).
On the first iteration, the outputs already differ between running with VLLM_ATTENTION_BACKEND=XFORMERS and without it. Even if we assume that's OK, when --enable-prefix-caching is used, the second iteration still differs from the first one.
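For reference, that workaround just forces the xformers attention backend through the VLLM_ATTENTION_BACKEND environment variable before vLLM selects a backend. A minimal sketch (model and prompt are placeholders):

```python
import os

# Must be set before vLLM picks an attention backend.
os.environ["VLLM_ATTENTION_BACKEND"] = "XFORMERS"

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)
out = llm.generate(["Hello, my name is"], SamplingParams(temperature=0.0, max_tokens=32))
print(out[0].outputs[0].text)
```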
Is this issue solved? I meet the same problem: inconsistent completions.
The same thing happened when I replaced the model with OPT-125m and ran inference offline. However, when I inserted torch.manual_seed() (not random.seed()) before generate, the result was correct.
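A minimal sketch of that reseeding workaround with offline inference (model, prompt, and seed are arbitrary placeholders):

```python
import torch
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)
prompts = ["The capital of France is"] * 2
params = SamplingParams(temperature=0.8, max_tokens=32)

# Re-seeding torch before each generate() call made repeated runs consistent
# in this report; random.seed() alone did not.
torch.manual_seed(0)
first = llm.generate(prompts, params)

torch.manual_seed(0)
second = llm.generate(prompts, params)

print(first[0].outputs[0].text == second[0].outputs[0].text)
```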
@hibukipanim @kuangdao @SaltFish11 I solved the problem by changing the Triton code. In the file ../triton/common/build.py, find the compile-flag lines:

    cc, src, f"-I{hip_include_dir}", f"-I{py_include_dir}", f"-I{srcdir}", "-shared", "-fPIC",
    cc, src, "-O3", f"-I{cu_include_dir}", f"-I{py_include_dir}", f"-I{srcdir}", "-shared", "-fPIC", "-lcuda",

and add "-std=c99" after those flags, like this:

    if is_hip():
        ret = subprocess.check_call([
            cc, src, f"-I{hip_include_dir}", f"-I{py_include_dir}", f"-I{srcdir}",
            "-shared", "-fPIC", "-std=c99", f"-L{hip_lib_dir}", "-lamdhip64", "-o", so
        ])
    else:
        cc_cmd = [
            cc, src, "-O3", f"-I{cu_include_dir}", f"-I{py_include_dir}", f"-I{srcdir}",
            "-shared", "-fPIC", "-lcuda", "-std=c99", "-o", so
        ]
        cc_cmd += [f"-L{dir}" for dir in cuda_lib_dirs]
        ret = subprocess.check_call(cc_cmd)
Thanks @bsll, but I'm struggling to understand which triton you mean. There is no such folder in vLLM -- do you mean https://github.com/triton-lang/triton or https://github.com/triton-inference-server/server? I don't see a common/build.py in either.
Thanks @bsll for the workaround. @hibukipanim the location is something like /path/to/miniconda3/envs/vllm/lib/python3.9/site-packages/triton/common/build.py (i.e. the triton Python package installed in your environment).
Your current environment
🐛 Describe the bug
Hi,
Seems that there is a dirty-cache issue with --enable-prefix-caching. We noticed it when internal eval scores degraded significantly when running with --enable-prefix-caching; here I'll show how to reproduce it with a short snippet.
Running 2 vLLM servers:
without prefix caching:
and another with prefix caching:
Then running this snippet:
prints:
This also happens with 0.4.3. With 0.4.2, this snippet crashes the server when prefix caching is enabled.
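A minimal sketch of the kind of comparison described, with placeholder ports, model name, and prompt: query each server twice with the same prompt under greedy sampling and compare the completions.

```python
import requests

# Placeholder endpoints: 8000 = server without prefix caching,
# 8001 = server started with --enable-prefix-caching.
SERVERS = {"no-cache": "http://localhost:8000", "prefix-cache": "http://localhost:8001"}
PROMPT = "Common long shared prefix. " * 20 + "Now answer: what is 2 + 2?"

def complete(base_url: str) -> str:
    resp = requests.post(
        f"{base_url}/v1/completions",
        json={
            "model": "placeholder-model",  # must match the served model name
            "prompt": PROMPT,
            "max_tokens": 32,
            "temperature": 0.0,
        },
    )
    return resp.json()["choices"][0]["text"]

for name, url in SERVERS.items():
    first, second = complete(url), complete(url)
    # With greedy decoding the two iterations should match; with the dirty
    # cache, the second prefix-cached iteration diverges from the first.
    print(name, "consistent" if first == second else "INCONSISTENT")
```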
Hopefully one of these PRs resolves the issue: