Closed linzm1007 closed 1 month ago
Tesla V100-PCIE-32GB
can you give the full command and the model you use?
```shell
python3 -m vllm.entrypoints.openai.api_server \
    --model /data/mlops/model \
    --served-model-name test \
    --tensor-parallel-size 1 \
    --host 0.0.0.0 \
    --port 31000 \
    --trust-remote-code \
    --dtype half
```
which model are you serving?
can you give the full command and the model you use?
Meta-Llama-3.1-8B-Instruct https://www.modelscope.cn/models/LLM-Research/Meta-Llama-3.1-8B-Instruct/files
can you try to use A100? V100 might not be supported for this model.
What a pity!
V100 might not be supported for this model.
Same issue! ! Using v100 and vllm.v0.5.3.post1 serving model: Meta-Llama-3.1-8B-Instruct
Try adding --max-model-len 20000; if it still errors, change it to 100 and see.
It works like a charm! Also tested with --enable-chunked-prefill, but it failed with the same mma error. So, the issue is --enable-chunked-prefill!
This problem probably arises from https://github.com/triton-lang/triton/pull/2627/files . vLLM implements a fwd kernel in prefix_prefill.py, thus triggering this issue. To solve this problem, I think we should modify the _fwd_kernel in vllm/vllm/attention/ops/prefix_prefill.py.
can you try to use A100? V100 might not be supported for this model.
According to triton, only Ampere GPUs can use this "mma->mma feature", so V100 cannot use --enable-prefix-caching and --enable-chunked-prefill.
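The assertion in the log ties the failure to compute capability: Triton's mma -> mma layout conversion needs SM 8.0 (Ampere) or newer, while V100 is SM 7.0. A minimal sketch of such a guard (the helper name is illustrative, not part of vLLM's API; on a live system you would get the tuple from `torch.cuda.get_device_capability()`):

```python
def supports_mma_to_mma(capability):
    """Triton's mma -> mma layout conversion is only supported on
    Ampere (compute capability >= 8.0) GPUs."""
    major, _minor = capability
    return major >= 8

# Tesla V100 is SM 7.0; A100 is SM 8.0
print(supports_mma_to_mma((7, 0)))  # → False: V100 trips the Triton assertion
print(supports_mma_to_mma((8, 0)))  # → True: A100 can use the feature
```

A check like this could gate --enable-prefix-caching and --enable-chunked-prefill automatically instead of crashing at request time.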
Thanks for the answer! V100 users might need to explicitly turn it off with --enable-chunked-prefill=False.
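For example, extending the launch command from earlier in the thread (paths and port are from the reporter's setup; whether a given vLLM release accepts the `=False` form may vary):

```shell
python3 -m vllm.entrypoints.openai.api_server \
    --model /data/mlops/model \
    --served-model-name test \
    --tensor-parallel-size 1 \
    --host 0.0.0.0 \
    --port 31000 \
    --trust-remote-code \
    --dtype half \
    --enable-chunked-prefill=False
```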
It would be better to improve the docs so users know about this limitation. We encountered the same issue.
Your current environment
🐛 Describe the bug
```
INFO 07-24 03:31:45 logger.py:36] Received request chat-d9aa01ce9bad4c01a22eb2d07e2c8392: prompt: '<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n你是谁<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=None, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [128000, 128006, 882, 128007, 271, 57668, 21043, 112471, 128009, 128006, 78191, 128007, 271], lora_request: None, prompt_adapter_request: None.
INFO 07-24 03:31:45 async_llm_engine.py:173] Added request chat-d9aa01ce9bad4c01a22eb2d07e2c8392.
python3: /project/lib/Analysis/Allocation.cpp:43: std::pair<llvm::SmallVector, llvm::SmallVector > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.
Aborted (core dumped)
```