vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: python3: /project/lib/Analysis/Allocation.cpp:43: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed. Aborted (core dumped) #6723

Closed · linzm1007 closed this issue 1 month ago

linzm1007 commented 1 month ago

🐛 Describe the bug

INFO 07-24 03:31:45 logger.py:36] Received request chat-d9aa01ce9bad4c01a22eb2d07e2c8392: prompt: '<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n你是谁<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=None, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [128000, 128006, 882, 128007, 271, 57668, 21043, 112471, 128009, 128006, 78191, 128007, 271], lora_request: None, prompt_adapter_request: None.
INFO 07-24 03:31:45 async_llm_engine.py:173] Added request chat-d9aa01ce9bad4c01a22eb2d07e2c8392.
python3: /project/lib/Analysis/Allocation.cpp:43: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.
Aborted (core dumped)

linzm1007 commented 1 month ago

Tesla V100-PCIE-32GB

youkaichao commented 1 month ago

can you give the full command and the model you use?

linzm1007 commented 1 month ago

> can you give the full command and the model you use?

python3 -m vllm.entrypoints.openai.api_server \
    --model /data/mlops/model \
    --served-model-name test \
    --tensor-parallel-size 1 \
    --host 0.0.0.0 \
    --port 31000 \
    --trust-remote-code \
    --dtype half

youkaichao commented 1 month ago

which model are you serving?

linzm1007 commented 1 month ago

> can you give the full command and the model you use?

Meta-Llama-3.1-8B-Instruct https://www.modelscope.cn/models/LLM-Research/Meta-Llama-3.1-8B-Instruct/files

youkaichao commented 1 month ago

can you try to use A100? V100 might not be supported for this model.

QwertyJack commented 1 month ago

What a pity!

> V100 might not be supported for this model.

ZhaoShuang1995 commented 1 month ago

Same issue! Using a V100 and vLLM v0.5.3.post1 to serve Meta-Llama-3.1-8B-Instruct.

linzm1007 commented 1 month ago

> Same issue! Using a V100 and vLLM v0.5.3.post1 to serve Meta-Llama-3.1-8B-Instruct.

Try adding --max-model-len 20000; if it still errors, change it to 100 and see.

QwertyJack commented 1 month ago

> Try adding --max-model-len 20000; if it still errors, change it to 100 and see.

It works like a charm! I also tested with --enable-chunked-prefill and it failed with the same mma error, so the trigger is --enable-chunked-prefill!
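
For reference, here is a sketch of the reporter's launch command with the suggested cap applied; the model path, served name, and port are simply the reporter's values from above, and the only change is the added --max-model-len flag:

python3 -m vllm.entrypoints.openai.api_server \
    --model /data/mlops/model \
    --served-model-name test \
    --tensor-parallel-size 1 \
    --host 0.0.0.0 \
    --port 31000 \
    --trust-remote-code \
    --dtype half \
    --max-model-len 20000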

MazarineGlacier commented 1 month ago

This problem probably arises from https://github.com/triton-lang/triton/pull/2627/files. vLLM implements a _fwd_kernel in prefix_prefill.py, which triggers this issue. To solve it, I think we should modify the _fwd_kernel in vllm/vllm/attention/ops/prefix_prefill.py.

> can you try to use A100? V100 might not be supported for this model.

According to Triton, only Ampere GPUs support this mma -> mma layout conversion, so V100 cannot use --enable-prefix-caching or --enable-chunked-prefill.
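
If you are not sure whether your GPU is Ampere, one generic way to check (not from this thread) is to print the CUDA compute capability with PyTorch; Ampere reports major version 8, while a V100 reports (7, 0):

python3 -c "import torch; print(torch.cuda.get_device_capability())"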

youkaichao commented 1 month ago

> According to Triton, only Ampere GPUs support this mma -> mma layout conversion, so V100 cannot use --enable-prefix-caching or --enable-chunked-prefill.

Thanks for the answer! V100 users might need to turn it off explicitly with --enable-chunked-prefill=False.
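
As a sketch of that suggestion, the reporter's command with chunked prefill explicitly disabled would look roughly like this (same paths and ports as above; --enable-prefix-caching is simply not passed, since it reportedly hits the same assertion on V100):

python3 -m vllm.entrypoints.openai.api_server \
    --model /data/mlops/model \
    --served-model-name test \
    --tensor-parallel-size 1 \
    --host 0.0.0.0 \
    --port 31000 \
    --trust-remote-code \
    --dtype half \
    --enable-chunked-prefill=False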

Jeffwan commented 1 week ago

It would be better to improve the docs so users know about this limitation. We encountered the same issue.