vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed. #2729

Open gty111 opened 7 months ago

gty111 commented 7 months ago

When executing the script examples/offline_inference_with_prefix.py, it calls context_attention_fwd from vllm.model_executor.layers.triton_kernel.prefix_prefill, which triggers the following error:

python: /project/lib/Analysis/Allocation.cpp:40: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed.

Platform :

related to #1669
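For reference, a minimal script in the spirit of examples/offline_inference_with_prefix.py (model name, prompts, and prefix below are illustrative, and this assumes the prefix_pos-based generate() API that the example used at the time) would be roughly:

```python
# Hypothetical minimal reproduction sketch, not the original script:
# a shared prefix plus prefix_pos makes vLLM reuse the prefix KV cache,
# which routes prefill through context_attention_fwd (the Triton kernel
# that hits the assertion on pre-Ampere GPUs such as V100).
from vllm import LLM, SamplingParams

prefix = "You are a helpful assistant. Answer the question briefly. "
questions = ["What is the capital of France?", "What is 2 + 2?"]
prompts = [prefix + q for q in questions]

llm = LLM(model="facebook/opt-125m")  # placeholder model
params = SamplingParams(temperature=0.0, max_tokens=32)

# Length of the shared prefix in tokens (must cover at least one
# 16-token cache block for the prefix to actually be cached).
prefix_len = len(llm.get_tokenizer()(prefix).input_ids)

outputs = llm.generate(prompts, params, prefix_pos=[prefix_len] * len(prompts))
for out in outputs:
    print(out.outputs[0].text)
```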

eric8607242 commented 6 months ago

Same problem! Have you addressed this issue?

gty111 commented 6 months ago

Not yet.

ikushare commented 5 months ago

Same problem! Have you addressed this bug?

ucasAliy commented 5 months ago

Same issue. With the latest version of vLLM, it says V100 is not supported. Have you found a workaround for this problem?

ucasAliy commented 5 months ago

But when setting prefix_pos <= 15, it runs.

gty111 commented 5 months ago

> Same issue. With the latest version of vLLM, it says V100 is not supported. Have you found a workaround for this problem?

Not yet.

> But when setting prefix_pos <= 15, it runs.

Since the block size is 16, vLLM won't cache the prefix when prefix_pos <= 15.
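In other words (a rough sketch of the arithmetic, not vLLM's actual code: the helper name below is made up), the cached prefix length is rounded down to whole 16-token blocks, so any prefix shorter than one block caches nothing and the crashing kernel is never reached:

```python
# Illustrative only: how a block-aligned prefix length behaves.
BLOCK_SIZE = 16

def cached_prefix_tokens(prefix_pos: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of prefix tokens that fit into whole KV-cache blocks."""
    return (prefix_pos // block_size) * block_size

for pos in (8, 15, 16, 17, 33):
    print(pos, "->", cached_prefix_tokens(pos))
# 8 -> 0, 15 -> 0   (no blocks cached, so the prefix kernel is skipped)
# 16 -> 16, 17 -> 16, 33 -> 32
```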

Qinyu-Xu commented 5 months ago

I am using an A10 GPU. Upgrading Triton from 2.1.0 to 2.2.0 solved my problem. After reading this issue (https://github.com/openai/triton/issues/1298), I found that Triton has already removed this assertion in the newest version.
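A quick way to confirm which Triton build the environment actually resolves to, before and after upgrading (plain module introspection, nothing vLLM-specific):

```python
# Print the Triton version seen by Python; upgrade with e.g.
#   pip install "triton>=2.2.0"
# Whether the upgrade removes the assertion appears to depend on the GPU
# generation (it helped on A10 in this thread, but not on V100).
import triton
print("triton", triton.__version__)
```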

gty111 commented 5 months ago

> I am using an A10 GPU. Upgrading Triton from 2.1.0 to 2.2.0 solved my problem. After reading this issue (https://github.com/openai/triton/issues/1298), I found that Triton has already removed this assertion in the newest version.

Does not work for me.

wenqf11 commented 5 months ago

Same issue on V100. Any update on supporting V100?

matthieu-zimmer commented 5 months ago

Same problem with V100. Is there a way to rely on the paged attention kernel instead of context_attention_fwd, @caoshiyi?

Otherwise, this might be a solution: https://github.com/openai/triton/issues/1420#issuecomment-1485564996

matthieu-zimmer commented 3 months ago

As suggested by Jokeren, storing the temporary values to global memory and then reloading them works on V100 with the latest Triton version.
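For anyone wondering what that looks like in practice, here is a toy Triton sketch (not the vLLM kernel; the kernel name, shapes, and scratch buffer are made up) of the pattern: the result of the first tl.dot is stored to a global scratch buffer and reloaded before the second tl.dot, so the compiler never has to convert one MMA layout directly into another.

```python
# Sketch of the "stage through global memory" workaround on pre-Ampere
# GPUs such as V100; illustrative only, not vLLM's prefix_prefill kernel.
import torch
import triton
import triton.language as tl


@triton.jit
def two_dots_staged(a_ptr, b_ptr, c_ptr, out_ptr, scratch_ptr, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    idx = offs[:, None] * BLOCK + offs[None, :]

    a = tl.load(a_ptr + idx)
    b = tl.load(b_ptr + idx)
    c = tl.load(c_ptr + idx)

    p = tl.dot(a, b)                      # first MMA result (fp32)

    # Workaround: round-trip through global memory instead of feeding
    # the MMA-layout value straight into the next tl.dot.
    tl.store(scratch_ptr + idx, p)
    p2 = tl.load(scratch_ptr + idx)

    out = tl.dot(p2.to(tl.float16), c)    # second MMA
    tl.store(out_ptr + idx, out)


BLOCK = 16
a, b, c = (torch.randn(BLOCK, BLOCK, device="cuda", dtype=torch.float16) for _ in range(3))
scratch = torch.empty(BLOCK, BLOCK, device="cuda", dtype=torch.float32)
out = torch.empty(BLOCK, BLOCK, device="cuda", dtype=torch.float32)
two_dots_staged[(1,)](a, b, c, out, scratch, BLOCK=BLOCK)
```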

naturomics commented 1 month ago

Same error on V100.

wukonggeo commented 1 month ago

Same error on V100. Ubuntu 22, GPU A100 32G, Python 3.10, CUDA 12.1, vllm 0.5.0.post1, Triton 2.3.0.

Python code:

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

max_model_len, tp_size = 131072, 1
model_name = "THUDM/glm-4-9b-chat"
prompt = [{"role": "user", "content": "你好"}]

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True,
    dtype='half',
    enable_chunked_prefill=True,
    max_num_batched_tokens=8192
)
stop_token_ids = [151329, 151336, 151338]
sampling_params = SamplingParams(temperature=0.95, max_tokens=1024, stop_token_ids=stop_token_ids)

inputs = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
outputs = llm.generate(prompts=inputs, sampling_params=sampling_params)

print(outputs[0].outputs[0].text)
```

Log:

```
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 07-09 17:48:14 config.py:1222] Casting torch.bfloat16 to torch.float16.
INFO 07-09 17:48:14 config.py:707] Chunked prefill is enabled (EXPERIMENTAL).
INFO 07-09 17:48:14 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='THUDM/glm-4-9b-chat', speculative_config=None, tokenizer='THUDM/glm-4-9b-chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=model/glm/glm-4-9b-chat)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 07-09 17:48:15 tokenizer.py:126] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO 07-09 17:48:15 selector.py:131] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 07-09 17:48:15 selector.py:51] Using XFormers backend.
INFO 07-09 17:48:18 selector.py:131] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 07-09 17:48:18 selector.py:51] Using XFormers backend.
INFO 07-09 17:48:28 model_runner.py:160] Loading model weights took 17.5635 GB
INFO 07-09 17:48:30 gpu_executor.py:83] # GPU blocks: 13067, # CPU blocks: 6553
Processed prompts:   0%|          | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
python: /project/lib/Analysis/Allocation.cpp:43: std::pair<llvm::SmallVector, llvm::SmallVector > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.
```

tricky61 commented 1 month ago

> As suggested by Jokeren, storing the temporary values to global memory and then reloading them works on V100 with the latest Triton version.

Which version works? vllm==0.4.3 with triton==2.3.0 gives the same error.

ZG2017 commented 1 month ago

Similar problem with V100.

Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.

python==3.10 triton==2.3.0 vllm==0.5.1

MazarineGlacier commented 1 month ago

This problem arises from https://github.com/triton-lang/triton/pull/2627/files. vLLM implements its own forward kernel in prefix_prefill.py, which triggers this issue. I think we should modify the _fwd_kernel in vllm/vllm/attention/ops/prefix_prefill.py.

geekchen007 commented 1 month ago

Same problem with NVIDIA V100.

vllm-0.5.3
nvidia-nccl-cu12-2.20.5
outlines-0.0.46
torch-2.3.1
triton-2.3.1
xformers-0.0.27

grgkovac commented 4 weeks ago

Adding the following flag worked for me:

vllm serve ...  --enable-chunked-prefill=False
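For the offline LLM API, the same knob is the enable_chunked_prefill engine argument (the model name and prompt below are placeholders):

```python
# Disable chunked prefill when constructing the engine; this mirrors
# passing --enable-chunked-prefill=False to `vllm serve`.
from vllm import LLM, SamplingParams

llm = LLM(model="THUDM/glm-4-9b-chat",   # placeholder model
          trust_remote_code=True,
          enforce_eager=True,
          enable_chunked_prefill=False)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```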

MazarineGlacier commented 3 weeks ago

There is a similar issue: https://github.com/vllm-project/vllm/issues/6723

On V100, --enable-chunked-prefill and --enable-prefix-caching might trigger this crash during serving, so these features might not be usable on V100.