Open gty111 opened 9 months ago
Same problem! Have you addressed this issue?
Not yet.
Same problem! Have you addressed this bug?
Same issue. With the latest version of vLLM, it says V100 is not supported. Have you found a workaround for this problem?
But when setting prefix_pos <= 15, it runs.
Not yet.
Since the block size is 16, vLLM won't cache the prefix if prefix_pos <= 15.
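To make the block-size rounding concrete, here is a small illustration (my own sketch, assuming a fixed KV-cache block size of 16; not actual vLLM internals):

```python
# Sketch only: vLLM can only reuse whole KV-cache blocks, so the cached
# prefix length is effectively rounded down to a multiple of the block size.
# With block_size=16, any prefix_pos <= 15 caches nothing, which is why the
# Triton prefix kernel (and this crash) is never reached in that case.
BLOCK_SIZE = 16  # assumed default block size

def cached_prefix_len(prefix_pos: int, block_size: int = BLOCK_SIZE) -> int:
    """Length of the prefix that can actually be served from cached blocks."""
    return (prefix_pos // block_size) * block_size

for pos in (8, 15, 16, 17, 33):
    print(pos, "->", cached_prefix_len(pos))  # 8->0, 15->0, 16->16, 17->16, 33->32
```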
I am using an A10 GPU. Upgrading Triton from 2.1.0 to 2.2.0 solved my problem. After reading this issue (https://github.com/openai/triton/issues/1298), I found that Triton has already removed this assertion in the newest version.
Did not work for me.
Same issue on V100. Any update on V100 support?
Same problem with V100. Is there a way to rely on the paged attention kernel instead of context_attention_fwd, @caoshiyi?
Otherwise, this might be a solution: https://github.com/openai/triton/issues/1420#issuecomment-1485564996
As suggested by Jokeren, storing the temporary values to global memory and then reloading them works on V100 with the latest Triton version.
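For anyone curious what that looks like, here is a toy sketch of the pattern (my own, not vLLM's actual prefix_prefill kernel): the first tl.dot result is staged to a global scratch buffer and a second kernel reloads it, so the compiler never has to convert one MMA layout directly into another inside a single kernel.

```python
# Toy sketch of "store temporaries to global memory, then reload" -- not the
# real vLLM kernel. Chaining two tl.dot results in one kernel can hit the
# "mma -> mma layout conversion is only supported on Ampere" assertion on V100.
import torch
import triton
import triton.language as tl


@triton.jit
def _qk_to_scratch(q_ptr, k_ptr, scratch_ptr,
                   M: tl.constexpr, N: tl.constexpr, K: tl.constexpr):
    offs_m = tl.arange(0, M)
    offs_n = tl.arange(0, N)
    offs_k = tl.arange(0, K)
    q = tl.load(q_ptr + offs_m[:, None] * K + offs_k[None, :])
    k = tl.load(k_ptr + offs_k[:, None] * N + offs_n[None, :])
    qk = tl.dot(q, k)  # first MMA; result goes to global memory instead of a second dot
    tl.store(scratch_ptr + offs_m[:, None] * N + offs_n[None, :], qk)


@triton.jit
def _scratch_dot_v(scratch_ptr, v_ptr, out_ptr,
                   M: tl.constexpr, N: tl.constexpr, D: tl.constexpr):
    offs_m = tl.arange(0, M)
    offs_n = tl.arange(0, N)
    offs_d = tl.arange(0, D)
    p = tl.load(scratch_ptr + offs_m[:, None] * N + offs_n[None, :])
    v = tl.load(v_ptr + offs_n[:, None] * D + offs_d[None, :])
    out = tl.dot(p.to(v.dtype), v)  # second MMA, now fed from a plain global load
    tl.store(out_ptr + offs_m[:, None] * D + offs_d[None, :], out)


if __name__ == "__main__":
    M = N = K = D = 16  # tl.dot needs tile dims >= 16
    q = torch.randn(M, K, device="cuda", dtype=torch.float16)
    k = torch.randn(K, N, device="cuda", dtype=torch.float16)
    v = torch.randn(N, D, device="cuda", dtype=torch.float16)
    scratch = torch.empty(M, N, device="cuda", dtype=torch.float32)
    out = torch.empty(M, D, device="cuda", dtype=torch.float32)
    _qk_to_scratch[(1,)](q, k, scratch, M=M, N=N, K=K)
    _scratch_dot_v[(1,)](scratch, v, out, M=M, N=N, D=D)
    print(out)
```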
Same error on V100.
Same error on V100.

Environment: Ubuntu 22, V100 32GB, Python 3.10, CUDA 12.1, vllm 0.5.0.post1, Triton 2.3.0

Python code:

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

max_model_len, tp_size = 131072, 1
model_name = "THUDM/glm-4-9b-chat"
prompt = [{"role": "user", "content": "你好"}]

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True,
    dtype='half',
    enable_chunked_prefill=True,
    max_num_batched_tokens=8192,
)
stop_token_ids = [151329, 151336, 151338]
sampling_params = SamplingParams(temperature=0.95, max_tokens=1024, stop_token_ids=stop_token_ids)

inputs = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
outputs = llm.generate(prompts=inputs, sampling_params=sampling_params)

print(outputs[0].outputs[0].text)
```

Log:

```
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 07-09 17:48:14 config.py:1222] Casting torch.bfloat16 to torch.float16.
INFO 07-09 17:48:14 config.py:707] Chunked prefill is enabled (EXPERIMENTAL).
INFO 07-09 17:48:14 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='THUDM/glm-4-9b-chat', speculative_config=None, tokenizer='THUDM/glm-4-9b-chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=model/glm/glm-4-9b-chat)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 07-09 17:48:15 tokenizer.py:126] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO 07-09 17:48:15 selector.py:131] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 07-09 17:48:15 selector.py:51] Using XFormers backend.
INFO 07-09 17:48:18 selector.py:131] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 07-09 17:48:18 selector.py:51] Using XFormers backend.
INFO 07-09 17:48:28 model_runner.py:160] Loading model weights took 17.5635 GB
INFO 07-09 17:48:30 gpu_executor.py:83] # GPU blocks: 13067, # CPU blocks: 6553
Processed prompts: 0%| | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
python: /project/lib/Analysis/Allocation.cpp:43: std::pair<llvm::SmallVector, llvm::SmallVector > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.
```
Which version will work? vllm==0.4.3 and triton==2.3.0 give the same error.
Similar problem with V100:

Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.

python==3.10, triton==2.3.0, vllm==0.5.1
This problem arises from https://github.com/triton-lang/triton/pull/2627/files. vLLM implements a fwd kernel in prefix_prefill.py, which triggers this issue. I think we should modify the _fwd_kernel in vllm/vllm/attention/ops/prefix_prefill.py.
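As a quick check for whether a given machine is affected, here is a small sketch (my own, assumes PyTorch; not part of vLLM) that detects pre-Ampere GPUs, which is the condition under which the assertion fires:

```python
# Sketch: the assertion only triggers on pre-Ampere GPUs (compute capability < 8.0),
# e.g. V100 (7.0) or T4 (7.5); A10/A100 (8.x) are unaffected.
import torch

def is_pre_ampere() -> bool:
    major, _minor = torch.cuda.get_device_capability()
    return major < 8

if is_pre_ampere():
    print("Pre-Ampere GPU detected: consider disabling prefix caching and "
          "chunked prefill to avoid the mma -> mma layout assertion.")
```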
vllm-0.5.3
nvidia-nccl-cu12-2.20.5
outlines-0.0.46
torch-2.3.1
triton-2.3.1
xformers-0.0.27
Adding the following flag worked for me:
vllm serve ... --enable-chunked-prefill=False
There is a similar issue: https://github.com/vllm-project/vllm/issues/6723
On V100, --enable-chunked-prefill and --enable-prefix-caching might trigger this crash during serving, so these features might not be available on V100.
Same issue here on V100 Tesla 32GB.
Setting enable_prefix_caching=False and enable_chunked_prefill=False worked on V100.
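Putting the reported workaround together, here is a minimal offline sketch (kwarg names follow vLLM 0.5.x engine arguments; double-check them against your version's EngineArgs):

```python
# Sketch of the workaround above: build the engine with prefix caching and
# chunked prefill disabled so the Triton prefix_prefill kernel is never used.
from vllm import LLM, SamplingParams

llm = LLM(
    model="THUDM/glm-4-9b-chat",      # model taken from the repro earlier in the thread
    trust_remote_code=True,
    dtype="half",                     # V100 has no bfloat16 support
    enforce_eager=True,
    enable_prefix_caching=False,      # avoid the prefix caching code path
    enable_chunked_prefill=False,     # avoid chunked prefill on V100
)

outputs = llm.generate(
    ["Hello"],
    SamplingParams(temperature=0.95, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```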
When executing the script examples/offline_inference_with_prefix.py, it calls context_attention_fwd from vllm.model_executor.layers.triton_kernel.prefix_prefill, which triggers the following error:

Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.

Platform:
related to #1669