vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Chunked prefill doesn't seem to work when --kv-cache-dtype fp8 #4381

Closed rkooo567 closed 3 months ago

rkooo567 commented 7 months ago

Your current environment

H100 (but I believe it happens on any machine)

🐛 Describe the bug

--enable-chunked-prefill --max-num-batched-tokens 2048 --kv-cache-dtype "fp8"

Seems to be broken with a dtype incompatibility error.
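For reference, the same combination can be set up with the offline `LLM` API; a minimal sketch, assuming the engine kwargs mirror the flags above (the model and prompt are placeholders):

```python
# Minimal repro sketch via the offline API; the model and prompt are placeholders,
# and the kwargs mirror --enable-chunked-prefill / --max-num-batched-tokens / --kv-cache-dtype.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    enable_chunked_prefill=True,
    max_num_batched_tokens=2048,
    kv_cache_dtype="fp8",
)
outputs = llm.generate(["Hello, world!"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```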

josephrocca commented 6 months ago

Yep, I can confirm. As far as I can tell, it's undocumented that using the two together is not supported. I get this error on a dual-4090 machine:

2024-06-03T14:15:05.332567820Z     raise CompilationError(fn.src, node, repr(e)) from e
2024-06-03T14:15:05.332573240Z triton.compiler.errors.CompilationError: at 114:24:        off_v = (
2024-06-03T14:15:05.332578110Z             bn[:, None] * stride_v_cache_bs +
2024-06-03T14:15:05.332588389Z             cur_kv_head * stride_v_cache_h +
2024-06-03T14:15:05.332593288Z             offs_d[None, :] * stride_v_cache_d +
2024-06-03T14:15:05.332598147Z             (start_n + offs_n[:, None]) % block_size * stride_v_cache_bl)
2024-06-03T14:15:05.332602996Z         k = tl.load(K_cache + off_k,
2024-06-03T14:15:05.332607825Z                     mask=dim_mask[:, None] &
2024-06-03T14:15:05.332612695Z                     ((start_n + offs_n[None, :]) < cur_batch_ctx_len),
2024-06-03T14:15:05.332617564Z                     other=0.0)  # [D,N]
2024-06-03T14:15:05.332622383Z 
2024-06-03T14:15:05.332627142Z         qk = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32)  # [M,N]
2024-06-03T14:15:05.332632532Z         qk += tl.dot(q, k)
2024-06-03T14:15:05.332637411Z                         ^
2024-06-03T14:15:05.332642260Z AssertionError('Both operands must be same type. First operand (fp16) and second operand (uint8)')

Some other engine args that I used, in case they're relevant:

--quantization gptq --dtype float16 --enforce-eager --tensor-parallel-size 2
rkooo567 commented 6 months ago

Let me make a PR to raise an error for now. cc @comaniac, I believe you made this work before. Did you use kv-cache dtype fp8?
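A minimal sketch of the kind of guard such a PR could add during argument validation (the function and parameter names here are hypothetical, not the actual vLLM code):

```python
# Hypothetical validation guard: reject the unsupported combination up front
# instead of failing later inside the Triton prefix-prefill kernel.
def _verify_chunked_prefill_with_fp8_kv_cache(enable_chunked_prefill: bool,
                                              kv_cache_dtype: str) -> None:
    if enable_chunked_prefill and kv_cache_dtype.startswith("fp8"):
        raise ValueError(
            "Chunked prefill is currently not supported with "
            f"--kv-cache-dtype {kv_cache_dtype}. Either disable chunked prefill "
            "(--enable-chunked-prefill=False) or use --kv-cache-dtype auto.")
```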

comaniac commented 6 months ago

It should work with the xformers backend with paged attention, but I'm not sure whether that works with GPTQ.

K-Mistele commented 4 months ago

Same issue here. I am using Llama 3.1 8B, which has a context length of 128K. Chunked prefill is automatically enabled for models over a certain sequence length (128K is over the threshold), and I found that I had to set --enable-chunked-prefill False in order to use --kv-cache-dtype fp8.
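In code form, that workaround looks roughly like this with the offline API (a sketch; the model, dtype, and max_model_len values are placeholders):

```python
# Workaround sketch: fp8 KV cache with chunked prefill explicitly disabled.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    dtype="half",
    kv_cache_dtype="fp8",
    enable_chunked_prefill=False,  # avoid the Triton prefix-prefill path
    max_model_len=8192,            # placeholder; also stays under the 32K auto-enable threshold
)
print(llm.generate(["Hi!"], SamplingParams(max_tokens=8))[0].outputs[0].text)
```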

comaniac commented 4 months ago

That's not expected. I'll file a PR to automatically disable chunked prefill for now when an fp8 KV cache is enabled.

K-Mistele commented 4 months ago

I know it's super long, but here's the full trace:

the full very long trace ``` (constellate-vllm) (venv) constellate@1-ai-appserver-staging:/mnt/disk/AI/constellate-vllm$ python -m vllm.entrypoints.openai.api_server --dtype half --kv-cache-dtype fp8 --model meta-llama/Meta-Llama-3.1-8B-Instruct --chat-template examples/tool_chat_template_llama_3_1.jinja --enable-auto-tool-choice --tool-call-parser llama3.1 INFO 08-05 23:12:08 api_server.py:370] vLLM API server version 0.5.3.post1 INFO 08-05 23:12:08 api_server.py:371] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template='examples/tool_chat_template_llama_3_1.jinja', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=True, tool_call_parser='llama3.1', model='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='half', kv_cache_dtype='fp8', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None) WARNING 08-05 23:12:10 config.py:1439] Casting torch.bfloat16 to torch.float16. WARNING 08-05 23:12:11 config.py:1439] Casting torch.bfloat16 to torch.float16. INFO 08-05 23:12:11 config.py:483] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. 
Meanwhile, it may cause accuracy drop without a proper scaling factor WARNING 08-05 23:12:11 arg_utils.py:766] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False. INFO 08-05 23:12:11 config.py:819] Chunked prefill is enabled with max_num_batched_tokens=512. INFO 08-05 23:12:11 llm_engine.py:174] Initializing an LLM engine (v0.5.3.post1) with config: model='meta-llama/Meta-Llama-3.1-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Meta-Llama-3.1-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=fp8, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=meta-llama/Meta-Llama-3.1-8B-Instruct, use_v2_block_manager=False, enable_prefix_caching=False) INFO 08-05 23:12:11 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs. INFO 08-05 23:12:11 selector.py:54] Using XFormers backend. /mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch. @torch.library.impl_abstract("xformers_flash::flash_fwd") /mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch. @torch.library.impl_abstract("xformers_flash::flash_bwd") INFO 08-05 23:12:12 model_runner.py:720] Starting to load model meta-llama/Meta-Llama-3.1-8B-Instruct... INFO 08-05 23:12:13 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs. INFO 08-05 23:12:13 selector.py:54] Using XFormers backend. 
INFO 08-05 23:12:13 weight_utils.py:225] Using model weights format ['*.safetensors'] Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00system<|end_header_id|>\n\n" }} INFO 08-05 23:12:40 chat_utils.py:53] {%- if builtin_tools is defined or tools is not none %} INFO 08-05 23:12:40 chat_utils.py:53] {{- "Environment: ipython\n" }} INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %} INFO 08-05 23:12:40 chat_utils.py:53] {#- REMOVED BUILTIN TOOLS - NOT USED OR NEEDED FOR OPENAI COMPATIBILITY INFO 08-05 23:12:40 chat_utils.py:53] {%- if builtin_tools is defined %} INFO 08-05 23:12:40 chat_utils.py:53] {{- "Tools: " + builtin_tools | reject('equalto', 'code_interpreter') | join(", ") + "\n\n"}} INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %} INFO 08-05 23:12:40 chat_utils.py:53] #} INFO 08-05 23:12:40 chat_utils.py:53] INFO 08-05 23:12:40 chat_utils.py:53] {{- "Cutting Knowledge Date: December 2023\n" }} INFO 08-05 23:12:40 chat_utils.py:53] {{- "Today Date: " + date_string if date_string else '5 Aug 2024' + "\n\n" }} INFO 08-05 23:12:40 chat_utils.py:53] {%- if tools is not none and not tools_in_user_message %} INFO 08-05 23:12:40 chat_utils.py:53] {{- "You have access to the following functions. To call a function, please respond with JSON for a function call." }} INFO 08-05 23:12:40 chat_utils.py:53] {{- 'Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.' }} INFO 08-05 23:12:40 chat_utils.py:53] {{- "Do not use variables.\n\n" }} INFO 08-05 23:12:40 chat_utils.py:53] {%- for t in tools %} INFO 08-05 23:12:40 chat_utils.py:53] {{- t | tojson(indent=4) }} INFO 08-05 23:12:40 chat_utils.py:53] {{- "\n\n" }} INFO 08-05 23:12:40 chat_utils.py:53] {%- endfor %} INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %} INFO 08-05 23:12:40 chat_utils.py:53] {{- system_message }} INFO 08-05 23:12:40 chat_utils.py:53] {{- "<|eot_id|>" }} INFO 08-05 23:12:40 chat_utils.py:53] INFO 08-05 23:12:40 chat_utils.py:53] {#- Custom tools are passed in a user message with some extra guidance #} INFO 08-05 23:12:40 chat_utils.py:53] {%- if tools_in_user_message and not tools is none %} INFO 08-05 23:12:40 chat_utils.py:53] {#- Extract the first user message so we can plug it in here #} INFO 08-05 23:12:40 chat_utils.py:53] {%- if messages | length != 0 %} INFO 08-05 23:12:40 chat_utils.py:53] {%- set first_user_message = messages[0]['content']|trim %} INFO 08-05 23:12:40 chat_utils.py:53] {%- set messages = messages[1:] %} INFO 08-05 23:12:40 chat_utils.py:53] {%- else %} INFO 08-05 23:12:40 chat_utils.py:53] {{- raise_exception("Cannot put tools in the first user message when there's no first user message!") }} INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %} INFO 08-05 23:12:40 chat_utils.py:53] {{- '<|start_header_id|>user<|end_header_id|>\n\n' -}} INFO 08-05 23:12:40 chat_utils.py:53] {{- "Given the following functions, please respond with a JSON for a function call " }} INFO 08-05 23:12:40 chat_utils.py:53] {{- "with its proper arguments that best answers the given prompt.\n\n" }} INFO 08-05 23:12:40 chat_utils.py:53] {{- 'Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.' 
}} INFO 08-05 23:12:40 chat_utils.py:53] {{- "Do not use variables.\n\n" }} INFO 08-05 23:12:40 chat_utils.py:53] {%- for t in tools %} INFO 08-05 23:12:40 chat_utils.py:53] {{- t | tojson(indent=4) }} INFO 08-05 23:12:40 chat_utils.py:53] {{- "\n\n" }} INFO 08-05 23:12:40 chat_utils.py:53] {%- endfor %} INFO 08-05 23:12:40 chat_utils.py:53] {{- first_user_message + "<|eot_id|>"}} INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %} INFO 08-05 23:12:40 chat_utils.py:53] INFO 08-05 23:12:40 chat_utils.py:53] {%- for message in messages %} INFO 08-05 23:12:40 chat_utils.py:53] {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %} INFO 08-05 23:12:40 chat_utils.py:53] {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' }} INFO 08-05 23:12:40 chat_utils.py:53] {%- elif 'tool_calls' in message %} INFO 08-05 23:12:40 chat_utils.py:53] {%- if not message.tool_calls|length == 1 %} INFO 08-05 23:12:40 chat_utils.py:53] {{- raise_exception("This model only supports single tool-calls at once!") }} INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %} INFO 08-05 23:12:40 chat_utils.py:53] {%- set tool_call = message.tool_calls[0].function %} INFO 08-05 23:12:40 chat_utils.py:53] {%- if builtin_tools is defined and tool_call.name in builtin_tools %} INFO 08-05 23:12:40 chat_utils.py:53] {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' -}} INFO 08-05 23:12:40 chat_utils.py:53] {{- "<|python_tag|>" + tool_call.name + ".call(" }} INFO 08-05 23:12:40 chat_utils.py:53] {%- for arg_name, arg_val in tool_call.arguments | items %} INFO 08-05 23:12:40 chat_utils.py:53] {{- arg_name + '="' + arg_val + '"' }} INFO 08-05 23:12:40 chat_utils.py:53] {%- if not loop.last %} INFO 08-05 23:12:40 chat_utils.py:53] {{- ", " }} INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %} INFO 08-05 23:12:40 chat_utils.py:53] {%- endfor %} INFO 08-05 23:12:40 chat_utils.py:53] {{- ")" }} INFO 08-05 23:12:40 chat_utils.py:53] {%- else %} INFO 08-05 23:12:40 chat_utils.py:53] {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' -}} INFO 08-05 23:12:40 chat_utils.py:53] {{- '{"name": "' + tool_call.name + '", ' }} INFO 08-05 23:12:40 chat_utils.py:53] {{- '"parameters": ' }} INFO 08-05 23:12:40 chat_utils.py:53] {{- tool_call.arguments | tojson }} INFO 08-05 23:12:40 chat_utils.py:53] {{- "}" }} INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %} INFO 08-05 23:12:40 chat_utils.py:53] {%- if builtin_tools is defined %} INFO 08-05 23:12:40 chat_utils.py:53] {#- This means we're in ipython mode #} INFO 08-05 23:12:40 chat_utils.py:53] {{- "<|eom_id|>" }} INFO 08-05 23:12:40 chat_utils.py:53] {%- else %} INFO 08-05 23:12:40 chat_utils.py:53] {{- "<|eot_id|>" }} INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %} INFO 08-05 23:12:40 chat_utils.py:53] {%- elif message.role == "tool" or message.role == "ipython" %} INFO 08-05 23:12:40 chat_utils.py:53] {{- "<|start_header_id|>ipython<|end_header_id|>\n\n" }} INFO 08-05 23:12:40 chat_utils.py:53] {%- if message.content is mapping or message.content is iterable %} INFO 08-05 23:12:40 chat_utils.py:53] {{- message.content | tojson }} INFO 08-05 23:12:40 chat_utils.py:53] {%- else %} INFO 08-05 23:12:40 chat_utils.py:53] {{- message.content }} INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %} INFO 08-05 23:12:40 chat_utils.py:53] {{- "<|eot_id|>" }} INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %} INFO 08-05 23:12:40 chat_utils.py:53] {%- endfor %} INFO 08-05 23:12:40 chat_utils.py:53] {%- 
if add_generation_prompt %} INFO 08-05 23:12:40 chat_utils.py:53] {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' }} INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %} INFO 08-05 23:12:40 serving_chat.py:80] "Auto" tool choice has been enabled please note that while the parallel_tool_calls client option is preset for compatibility reasons, it will be ignored. WARNING 08-05 23:12:40 serving_embedding.py:171] embedding_mode is False. Embedding API will not work. INFO 08-05 23:12:40 chat_utils.py:53] Using supplied chat template: INFO 08-05 23:12:40 chat_utils.py:53] {{- bos_token }} INFO 08-05 23:12:40 chat_utils.py:53] {%- if custom_tools is defined %} INFO 08-05 23:12:40 chat_utils.py:53] {%- set tools = custom_tools %} INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %} INFO 08-05 23:12:40 chat_utils.py:53] {%- if not tools_in_user_message is defined %} INFO 08-05 23:12:40 chat_utils.py:53] {%- set tools_in_user_message = true %} INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %} INFO 08-05 23:12:40 chat_utils.py:53] {%- if not date_string is defined %} INFO 08-05 23:12:40 chat_utils.py:53] {%- set date_string = "26 Jul 2024" %} INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %} INFO 08-05 23:12:40 chat_utils.py:53] {%- if not tools is defined %} INFO 08-05 23:12:40 chat_utils.py:53] {%- set tools = none %} INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %} INFO 08-05 23:12:40 chat_utils.py:53] {#- This block extracts the system message, so we can slot it into the right place. #} INFO 08-05 23:12:40 chat_utils.py:53] {%- if messages[0]['role'] == 'system' %} INFO 08-05 23:12:40 chat_utils.py:53] {%- set system_message = messages[0]['content']|trim %} INFO 08-05 23:12:40 chat_utils.py:53] {%- set messages = messages[1:] %} INFO 08-05 23:12:40 chat_utils.py:53] {%- else %} INFO 08-05 23:12:40 chat_utils.py:53] {%- set system_message = "" %} INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %} INFO 08-05 23:12:40 chat_utils.py:53] INFO 08-05 23:12:40 chat_utils.py:53] {#- System message + builtin tools #} INFO 08-05 23:12:40 chat_utils.py:53] {{- "<|start_header_id|>system<|end_header_id|>\n\n" }} INFO 08-05 23:12:40 chat_utils.py:53] {%- if builtin_tools is defined or tools is not none %} INFO 08-05 23:12:40 chat_utils.py:53] {{- "Environment: ipython\n" }} INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %} INFO 08-05 23:12:40 chat_utils.py:53] {#- REMOVED BUILTIN TOOLS - NOT USED OR NEEDED FOR OPENAI COMPATIBILITY INFO 08-05 23:12:40 chat_utils.py:53] {%- if builtin_tools is defined %} INFO 08-05 23:12:40 chat_utils.py:53] {{- "Tools: " + builtin_tools | reject('equalto', 'code_interpreter') | join(", ") + "\n\n"}} INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %} INFO 08-05 23:12:40 chat_utils.py:53] #} INFO 08-05 23:12:40 chat_utils.py:53] INFO 08-05 23:12:40 chat_utils.py:53] {{- "Cutting Knowledge Date: December 2023\n" }} INFO 08-05 23:12:40 chat_utils.py:53] {{- "Today Date: " + date_string if date_string else '5 Aug 2024' + "\n\n" }} INFO 08-05 23:12:40 chat_utils.py:53] {%- if tools is not none and not tools_in_user_message %} INFO 08-05 23:12:40 chat_utils.py:53] {{- "You have access to the following functions. To call a function, please respond with JSON for a function call." }} INFO 08-05 23:12:40 chat_utils.py:53] {{- 'Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.' 
}} INFO 08-05 23:12:40 chat_utils.py:53] {{- "Do not use variables.\n\n" }} INFO 08-05 23:12:40 chat_utils.py:53] {%- for t in tools %} INFO 08-05 23:12:40 chat_utils.py:53] {{- t | tojson(indent=4) }} INFO 08-05 23:12:40 chat_utils.py:53] {{- "\n\n" }} INFO 08-05 23:12:40 chat_utils.py:53] {%- endfor %} INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %} INFO 08-05 23:12:40 chat_utils.py:53] {{- system_message }} INFO 08-05 23:12:40 chat_utils.py:53] {{- "<|eot_id|>" }} INFO 08-05 23:12:40 chat_utils.py:53] INFO 08-05 23:12:40 chat_utils.py:53] {#- Custom tools are passed in a user message with some extra guidance #} INFO 08-05 23:12:40 chat_utils.py:53] {%- if tools_in_user_message and not tools is none %} INFO 08-05 23:12:40 chat_utils.py:53] {#- Extract the first user message so we can plug it in here #} INFO 08-05 23:12:40 chat_utils.py:53] {%- if messages | length != 0 %} INFO 08-05 23:12:40 chat_utils.py:53] {%- set first_user_message = messages[0]['content']|trim %} INFO 08-05 23:12:40 chat_utils.py:53] {%- set messages = messages[1:] %} INFO 08-05 23:12:40 chat_utils.py:53] {%- else %} INFO 08-05 23:12:40 chat_utils.py:53] {{- raise_exception("Cannot put tools in the first user message when there's no first user message!") }} INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %} INFO 08-05 23:12:40 chat_utils.py:53] {{- '<|start_header_id|>user<|end_header_id|>\n\n' -}} INFO 08-05 23:12:40 chat_utils.py:53] {{- "Given the following functions, please respond with a JSON for a function call " }} INFO 08-05 23:12:40 chat_utils.py:53] {{- "with its proper arguments that best answers the given prompt.\n\n" }} INFO 08-05 23:12:40 chat_utils.py:53] {{- 'Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.' 
}} INFO 08-05 23:12:40 chat_utils.py:53] {{- "Do not use variables.\n\n" }} INFO 08-05 23:12:40 chat_utils.py:53] {%- for t in tools %} INFO 08-05 23:12:40 chat_utils.py:53] {{- t | tojson(indent=4) }} INFO 08-05 23:12:40 chat_utils.py:53] {{- "\n\n" }} INFO 08-05 23:12:40 chat_utils.py:53] {%- endfor %} INFO 08-05 23:12:40 chat_utils.py:53] {{- first_user_message + "<|eot_id|>"}} INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %} INFO 08-05 23:12:40 chat_utils.py:53] INFO 08-05 23:12:40 chat_utils.py:53] {%- for message in messages %} INFO 08-05 23:12:40 chat_utils.py:53] {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %} INFO 08-05 23:12:40 chat_utils.py:53] {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' }} INFO 08-05 23:12:40 chat_utils.py:53] {%- elif 'tool_calls' in message %} INFO 08-05 23:12:40 chat_utils.py:53] {%- if not message.tool_calls|length == 1 %} INFO 08-05 23:12:40 chat_utils.py:53] {{- raise_exception("This model only supports single tool-calls at once!") }} INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %} INFO 08-05 23:12:40 chat_utils.py:53] {%- set tool_call = message.tool_calls[0].function %} INFO 08-05 23:12:40 chat_utils.py:53] {%- if builtin_tools is defined and tool_call.name in builtin_tools %} INFO 08-05 23:12:40 chat_utils.py:53] {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' -}} INFO 08-05 23:12:40 chat_utils.py:53] {{- "<|python_tag|>" + tool_call.name + ".call(" }} INFO 08-05 23:12:40 chat_utils.py:53] {%- for arg_name, arg_val in tool_call.arguments | items %} INFO 08-05 23:12:40 chat_utils.py:53] {{- arg_name + '="' + arg_val + '"' }} INFO 08-05 23:12:40 chat_utils.py:53] {%- if not loop.last %} INFO 08-05 23:12:40 chat_utils.py:53] {{- ", " }} INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %} INFO 08-05 23:12:40 chat_utils.py:53] {%- endfor %} INFO 08-05 23:12:40 chat_utils.py:53] {{- ")" }} INFO 08-05 23:12:40 chat_utils.py:53] {%- else %} INFO 08-05 23:12:40 chat_utils.py:53] {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' -}} INFO 08-05 23:12:40 chat_utils.py:53] {{- '{"name": "' + tool_call.name + '", ' }} INFO 08-05 23:12:40 chat_utils.py:53] {{- '"parameters": ' }} INFO 08-05 23:12:40 chat_utils.py:53] {{- tool_call.arguments | tojson }} INFO 08-05 23:12:40 chat_utils.py:53] {{- "}" }} INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %} INFO 08-05 23:12:40 chat_utils.py:53] {%- if builtin_tools is defined %} INFO 08-05 23:12:40 chat_utils.py:53] {#- This means we're in ipython mode #} INFO 08-05 23:12:40 chat_utils.py:53] {{- "<|eom_id|>" }} INFO 08-05 23:12:40 chat_utils.py:53] {%- else %} INFO 08-05 23:12:40 chat_utils.py:53] {{- "<|eot_id|>" }} INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %} INFO 08-05 23:12:40 chat_utils.py:53] {%- elif message.role == "tool" or message.role == "ipython" %} INFO 08-05 23:12:40 chat_utils.py:53] {{- "<|start_header_id|>ipython<|end_header_id|>\n\n" }} INFO 08-05 23:12:40 chat_utils.py:53] {%- if message.content is mapping or message.content is iterable %} INFO 08-05 23:12:40 chat_utils.py:53] {{- message.content | tojson }} INFO 08-05 23:12:40 chat_utils.py:53] {%- else %} INFO 08-05 23:12:40 chat_utils.py:53] {{- message.content }} INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %} INFO 08-05 23:12:40 chat_utils.py:53] {{- "<|eot_id|>" }} INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %} INFO 08-05 23:12:40 chat_utils.py:53] {%- endfor %} INFO 08-05 23:12:40 chat_utils.py:53] {%- 
if add_generation_prompt %} INFO 08-05 23:12:40 chat_utils.py:53] {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' }} INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %} INFO 08-05 23:12:40 api_server.py:346] Available routes are: INFO 08-05 23:12:40 api_server.py:351] Route: /openapi.json, Methods: HEAD, GET INFO 08-05 23:12:40 api_server.py:351] Route: /docs, Methods: HEAD, GET INFO 08-05 23:12:40 api_server.py:351] Route: /docs/oauth2-redirect, Methods: HEAD, GET INFO 08-05 23:12:40 api_server.py:351] Route: /redoc, Methods: HEAD, GET INFO 08-05 23:12:40 api_server.py:351] Route: /health, Methods: GET INFO 08-05 23:12:40 api_server.py:351] Route: /tokenize, Methods: POST INFO 08-05 23:12:40 api_server.py:351] Route: /detokenize, Methods: POST INFO 08-05 23:12:40 api_server.py:351] Route: /v1/models, Methods: GET INFO 08-05 23:12:40 api_server.py:351] Route: /version, Methods: GET INFO 08-05 23:12:40 api_server.py:351] Route: /v1/chat/completions, Methods: POST INFO 08-05 23:12:40 api_server.py:351] Route: /v1/completions, Methods: POST INFO 08-05 23:12:40 api_server.py:351] Route: /v1/embeddings, Methods: POST INFO: Started server process [3431585] INFO: Waiting for application startup. INFO: Application startup complete. INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit) INFO 08-05 23:12:43 serving_chat.py:149] Created full prompt INFO 08-05 23:12:43 serving_chat.py:150] <|begin_of_text|><|start_header_id|>system<|end_header_id|> INFO 08-05 23:12:43 serving_chat.py:150] INFO 08-05 23:12:43 serving_chat.py:150] Environment: ipython INFO 08-05 23:12:43 serving_chat.py:150] Cutting Knowledge Date: December 2023 INFO 08-05 23:12:43 serving_chat.py:150] Today Date: 26 Jul 2024<|eot_id|><|start_header_id|>user<|end_header_id|> INFO 08-05 23:12:43 serving_chat.py:150] INFO 08-05 23:12:43 serving_chat.py:150] Given the following functions, please respond with a JSON for a function call with its proper arguments that best answers the given prompt. INFO 08-05 23:12:43 serving_chat.py:150] INFO 08-05 23:12:43 serving_chat.py:150] Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.Do not use variables. INFO 08-05 23:12:43 serving_chat.py:150] INFO 08-05 23:12:43 serving_chat.py:150] { INFO 08-05 23:12:43 serving_chat.py:150] "type": "function", INFO 08-05 23:12:43 serving_chat.py:150] "function": { INFO 08-05 23:12:43 serving_chat.py:150] "name": "get_current_weather", INFO 08-05 23:12:43 serving_chat.py:150] "description": "Get the current weather in a given location", INFO 08-05 23:12:43 serving_chat.py:150] "parameters": { INFO 08-05 23:12:43 serving_chat.py:150] "type": "object", INFO 08-05 23:12:43 serving_chat.py:150] "properties": { INFO 08-05 23:12:43 serving_chat.py:150] "city": { INFO 08-05 23:12:43 serving_chat.py:150] "type": "string", INFO 08-05 23:12:43 serving_chat.py:150] "description": "The city to find the weather for, e.g. 'San Francisco'" INFO 08-05 23:12:43 serving_chat.py:150] }, INFO 08-05 23:12:43 serving_chat.py:150] "state": { INFO 08-05 23:12:43 serving_chat.py:150] "type": "string", INFO 08-05 23:12:43 serving_chat.py:150] "description": "the two-letter abbreviation for the state that the city is in, e.g. 
'CA' which would mean 'California'" INFO 08-05 23:12:43 serving_chat.py:150] }, INFO 08-05 23:12:43 serving_chat.py:150] "unit": { INFO 08-05 23:12:43 serving_chat.py:150] "type": "string", INFO 08-05 23:12:43 serving_chat.py:150] "description": "The unit to fetch the temperature in", INFO 08-05 23:12:43 serving_chat.py:150] "enum": [ INFO 08-05 23:12:43 serving_chat.py:150] "celsius", INFO 08-05 23:12:43 serving_chat.py:150] "fahrenheit" INFO 08-05 23:12:43 serving_chat.py:150] ] INFO 08-05 23:12:43 serving_chat.py:150] } INFO 08-05 23:12:43 serving_chat.py:150] } INFO 08-05 23:12:43 serving_chat.py:150] } INFO 08-05 23:12:43 serving_chat.py:150] } INFO 08-05 23:12:43 serving_chat.py:150] } INFO 08-05 23:12:43 serving_chat.py:150] INFO 08-05 23:12:43 serving_chat.py:150] Hi! How are you doing today?<|eot_id|><|start_header_id|>assistant<|end_header_id|> INFO 08-05 23:12:43 serving_chat.py:150] INFO 08-05 23:12:43 serving_chat.py:150] I'm doing well! How can I help you?<|eot_id|><|start_header_id|>user<|end_header_id|> INFO 08-05 23:12:43 serving_chat.py:150] INFO 08-05 23:12:43 serving_chat.py:150] Can you tell me what the weather will be in Dallas and San Francisco? I like fahrenheit.<|eot_id|><|start_header_id|>assistant<|end_header_id|> INFO 08-05 23:12:43 serving_chat.py:150] INFO 08-05 23:12:43 serving_chat.py:150] INFO 08-05 23:12:43 logger.py:36] Received request chat-30088efd6e3645e2b07ea083bb9d7446: prompt: '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nEnvironment: ipython\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nGiven the following functions, please respond with a JSON for a function call with its proper arguments that best answers the given prompt.\n\nRespond in the format {"name": function name, "parameters": dictionary of argument name and its value}.Do not use variables.\n\n{\n "type": "function",\n "function": {\n "name": "get_current_weather",\n "description": "Get the current weather in a given location",\n "parameters": {\n "type": "object",\n "properties": {\n "city": {\n "type": "string",\n "description": "The city to find the weather for, e.g. \'San Francisco\'"\n },\n "state": {\n "type": "string",\n "description": "the two-letter abbreviation for the state that the city is in, e.g. \'CA\' which would mean \'California\'"\n },\n "unit": {\n "type": "string",\n "description": "The unit to fetch the temperature in",\n "enum": [\n "celsius",\n "fahrenheit"\n ]\n }\n }\n }\n }\n}\n\nHi! How are you doing today?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nI\'m doing well! How can I help you?<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCan you tell me what the weather will be in Dallas and San Francisco? 
I like fahrenheit.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=130751, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [128000, 128006, 9125, 128007, 271, 13013, 25, 6125, 27993, 198, 38766, 1303, 33025, 2696, 25, 6790, 220, 2366, 18, 198, 15724, 2696, 25, 220, 1627, 10263, 220, 2366, 19, 128009, 128006, 882, 128007, 271, 22818, 279, 2768, 5865, 11, 4587, 6013, 449, 264, 4823, 369, 264, 734, 1650, 449, 1202, 6300, 6105, 430, 1888, 11503, 279, 2728, 10137, 382, 66454, 304, 279, 3645, 5324, 609, 794, 734, 836, 11, 330, 14105, 794, 11240, 315, 5811, 836, 323, 1202, 907, 7966, 5519, 539, 1005, 7482, 382, 517, 262, 330, 1337, 794, 330, 1723, 761, 262, 330, 1723, 794, 341, 286, 330, 609, 794, 330, 456, 11327, 70464, 761, 286, 330, 4789, 794, 330, 1991, 279, 1510, 9282, 304, 264, 2728, 3813, 761, 286, 330, 14105, 794, 341, 310, 330, 1337, 794, 330, 1735, 761, 310, 330, 13495, 794, 341, 394, 330, 9103, 794, 341, 504, 330, 1337, 794, 330, 928, 761, 504, 330, 4789, 794, 330, 791, 3363, 311, 1505, 279, 9282, 369, 11, 384, 1326, 13, 364, 24661, 13175, 42265, 394, 1173, 394, 330, 2513, 794, 341, 504, 330, 1337, 794, 330, 928, 761, 504, 330, 4789, 794, 330, 1820, 1403, 80468, 72578, 369, 279, 1614, 430, 279, 3363, 374, 304, 11, 384, 1326, 13, 364, 5158, 6, 902, 1053, 3152, 364, 46510, 42265, 394, 1173, 394, 330, 3928, 794, 341, 504, 330, 1337, 794, 330, 928, 761, 504, 330, 4789, 794, 330, 791, 5089, 311, 7963, 279, 9499, 304, 761, 504, 330, 9195, 794, 2330, 667, 330, 66, 41347, 761, 667, 330, 69, 49010, 702, 504, 5243, 394, 457, 310, 457, 286, 457, 262, 457, 633, 13347, 0, 2650, 527, 499, 3815, 3432, 30, 128009, 128006, 78191, 128007, 271, 40, 2846, 3815, 1664, 0, 2650, 649, 358, 1520, 499, 30, 128009, 128006, 882, 128007, 271, 6854, 499, 3371, 757, 1148, 279, 9282, 690, 387, 304, 19051, 323, 5960, 13175, 30, 358, 1093, 282, 49010, 13, 128009, 128006, 78191, 128007, 271], lora_request: None, prompt_adapter_request: None. INFO 08-05 23:12:43 async_llm_engine.py:174] Added request chat-30088efd6e3645e2b07ea083bb9d7446. 
ERROR 08-05 23:12:45 async_llm_engine.py:57] Engine background task failed ERROR 08-05 23:12:45 async_llm_engine.py:57] Traceback (most recent call last): ERROR 08-05 23:12:45 async_llm_engine.py:57] File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/language/core.py", line 35, in wrapper ERROR 08-05 23:12:45 async_llm_engine.py:57] return fn(*args, **kwargs) ERROR 08-05 23:12:45 async_llm_engine.py:57] ^^^^^^^^^^^^^^^^^^^ ERROR 08-05 23:12:45 async_llm_engine.py:57] File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/language/core.py", line 1534, in dot ERROR 08-05 23:12:45 async_llm_engine.py:57] return semantic.dot(input, other, acc, input_precision, max_num_imprecise_acc, out_dtype, _builder) ERROR 08-05 23:12:45 async_llm_engine.py:57] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 08-05 23:12:45 async_llm_engine.py:57] File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/language/semantic.py", line 1355, in dot ERROR 08-05 23:12:45 async_llm_engine.py:57] assert_dtypes_valid(lhs.dtype, rhs.dtype, builder.options) ERROR 08-05 23:12:45 async_llm_engine.py:57] File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/language/semantic.py", line 1328, in assert_dtypes_valid ERROR 08-05 23:12:45 async_llm_engine.py:57] assert lhs_dtype == rhs_dtype, f"First input ({lhs_dtype}) and second input ({rhs_dtype}) must have the same dtype!" ERROR 08-05 23:12:45 async_llm_engine.py:57] ^^^^^^^^^^^^^^^^^^^^^^ ERROR 08-05 23:12:45 async_llm_engine.py:57] AssertionError: First input (fp16) and second input (uint8) must have the same dtype! ERROR 08-05 23:12:45 async_llm_engine.py:57] ERROR 08-05 23:12:45 async_llm_engine.py:57] The above exception was the direct cause of the following exception: ERROR 08-05 23:12:45 async_llm_engine.py:57] ERROR 08-05 23:12:45 async_llm_engine.py:57] Traceback (most recent call last): ERROR 08-05 23:12:45 async_llm_engine.py:57] File "/mnt/disk/AI/constellate-vllm/vllm/engine/async_llm_engine.py", line 47, in _log_task_completion ERROR 08-05 23:12:45 async_llm_engine.py:57] return_value = task.result() ERROR 08-05 23:12:45 async_llm_engine.py:57] ^^^^^^^^^^^^^ ERROR 08-05 23:12:45 async_llm_engine.py:57] File "/mnt/disk/AI/constellate-vllm/vllm/engine/async_llm_engine.py", line 642, in run_engine_loop ERROR 08-05 23:12:45 async_llm_engine.py:57] result = task.result() ERROR 08-05 23:12:45 async_llm_engine.py:57] ^^^^^^^^^^^^^ ERROR 08-05 23:12:45 async_llm_engine.py:57] File "/mnt/disk/AI/constellate-vllm/vllm/engine/async_llm_engine.py", line 585, in engine_step ERROR 08-05 23:12:45 async_llm_engine.py:57] request_outputs = await self.engine.step_async(virtual_engine) ERROR 08-05 23:12:45 async_llm_engine.py:57] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 08-05 23:12:45 async_llm_engine.py:57] File "/mnt/disk/AI/constellate-vllm/vllm/engine/async_llm_engine.py", line 254, in step_async ERROR 08-05 23:12:45 async_llm_engine.py:57] output = await self.model_executor.execute_model_async( ERROR 08-05 23:12:45 async_llm_engine.py:57] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 08-05 23:12:45 async_llm_engine.py:57] File "/mnt/disk/AI/constellate-vllm/vllm/executor/gpu_executor.py", line 159, in execute_model_async ERROR 08-05 23:12:45 async_llm_engine.py:57] output = await make_async(self.driver_worker.execute_model ERROR 08-05 23:12:45 async_llm_engine.py:57] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 
ERROR 08-05 23:12:45 async_llm_engine.py:57] File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run ERROR 08-05 23:12:45 async_llm_engine.py:57] result = self.fn(*self.args, **self.kwargs) ERROR 08-05 23:12:45 async_llm_engine.py:57] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 08-05 23:12:45 async_llm_engine.py:57] File "/mnt/disk/AI/constellate-vllm/vllm/worker/worker_base.py", line 273, in execute_model ERROR 08-05 23:12:45 async_llm_engine.py:57] output = self.model_runner.execute_model( ERROR 08-05 23:12:45 async_llm_engine.py:57] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 08-05 23:12:45 async_llm_engine.py:57] File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context ERROR 08-05 23:12:45 async_llm_engine.py:57] return func(*args, **kwargs) ERROR 08-05 23:12:45 async_llm_engine.py:57] ^^^^^^^^^^^^^^^^^^^^^ ERROR 08-05 23:12:45 async_llm_engine.py:57] File "/mnt/disk/AI/constellate-vllm/vllm/worker/model_runner.py", line 1363, in execute_model ERROR 08-05 23:12:45 async_llm_engine.py:57] hidden_or_intermediate_states = model_executable( ERROR 08-05 23:12:45 async_llm_engine.py:57] ^^^^^^^^^^^^^^^^^ ERROR 08-05 23:12:45 async_llm_engine.py:57] File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl ERROR 08-05 23:12:45 async_llm_engine.py:57] return self._call_impl(*args, **kwargs) ERROR 08-05 23:12:45 async_llm_engine.py:57] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 08-05 23:12:45 async_llm_engine.py:57] File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl ERROR 08-05 23:12:45 async_llm_engine.py:57] return forward_call(*args, **kwargs) ERROR 08-05 23:12:45 async_llm_engine.py:57] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 08-05 23:12:45 async_llm_engine.py:57] File "/mnt/disk/AI/constellate-vllm/vllm/model_executor/models/llama.py", line 422, in forward ERROR 08-05 23:12:45 async_llm_engine.py:57] model_output = self.model(input_ids, positions, kv_caches, ERROR 08-05 23:12:45 async_llm_engine.py:57] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 08-05 23:12:45 async_llm_engine.py:57] File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl ERROR 08-05 23:12:45 async_llm_engine.py:57] return self._call_impl(*args, **kwargs) ERROR 08-05 23:12:45 async_llm_engine.py:57] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 08-05 23:12:45 async_llm_engine.py:57] File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl ERROR 08-05 23:12:45 async_llm_engine.py:57] return forward_call(*args, **kwargs) ERROR 08-05 23:12:45 async_llm_engine.py:57] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 08-05 23:12:45 async_llm_engine.py:57] File "/mnt/disk/AI/constellate-vllm/vllm/model_executor/models/llama.py", line 322, in forward ERROR 08-05 23:12:45 async_llm_engine.py:57] hidden_states, residual = layer( ERROR 08-05 23:12:45 async_llm_engine.py:57] ^^^^^^ ERROR 08-05 23:12:45 async_llm_engine.py:57] File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl ERROR 08-05 23:12:45 async_llm_engine.py:57] return self._call_impl(*args, **kwargs) ERROR 08-05 23:12:45 async_llm_engine.py:57] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 08-05 23:12:45 async_llm_engine.py:57] File 
"/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl ERROR 08-05 23:12:45 async_llm_engine.py:57] return forward_call(*args, **kwargs) ERROR 08-05 23:12:45 async_llm_engine.py:57] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 08-05 23:12:45 async_llm_engine.py:57] File "/mnt/disk/AI/constellate-vllm/vllm/model_executor/models/llama.py", line 245, in forward ERROR 08-05 23:12:45 async_llm_engine.py:57] hidden_states = self.self_attn( ERROR 08-05 23:12:45 async_llm_engine.py:57] ^^^^^^^^^^^^^^^ ERROR 08-05 23:12:45 async_llm_engine.py:57] File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl ERROR 08-05 23:12:45 async_llm_engine.py:57] return self._call_impl(*args, **kwargs) ERROR 08-05 23:12:45 async_llm_engine.py:57] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 08-05 23:12:45 async_llm_engine.py:57] File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl ERROR 08-05 23:12:45 async_llm_engine.py:57] return forward_call(*args, **kwargs) ERROR 08-05 23:12:45 async_llm_engine.py:57] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 08-05 23:12:45 async_llm_engine.py:57] File "/mnt/disk/AI/constellate-vllm/vllm/model_executor/models/llama.py", line 175, in forward ERROR 08-05 23:12:45 async_llm_engine.py:57] attn_output = self.attn(q, k, v, kv_cache, attn_metadata) ERROR 08-05 23:12:45 async_llm_engine.py:57] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 08-05 23:12:45 async_llm_engine.py:57] File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl ERROR 08-05 23:12:45 async_llm_engine.py:57] return self._call_impl(*args, **kwargs) ERROR 08-05 23:12:45 async_llm_engine.py:57] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 08-05 23:12:45 async_llm_engine.py:57] File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl ERROR 08-05 23:12:45 async_llm_engine.py:57] return forward_call(*args, **kwargs) ERROR 08-05 23:12:45 async_llm_engine.py:57] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 08-05 23:12:45 async_llm_engine.py:57] File "/mnt/disk/AI/constellate-vllm/vllm/attention/layer.py", line 98, in forward ERROR 08-05 23:12:45 async_llm_engine.py:57] return self.impl.forward(query, ERROR 08-05 23:12:45 async_llm_engine.py:57] ^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 08-05 23:12:45 async_llm_engine.py:57] File "/mnt/disk/AI/constellate-vllm/vllm/attention/backends/xformers.py", line 603, in forward ERROR 08-05 23:12:45 async_llm_engine.py:57] out = PagedAttention.forward_prefix( ERROR 08-05 23:12:45 async_llm_engine.py:57] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 08-05 23:12:45 async_llm_engine.py:57] File "/mnt/disk/AI/constellate-vllm/vllm/attention/ops/paged_attn.py", line 208, in forward_prefix ERROR 08-05 23:12:45 async_llm_engine.py:57] context_attention_fwd( ERROR 08-05 23:12:45 async_llm_engine.py:57] File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context ERROR 08-05 23:12:45 async_llm_engine.py:57] return func(*args, **kwargs) ERROR 08-05 23:12:45 async_llm_engine.py:57] ^^^^^^^^^^^^^^^^^^^^^ ERROR 08-05 23:12:45 async_llm_engine.py:57] File "/mnt/disk/AI/constellate-vllm/vllm/attention/ops/prefix_prefill.py", line 765, in context_attention_fwd ERROR 08-05 23:12:45 async_llm_engine.py:57] _fwd_kernel[grid]( ERROR 08-05 
23:12:45 async_llm_engine.py:57] File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/runtime/jit.py", line 345, in ERROR 08-05 23:12:45 async_llm_engine.py:57] return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) ERROR 08-05 23:12:45 async_llm_engine.py:57] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 08-05 23:12:45 async_llm_engine.py:57] File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/runtime/jit.py", line 662, in run ERROR 08-05 23:12:45 async_llm_engine.py:57] kernel = self.compile( ERROR 08-05 23:12:45 async_llm_engine.py:57] ^^^^^^^^^^^^^ ERROR 08-05 23:12:45 async_llm_engine.py:57] File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/compiler/compiler.py", line 276, in compile ERROR 08-05 23:12:45 async_llm_engine.py:57] module = src.make_ir(options, codegen_fns, context) ERROR 08-05 23:12:45 async_llm_engine.py:57] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 08-05 23:12:45 async_llm_engine.py:57] File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/compiler/compiler.py", line 113, in make_ir ERROR 08-05 23:12:45 async_llm_engine.py:57] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns) ERROR 08-05 23:12:45 async_llm_engine.py:57] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 08-05 23:12:45 async_llm_engine.py:57] triton.compiler.errors.CompilationError: at 114:14: ERROR 08-05 23:12:45 async_llm_engine.py:57] off_v = ( ERROR 08-05 23:12:45 async_llm_engine.py:57] bn[:, None] * stride_v_cache_bs + ERROR 08-05 23:12:45 async_llm_engine.py:57] cur_kv_head * stride_v_cache_h + ERROR 08-05 23:12:45 async_llm_engine.py:57] offs_d[None, :] * stride_v_cache_d + ERROR 08-05 23:12:45 async_llm_engine.py:57] (start_n + offs_n[:, None]) % block_size * stride_v_cache_bl) ERROR 08-05 23:12:45 async_llm_engine.py:57] k = tl.load(K_cache + off_k, ERROR 08-05 23:12:45 async_llm_engine.py:57] mask=dim_mask[:, None] & ERROR 08-05 23:12:45 async_llm_engine.py:57] ((start_n + offs_n[None, :]) < cur_batch_ctx_len), ERROR 08-05 23:12:45 async_llm_engine.py:57] other=0.0) # [D,N] ERROR 08-05 23:12:45 async_llm_engine.py:57] ERROR 08-05 23:12:45 async_llm_engine.py:57] qk = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32) # [M,N] ERROR 08-05 23:12:45 async_llm_engine.py:57] qk += tl.dot(q, k) ERROR 08-05 23:12:45 async_llm_engine.py:57] ^ Exception in callback _log_task_completion(error_callback=>)(, None)>) at /mnt/disk/AI/constellate-vllm/vllm/engine/async_llm_engine.py:37 handle: >)(, None)>) at /mnt/disk/AI/constellate-vllm/vllm/engine/async_llm_engine.py:37> Traceback (most recent call last): File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/language/core.py", line 35, in wrapper return fn(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^ File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/language/core.py", line 1534, in dot return semantic.dot(input, other, acc, input_precision, max_num_imprecise_acc, out_dtype, _builder) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/language/semantic.py", line 1355, in dot assert_dtypes_valid(lhs.dtype, rhs.dtype, builder.options) File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/language/semantic.py", line 1328, in assert_dtypes_valid assert 
lhs_dtype == rhs_dtype, f"First input ({lhs_dtype}) and second input ({rhs_dtype}) must have the same dtype!" ^^^^^^^^^^^^^^^^^^^^^^ AssertionError: First input (fp16) and second input (uint8) must have the same dtype! The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/mnt/disk/AI/constellate-vllm/vllm/engine/async_llm_engine.py", line 47, in _log_task_completion return_value = task.result() ^^^^^^^^^^^^^ File "/mnt/disk/AI/constellate-vllm/vllm/engine/async_llm_engine.py", line 642, in run_engine_loop result = task.result() ^^^^^^^^^^^^^ File "/mnt/disk/AI/constellate-vllm/vllm/engine/async_llm_engine.py", line 585, in engine_step request_outputs = await self.engine.step_async(virtual_engine) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/disk/AI/constellate-vllm/vllm/engine/async_llm_engine.py", line 254, in step_async output = await self.model_executor.execute_model_async( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/disk/AI/constellate-vllm/vllm/executor/gpu_executor.py", line 159, in execute_model_async output = await make_async(self.driver_worker.execute_model ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run result = self.fn(*self.args, **self.kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/disk/AI/constellate-vllm/vllm/worker/worker_base.py", line 273, in execute_model output = self.model_runner.execute_model( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context return func(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/mnt/disk/AI/constellate-vllm/vllm/worker/model_runner.py", line 1363, in execute_model hidden_or_intermediate_states = model_executable( ^^^^^^^^^^^^^^^^^ File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl return self._call_impl(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl return forward_call(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/disk/AI/constellate-vllm/vllm/model_executor/models/llama.py", line 422, in forward model_output = self.model(input_ids, positions, kv_caches, ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl return self._call_impl(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl return forward_call(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/disk/AI/constellate-vllm/vllm/model_executor/models/llama.py", line 322, in forward hidden_states, residual = layer( ^^^^^^ File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl return self._call_impl(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl return forward_call(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/disk/AI/constellate-vllm/vllm/model_executor/models/llama.py", line 245, in forward hidden_states = self.self_attn( ^^^^^^^^^^^^^^^ File 
"/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl return self._call_impl(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl return forward_call(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/disk/AI/constellate-vllm/vllm/model_executor/models/llama.py", line 175, in forward attn_output = self.attn(q, k, v, kv_cache, attn_metadata) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl return self._call_impl(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl return forward_call(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/disk/AI/constellate-vllm/vllm/attention/layer.py", line 98, in forward return self.impl.forward(query, ^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/disk/AI/constellate-vllm/vllm/attention/backends/xformers.py", line 603, in forward out = PagedAttention.forward_prefix( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/disk/AI/constellate-vllm/vllm/attention/ops/paged_attn.py", line 208, in forward_prefix context_attention_fwd( File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context return func(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/mnt/disk/AI/constellate-vllm/vllm/attention/ops/prefix_prefill.py", line 765, in context_attention_fwd _fwd_kernel[grid]( File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/runtime/jit.py", line 345, in return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/runtime/jit.py", line 662, in run kernel = self.compile( ^^^^^^^^^^^^^ File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/compiler/compiler.py", line 276, in compile module = src.make_ir(options, codegen_fns, context) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/compiler/compiler.py", line 113, in make_ir return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ triton.compiler.errors.CompilationError: at 114:14: off_v = ( bn[:, None] * stride_v_cache_bs + cur_kv_head * stride_v_cache_h + offs_d[None, :] * stride_v_cache_d + (start_n + offs_n[:, None]) % block_size * stride_v_cache_bl) k = tl.load(K_cache + off_k, mask=dim_mask[:, None] & ((start_n + offs_n[None, :]) < cur_batch_ctx_len), other=0.0) # [D,N] qk = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32) # [M,N] qk += tl.dot(q, k) ^ The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/usr/lib/python3.11/asyncio/events.py", line 80, in _run self._context.run(self._callback, *self._args) File "/mnt/disk/AI/constellate-vllm/vllm/engine/async_llm_engine.py", line 59, in _log_task_completion raise AsyncEngineDeadError( vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. 
See stack trace above for theactual cause. INFO 08-05 23:12:45 async_llm_engine.py:181] Aborted request chat-30088efd6e3645e2b07ea083bb9d7446. INFO: 10.3.10.164:65191 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error ERROR: Exception in ASGI application Traceback (most recent call last): File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi result = await app( # type: ignore[func-returns-value] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__ return await self.app(scope, receive, send) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/fastapi/applications.py", line 1054, in __call__ await super().__call__(scope, receive, send) File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/starlette/applications.py", line 123, in __call__ await self.middleware_stack(scope, receive, send) File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/starlette/middleware/errors.py", line 186, in __call__ raise exc File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/starlette/middleware/errors.py", line 164, in __call__ await self.app(scope, receive, _send) File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/starlette/middleware/cors.py", line 85, in __call__ await self.app(scope, receive, send) File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 65, in __call__ await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send) File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app raise exc File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app await app(scope, receive, sender) File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/starlette/routing.py", line 756, in __call__ await self.middleware_stack(scope, receive, send) File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/starlette/routing.py", line 776, in app await route.handle(scope, receive, send) File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/starlette/routing.py", line 297, in handle await self.app(scope, receive, send) File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/starlette/routing.py", line 77, in app await wrap_app_handling_exceptions(app, request)(scope, receive, send) File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app raise exc File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app await app(scope, receive, sender) File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/starlette/routing.py", line 72, in app response = await func(request) ^^^^^^^^^^^^^^^^^^^ File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/fastapi/routing.py", line 278, in app raw_response = await run_endpoint_function( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/fastapi/routing.py", line 191, in run_endpoint_function return await dependant.call(**values) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File 
"/mnt/disk/AI/constellate-vllm/vllm/entrypoints/openai/api_server.py", line 191, in create_chat_completion generator = await openai_serving_chat.create_chat_completion( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/disk/AI/constellate-vllm/vllm/entrypoints/openai/serving_chat.py", line 241, in create_chat_completion generator = await self.chat_completion_full_generator( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/disk/AI/constellate-vllm/vllm/entrypoints/openai/serving_chat.py", line 569, in chat_completion_full_generator async for res in result_generator: File "/mnt/disk/AI/constellate-vllm/vllm/entrypoints/openai/rpc/client.py", line 216, in generate raise request_output triton.compiler.errors.CompilationError: at 114:14: off_v = ( bn[:, None] * stride_v_cache_bs + cur_kv_head * stride_v_cache_h + offs_d[None, :] * stride_v_cache_d + (start_n + offs_n[:, None]) % block_size * stride_v_cache_bl) k = tl.load(K_cache + off_k, mask=dim_mask[:, None] & ((start_n + offs_n[None, :]) < cur_batch_ctx_len), other=0.0) # [D,N] qk = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32) # [M,N] qk += tl.dot(q, k) ^ INFO 08-05 23:12:50 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%. INFO 08-05 23:13:00 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%. ^CINFO 08-05 23:13:06 server.py:207] vLLM ZMQ RPC Server was interrupted. INFO 08-05 23:13:06 api_server.py:396] Gracefully stopping http server ```
K-Mistele commented 4 months ago

(On a Tesla V100, hence fp16 instead of bf16.)

jon-chuang commented 3 months ago

I know it's super long but here's the full trace:

Seems like a Triton kernel issue; it looks fixable. Let me take a look.
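For context on the failure: with --kv-cache-dtype fp8 the paged KV cache is allocated as raw uint8, and the Triton prefix-prefill kernel feeds the loaded block straight into tl.dot next to an fp16 q block, which trips the dtype assert. A fix would need to reinterpret and upcast the cache block first; a rough, hypothetical sketch of that step (the fp8 format constant, block shapes, and scaling handling are assumptions, not the actual kernel code):

```python
# Hypothetical helper showing the dtype handling the prefix-prefill kernel needs;
# in the real kernel this would happen right where the k block is loaded.
import triton
import triton.language as tl

@triton.jit
def _qk_with_fp8_cache(q, k_raw):
    # q:     [BLOCK_M, BLOCK_DMODEL] fp16 query block.
    # k_raw: [BLOCK_DMODEL, BLOCK_N] block loaded from the uint8-backed KV cache.
    # Reinterpret the raw bytes as fp8 (e4m3 assumed here; the exact format depends
    # on the configured kv-cache dtype) and upcast to the query dtype before the matmul.
    k_fp8 = k_raw.to(tl.float8e4nv, bitcast=True)
    k = k_fp8.to(tl.float16)
    # Note: fp8 KV-cache scaling factors are ignored in this sketch.
    return tl.dot(q, k)  # both operands are fp16 now, so the compile-time assert passes
```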

Also: https://github.com/vllm-project/vllm/blob/1f26efbb3a5e6dad0b98421dd697167c42a50629/vllm/attention/backends/xformers.py#L600

Also, is this comment still relevant?

https://github.com/vllm-project/vllm/blob/1f26efbb3a5e6dad0b98421dd697167c42a50629/vllm/worker/model_runner.py#L765

jon-chuang commented 3 months ago

This PR is directly relevant as the solution: https://github.com/vllm-project/vllm/pull/3234 (although the PR is incomplete and incorrect)

Additional context on FP8 KV Cache: https://github.com/vllm-project/vllm/issues/4532 with solution: https://github.com/vllm-project/vllm/pull/4893

jon-chuang commented 3 months ago

Btw, why is this not on the testing path? Where should such a test be included as a regression test?
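One natural home would be a parametrized chunked-prefill test over KV-cache dtypes; a rough sketch, assuming the offline API (the model, prompts, and placement are placeholders):

```python
# Hypothetical regression test: run chunked prefill with and without an fp8 KV cache
# and check that generation completes in both cases.
import pytest
from vllm import LLM, SamplingParams

@pytest.mark.parametrize("kv_cache_dtype", ["auto", "fp8"])
def test_chunked_prefill_with_kv_cache_dtype(kv_cache_dtype):
    llm = LLM(
        model="facebook/opt-125m",  # small placeholder model
        enable_chunked_prefill=True,
        max_num_batched_tokens=2048,
        kv_cache_dtype=kv_cache_dtype,
        enforce_eager=True,
    )
    outputs = llm.generate(["The capital of France is"] * 4,
                           SamplingParams(max_tokens=16))
    assert all(len(o.outputs[0].token_ids) > 0 for o in outputs)
```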