Open xiezhipeng-git opened 1 month ago
@DarkLight1337
Please follow these instructions on how to install for CPU.
> Please follow these instructions on how to install for CPU.
Why install the CPU version? Are you saying vLLM is CPU-only? Can vLLM be used on Windows? And will pip install -v -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu replace my existing torch? That is exactly the problem.
> Please follow these instructions on how to install for CPU.
> Why install the CPU version? Are you saying vLLM is CPU-only? Can vLLM be used on Windows? And will pip install -v -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu replace my existing torch? That is exactly the problem.
Based on your title, you originally have PyTorch for CPU installed and do not want the CUDA version to be installed, so I guess you want the CPU version of vLLM as well. Correct me if I'm wrong.
On Windows, you should be able to use vLLM via WSL if I recall correctly.
> Please follow these instructions on how to install for CPU.
> Why install the CPU version? Are you saying vLLM is CPU-only? Can vLLM be used on Windows? And will pip install -v -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu replace my existing torch? That is exactly the problem.
> Based on your title, you originally have PyTorch for CPU installed and do not want the CUDA version to be installed, so I guess you want the CPU version of vLLM as well. Correct me if I'm wrong.
Of course not. I have said from beginning to end that the CUDA version was replaced with the CPU version, and all I want is the CUDA version. If you can read Chinese, let's communicate in Chinese to avoid further misunderstanding. From start to finish I have been saying that the CUDA torch was replaced by the CPU build, and that ending up with the CPU version is very bad: the only fix is to reinstall the CUDA version, which wastes a lot of time given the network conditions in China. I have never once installed the CPU version of vLLM on purpose; it was a vLLM installation error that caused the CUDA version to be replaced with CPU. The title means the same thing. If possible, I would rather not use WSL any more: WSL has a high chance of leaving the virtual machine unable to start again after it crashes for unknown reasons, at which point the whole VM is unusable. The WSL team could not help either, and over 100 GB of data had to be deleted. A complete waste of time.
> Please follow these instructions on how to install for CPU.
> Why install the CPU version? Are you saying vLLM is CPU-only? Can vLLM be used on Windows? And will pip install -v -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu replace my existing torch? That is exactly the problem.
> Based on your title, you originally have PyTorch for CPU installed and do not want the CUDA version to be installed, so I guess you want the CPU version of vLLM as well. Correct me if I'm wrong.
> Of course not. I have said from beginning to end that the CUDA version was replaced with the CPU version, and all I want is the CUDA version. If you can read Chinese, let's communicate in Chinese to avoid further misunderstanding. Ending up with the CPU version is very bad: the only fix is to reinstall the CUDA version, which wastes a lot of time given the network conditions in China.
Oh sorry, I somehow read it the other way round. vLLM only officially supports Linux OS so it might not be able to detect your CUDA from native Windows. I suggest using vLLM through WSL.
> Please follow these instructions on how to install for CPU.
> Why install the CPU version? Are you saying vLLM is CPU-only? Can vLLM be used on Windows? And will pip install -v -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu replace my existing torch? That is exactly the problem.
> Based on your title, you originally have PyTorch for CPU installed and do not want the CUDA version to be installed, so I guess you want the CPU version of vLLM as well. Correct me if I'm wrong.
> Of course not. I have said from beginning to end that the CUDA version was replaced with the CPU version, and all I want is the CUDA version. If you can read Chinese, let's communicate in Chinese to avoid further misunderstanding. Ending up with the CPU version is very bad: the only fix is to reinstall the CUDA version, which wastes a lot of time given the network conditions in China.
> Oh sorry, I somehow read it the other way round. vLLM only officially supports Linux OS so it might not be able to detect your CUDA from native Windows. I suggest using vLLM through WSL.
WSL has a fatal flaw, and if possible I would rather not use it at all. Your issue of not detecting torch CUDA sounds like the pip and cmake commands are not handled properly. You could learn from flash-attention: it can now be installed on Windows because it builds without an isolated environment, so it can find the installed torch and compile against it. If that is the cause, it should not be hard to fix.
@dtrifiro what's your opinion on supporting Windows? Is it feasible at this stage?
@DarkLight1337 @dtrifiro Also, the reason the torch version gets replaced with the CPU build is probably that your torch requirement is written as ==2.4.0. Changing it to >=2.4.0 should avoid torch being automatically reinstalled. After that, you could look into whether Windows can be officially supported.
> @DarkLight1337 @dtrifiro Also, the reason the torch version gets replaced with the CPU build is probably that your torch requirement is written as ==2.4.0. Changing it to >=2.4.0 should avoid torch being automatically reinstalled. After that, you could look into whether Windows can be officially supported.
From my understanding, PyTorch installation should be able to automatically choose CPU/CUDA based on your machine. What happens if you just install torch==2.4.0 directly?
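For reference, a minimal way to check which torch build actually ended up installed (plain PyTorch attributes, nothing vLLM-specific):

```python
import torch

# Wheels from the PyTorch index usually carry a "+cpu" or "+cu121"-style local version suffix.
print(torch.__version__)
# The CUDA toolkit version the wheel was built against, or None for a CPU-only build.
print(torch.version.cuda)
# True only with a CUDA build plus a visible GPU and working driver.
print(torch.cuda.is_available())
```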
The network situation in China is not good, and I don't want to keep wrestling with torch versions. Earlier, after pip install vllm succeeded, I discovered this problem of torch being replaced with the CPU build. After reinstalling the CUDA build of torch I ran vLLM once, and it failed with the error below (I'm not sure whether this error was later superseded):
WARNING 10-24 21:42:41 _custom_ops.py:19] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
ERROR 10-24 21:42:49 registry.py:267] Error in inspecting model architecture 'Qwen2ForCausalLM'
ERROR 10-24 21:42:49 registry.py:267] Traceback (most recent call last):
ERROR 10-24 21:42:49 registry.py:267] File "d:\my\env\python3.10.10\lib\site-packages\vllm\model_executor\models\registry.py", line 429, in _run_in_subprocess
ERROR 10-24 21:42:49 registry.py:267] returned.check_returncode()
ERROR 10-24 21:42:49 registry.py:267] File "d:\my\env\python3.10.10\lib\subprocess.py", line 457, in check_returncode
ERROR 10-24 21:42:49 registry.py:267] raise CalledProcessError(self.returncode, self.args, self.stdout,
ERROR 10-24 21:42:49 registry.py:267] subprocess.CalledProcessError: Command '['d:\my\env\python3.10.10\python.exe', '-m', 'vllm.model_executor.models.registry']' returned non-zero exit status 1.
ERROR 10-24 21:42:49 registry.py:267]
ERROR 10-24 21:42:49 registry.py:267] The above exception was the direct cause of the following exception:
ERROR 10-24 21:42:49 registry.py:267]
ERROR 10-24 21:42:49 registry.py:267] Traceback (most recent call last):
ERROR 10-24 21:42:49 registry.py:267] File "d:\my\env\python3.10.10\lib\site-packages\vllm\model_executor\models\registry.py", line 265, in _try_inspect_model_cls
ERROR 10-24 21:42:49 registry.py:267] return model.inspect_model_cls()
ERROR 10-24 21:42:49 registry.py:267] File "d:\my\env\python3.10.10\lib\site-packages\vllm\model_executor\models\registry.py", line 227, in inspect_model_cls
ERROR 10-24 21:42:49 registry.py:267] return _run_in_subprocess(
ERROR 10-24 21:42:49 registry.py:267] File "d:\my\env\python3.10.10\lib\site-packages\vllm\model_executor\models\registry.py", line 432, in _run_in_subprocess
ERROR 10-24 21:42:49 registry.py:267] raise RuntimeError(f"Error raised in subprocess:\n"
ERROR 10-24 21:42:49 registry.py:267] RuntimeError: Error raised in subprocess:
ERROR 10-24 21:42:49 registry.py:267] d:\my\env\python3.10.10\lib\runpy.py:126: RuntimeWarning: 'vllm.model_executor.models.registry' found in sys.modules after import of package 'vllm.model_executor.models', but prior to execution of 'vllm.model_executor.models.registry'; this may result in unpredictable behaviour
ERROR 10-24 21:42:49 registry.py:267] warn(RuntimeWarning(msg))
ERROR 10-24 21:42:49 registry.py:267] Traceback (most recent call last):
ERROR 10-24 21:42:49 registry.py:267] File "d:\my\env\python3.10.10\lib\runpy.py", line 196, in _run_module_as_main
ERROR 10-24 21:42:49 registry.py:267] return _run_code(code, main_globals, None,
ERROR 10-24 21:42:49 registry.py:267] File "d:\my\env\python3.10.10\lib\runpy.py", line 86, in _run_code
ERROR 10-24 21:42:49 registry.py:267] exec(code, run_globals)
ERROR 10-24 21:42:49 registry.py:267] File "d:\my\env\python3.10.10\lib\site-packages\vllm\model_executor\models\registry.py", line 453, in
File d:\my\env\python3.10.10\lib\site-packages\vllm\entrypoints\llm.py:177, in LLM.__init__(self, model, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, cpu_offload_gb, enforce_eager, max_context_len_to_capture, max_seq_len_to_capture, disable_custom_all_reduce, disable_async_output_proc, mm_processor_kwargs, **kwargs)
    152 kwargs["disable_log_stats"] = True
    154 engine_args = EngineArgs(
    155     model=model,
    156     tokenizer=tokenizer,
    (...)
    175     **kwargs,
    176 )
--> 177 self.llm_engine = LLMEngine.from_engine_args(
    178     engine_args, usage_context=UsageContext.LLM_CLASS)
    179 self.request_counter = Counter()
File d:\my\env\python3.10.10\lib\site-packages\vllm\engine\llm_engine.py:570, in LLMEngine.from_engine_args(cls, engine_args, usage_context, stat_loggers)
    568 """Creates an LLM engine from the engine arguments."""
    569 # Create the engine configs.
--> 570 engine_config = engine_args.create_engine_config()
    571 executor_class = cls._get_executor_cls(engine_config)
    572 # Create the LLM engine.
File d:\my\env\python3.10.10\lib\site-packages\vllm\engine\arg_utils.py:903, in EngineArgs.create_engine_config(self)
    898 assert self.cpu_offload_gb >= 0, (
    899     "CPU offload space must be non-negative"
    900     f", but got {self.cpu_offload_gb}")
    902 device_config = DeviceConfig(device=self.device)
--> 903 model_config = self.create_model_config()
    905 if model_config.is_multimodal_model:
    906     if self.enable_prefix_caching:
File d:\my\env\python3.10.10\lib\site-packages\vllm\engine\arg_utils.py:839, in EngineArgs.create_model_config(self)
    838 def create_model_config(self) -> ModelConfig:
--> 839     return ModelConfig(
    840         model=self.model,
    841         # We know this is not None because we set it in __post_init__
    842         tokenizer=cast(str, self.tokenizer),
    843         tokenizer_mode=self.tokenizer_mode,
    844         trust_remote_code=self.trust_remote_code,
    845         dtype=self.dtype,
    846         seed=self.seed,
    847         revision=self.revision,
    848         code_revision=self.code_revision,
    849         rope_scaling=self.rope_scaling,
    850         rope_theta=self.rope_theta,
    851         tokenizer_revision=self.tokenizer_revision,
    852         max_model_len=self.max_model_len,
    853         quantization=self.quantization,
    854         quantization_param_path=self.quantization_param_path,
    855         enforce_eager=self.enforce_eager,
    856         max_context_len_to_capture=self.max_context_len_to_capture,
    857         max_seq_len_to_capture=self.max_seq_len_to_capture,
    858         max_logprobs=self.max_logprobs,
    859         disable_sliding_window=self.disable_sliding_window,
    860         skip_tokenizer_init=self.skip_tokenizer_init,
    861         served_model_name=self.served_model_name,
    862         limit_mm_per_prompt=self.limit_mm_per_prompt,
    863         use_async_output_proc=not self.disable_async_output_proc,
    864         override_neuron_config=self.override_neuron_config,
    865         config_format=self.config_format,
    866         mm_processor_kwargs=self.mm_processor_kwargs,
    867     )
File d:\my\env\python3.10.10\lib\site-packages\vllm\config.py:200, in ModelConfig.__init__(self, model, tokenizer, tokenizer_mode, trust_remote_code, dtype, seed, revision, code_revision, rope_scaling, rope_theta, tokenizer_revision, max_model_len, spec_target_max_model_len, quantization, quantization_param_path, enforce_eager, max_context_len_to_capture, max_seq_len_to_capture, max_logprobs, disable_sliding_window, skip_tokenizer_init, served_model_name, limit_mm_per_prompt, use_async_output_proc, override_neuron_config, config_format, mm_processor_kwargs)
    192 self.max_model_len = _get_and_verify_max_len(
    193     hf_config=self.hf_text_config,
    194     max_model_len=max_model_len,
    195     disable_sliding_window=self.disable_sliding_window,
    196     sliding_window_len=self.get_hf_config_sliding_window(),
    197     spec_target_max_model_len=spec_target_max_model_len)
    198 self.served_model_name = get_served_model_name(model,
    199                                                served_model_name)
--> 200 self.multimodal_config = self._init_multimodal_config(
    201     limit_mm_per_prompt)
    202 if not self.skip_tokenizer_init:
    203     self._verify_tokenizer_mode()
File d:\my\env\python3.10.10\lib\site-packages\vllm\config.py:219, in ModelConfig._init_multimodal_config(self, limit_mm_per_prompt)
    215 def _init_multimodal_config(
    216         self, limit_mm_per_prompt: Optional[Mapping[str, int]]
    217 ) -> Optional["MultiModalConfig"]:
    218     architectures = getattr(self.hf_config, "architectures", [])
--> 219     if ModelRegistry.is_multimodal_model(architectures):
    220         return MultiModalConfig(limit_per_prompt=limit_mm_per_prompt or {})
    222     if limit_mm_per_prompt:
File d:\my\env\python3.10.10\lib\site-packages\vllm\model_executor\models\registry.py:387, in _ModelRegistry.is_multimodal_model(self, architectures)
    383 def is_multimodal_model(
    384     self,
    385     architectures: Union[str, List[str]],
    386 ) -> bool:
--> 387     return self.inspect_model_cls(architectures).supports_multimodal
File d:\my\env\python3.10.10\lib\site-packages\vllm\model_executor\models\registry.py:356, in _ModelRegistry.inspect_model_cls(self, architectures)
    353 if model_info is not None:
    354     return model_info
--> 356 return self._raise_for_unsupported(architectures)
File d:\my\env\python3.10.10\lib\site-packages\vllm\model_executor\models\registry.py:317, in _ModelRegistry._raise_for_unsupported(self, architectures)
    314 def _raise_for_unsupported(self, architectures: List[str]):
    315     all_supported_archs = self.get_supported_archs()
--> 317     raise ValueError(
    318         f"Model architectures {architectures} are not supported for now. "
    319         f"Supported architectures: {all_supported_archs}")
ValueError: Model architectures ['Qwen2ForCausalLM'] are not supported for now. Supported architectures: ['AquilaModel', 'AquilaForCausalLM', 'ArcticForCausalLM', 'BaiChuanForCausalLM', 'BaichuanForCausalLM', 'BloomForCausalLM', 'CohereForCausalLM', 'DbrxForCausalLM', 'DeciLMForCausalLM', 'DeepseekForCausalLM', 'DeepseekV2ForCausalLM', 'ExaoneForCausalLM', 'FalconForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTJForCausalLM', 'GPTNeoXForCausalLM', 'GraniteForCausalLM', 'GraniteMoeForCausalLM', 'InternLMForCausalLM', 'InternLM2ForCausalLM', 'JAISLMHeadModel', 'JambaForCausalLM', 'LlamaForCausalLM', 'LLaMAForCausalLM', 'MambaForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'QuantMixtralForCausalLM', 'MptForCausalLM', 'MPTForCausalLM', 'MiniCPMForCausalLM', 'MiniCPM3ForCausalLM', 'NemotronForCausalLM', 'OlmoForCausalLM', 'OlmoeForCausalLM', 'OPTForCausalLM', 'OrionForCausalLM', 'PersimmonForCausalLM', 'PhiForCausalLM', 'Phi3ForCausalLM', 'Phi3SmallForCausalLM', 'PhiMoEForCausalLM', 'Qwen2ForCausalLM', 'Qwen2MoeForCausalLM', 'RWForCausalLM', 'StableLMEpochForCausalLM', 'StableLmForCausalLM', 'Starcoder2ForCausalLM', 'SolarForCausalLM', 'XverseForCausalLM', 'BartModel', 'BartForConditionalGeneration', 'Gemma2Model', 'MistralModel', 'Qwen2ForRewardModel', 'Phi3VForCausalLM', 'Blip2ForConditionalGeneration', 'ChameleonForConditionalGeneration', 'ChatGLMModel', 'ChatGLMForConditionalGeneration', 'FuyuForCausalLM', 'InternVLChatModel', 'LlavaForConditionalGeneration', 'LlavaNextForConditionalGeneration', 'LlavaNextVideoForConditionalGeneration', 'LlavaOnevisionForConditionalGeneration', 'MiniCPMV', 'MolmoForCausalLM', 'NVLM_D', 'PaliGemmaForConditionalGeneration', 'PixtralForConditionalGeneration', 'QWenLMHeadModel', 'Qwen2VLForConditionalGeneration', 'UltravoxModel', 'MllamaForConditionalGeneration', 'EAGLEModel', 'MedusaModel', 'MLPSpeculatorPreTrainedModel']
ERROR 10-24 21:42:49 registry.py:267]   File "d:\my\env\python3.10.10\lib\site-packages\vllm\model_executor\models\registry.py", line 448, in _run
ERROR 10-24 21:42:49 registry.py:267]     with open(output_file, "wb") as f:
ERROR 10-24 21:42:49 registry.py:267] PermissionError: [Errno 13] Permission denied: 'C:\Users\Admin\AppData\Local\Temp\tmp6cx7k05c'
It looks like this error I've encountered before: https://stackoverflow.com/questions/23212435/permission-denied-to-write-to-my-temporary-file. It can be solved by writing to a temporary directory instead, see if I can fix this real quick.
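For context, a minimal sketch of the kind of workaround being described, assuming the subprocess has to reopen the output file by name (the file name is made up here; the actual vLLM change may look different):

```python
import os
import tempfile

# On Windows, a NamedTemporaryFile that the parent keeps open generally cannot be
# reopened by name from another process, which surfaces as
# "PermissionError: [Errno 13] Permission denied".
# Writing a plain file inside a TemporaryDirectory avoids that while still
# cleaning up automatically when the block exits.
with tempfile.TemporaryDirectory() as tmpdir:
    output_file = os.path.join(tmpdir, "registry_output.bin")  # hypothetical name
    with open(output_file, "wb") as f:
        f.write(b"payload")  # placeholder for whatever the subprocess would write
    with open(output_file, "rb") as f:
        data = f.read()
```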
Hi @DarkLight1337 , can you take a look at this installation issue: #9180 thanks in advance
> ERROR 10-24 21:42:49 registry.py:267]   File "d:\my\env\python3.10.10\lib\site-packages\vllm\model_executor\models\registry.py", line 448, in _run
> ERROR 10-24 21:42:49 registry.py:267]     with open(output_file, "wb") as f:
> ERROR 10-24 21:42:49 registry.py:267] PermissionError: [Errno 13] Permission denied: 'C:\Users\Admin\AppData\Local\Temp\tmp6cx7k05c'
> It looks like this error I've encountered before: https://stackoverflow.com/questions/23212435/permission-denied-to-write-to-my-temporary-file. It can be solved by writing to a temporary directory instead, see if I can fix this real quick.
Fixed. Feel free to reopen if you still encounter issues.
@DarkLight1337 Now there is a new error:
TypeError Traceback (most recent call last)
Cell In[2], [line 5](vscode-notebook-cell:?execution_count=2&line=5)
[1](vscode-notebook-cell:?execution_count=2&line=1) from vllm import LLM, SamplingParams
[3](vscode-notebook-cell:?execution_count=2&line=3) # model_dir='Qwen2.5-14B-Instruct-GPTQ-Int4'
----> [5](vscode-notebook-cell:?execution_count=2&line=5) llm = LLM(model=model_dir,enforce_eager=True)
[6](vscode-notebook-cell:?execution_count=2&line=6) sampling_params = SamplingParams( top_p=0.9, max_tokens=512,top_k=10)
[8](vscode-notebook-cell:?execution_count=2&line=8) prompt = "1+1等于几"
File d:\my\env\python3.10.10\lib\site-packages\vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg\vllm\utils.py:1023, in deprecate_args.<locals>.wrapper.<locals>.inner(*args, **kwargs)
[1016](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/utils.py:1016) msg += f" {additional_message}"
[1018](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/utils.py:1018) warnings.warn(
[1019](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/utils.py:1019) DeprecationWarning(msg),
[1020](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/utils.py:1020) stacklevel=3, # The inner function takes up one level
[1021](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/utils.py:1021) )
-> [1023](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/utils.py:1023) return fn(*args, **kwargs)
File d:\my\env\python3.10.10\lib\site-packages\vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg\vllm\entrypoints\llm.py:198, in LLM.__init__(self, model, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, cpu_offload_gb, enforce_eager, max_context_len_to_capture, max_seq_len_to_capture, disable_custom_all_reduce, disable_async_output_proc, mm_processor_kwargs, task, **kwargs)
[172](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/entrypoints/llm.py:172) kwargs["disable_log_stats"] = True
[174](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/entrypoints/llm.py:174) engine_args = EngineArgs(
[175](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/entrypoints/llm.py:175) model=model,
[176](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/entrypoints/llm.py:176) task=task,
(...)
[196](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/entrypoints/llm.py:196) **kwargs,
[197](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/entrypoints/llm.py:197) )
--> [198](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/entrypoints/llm.py:198) self.llm_engine = LLMEngine.from_engine_args(
[199](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/entrypoints/llm.py:199) engine_args, usage_context=UsageContext.LLM_CLASS)
[200](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/entrypoints/llm.py:200) self.request_counter = Counter()
File d:\my\env\python3.10.10\lib\site-packages\vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg\vllm\engine\llm_engine.py:582, in LLMEngine.from_engine_args(cls, engine_args, usage_context, stat_loggers)
[580](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:580) executor_class = cls._get_executor_cls(engine_config)
[581](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:581) # Create the LLM engine.
--> [582](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:582) engine = cls(
[583](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:583) **engine_config.to_dict(),
[584](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:584) executor_class=executor_class,
[585](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:585) log_stats=not engine_args.disable_log_stats,
[586](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:586) usage_context=usage_context,
[587](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:587) stat_loggers=stat_loggers,
[588](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:588) )
[590](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:590) return engine
File d:\my\env\python3.10.10\lib\site-packages\vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg\vllm\engine\llm_engine.py:341, in LLMEngine.__init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, speculative_config, decoding_config, observability_config, prompt_adapter_config, executor_class, log_stats, usage_context, stat_loggers, input_registry, use_cached_outputs)
[337](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:337) self.input_registry = input_registry
[338](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:338) self.input_processor = input_registry.create_input_processor(
[339](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:339) model_config)
--> [341](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:341) self.model_executor = executor_class(
[342](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:342) model_config=model_config,
[343](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:343) cache_config=cache_config,
[344](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:344) parallel_config=parallel_config,
[345](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:345) scheduler_config=scheduler_config,
[346](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:346) device_config=device_config,
[347](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:347) lora_config=lora_config,
[348](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:348) speculative_config=speculative_config,
[349](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:349) load_config=load_config,
[350](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:350) prompt_adapter_config=prompt_adapter_config,
[351](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:351) observability_config=self.observability_config,
[352](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:352) )
[354](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:354) if self.model_config.task != "embedding":
[355](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:355) self._initialize_kv_caches()
File d:\my\env\python3.10.10\lib\site-packages\vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg\vllm\executor\executor_base.py:47, in ExecutorBase.__init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, speculative_config, prompt_adapter_config, observability_config)
[45](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/executor_base.py:45) self.prompt_adapter_config = prompt_adapter_config
[46](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/executor_base.py:46) self.observability_config = observability_config
---> [47](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/executor_base.py:47) self._init_executor()
File d:\my\env\python3.10.10\lib\site-packages\vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg\vllm\executor\gpu_executor.py:38, in GPUExecutor._init_executor(self)
[33](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:33) """Initialize the worker and load the model.
[34](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:34) """
[35](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:35) assert self.parallel_config.world_size == 1, (
[36](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:36) "GPUExecutor only supports single GPU.")
---> [38](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:38) self.driver_worker = self._create_worker()
[39](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:39) self.driver_worker.init_device()
[40](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:40) self.driver_worker.load_model()
File d:\my\env\python3.10.10\lib\site-packages\vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg\vllm\executor\gpu_executor.py:105, in GPUExecutor._create_worker(self, local_rank, rank, distributed_init_method)
[101](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:101) def _create_worker(self,
[102](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:102) local_rank: int = 0,
[103](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:103) rank: int = 0,
[104](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:104) distributed_init_method: Optional[str] = None):
--> [105](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:105) return create_worker(**self._get_create_worker_kwargs(
[106](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:106) local_rank=local_rank,
[107](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:107) rank=rank,
[108](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:108) distributed_init_method=distributed_init_method))
File d:\my\env\python3.10.10\lib\site-packages\vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg\vllm\executor\gpu_executor.py:24, in create_worker(worker_module_name, worker_class_name, worker_class_fn, **kwargs)
[16](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:16) def create_worker(worker_module_name: str, worker_class_name: str,
[17](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:17) worker_class_fn: Optional[Callable[[], Type[WorkerBase]]],
[18](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:18) **kwargs):
[19](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:19) wrapper = WorkerWrapperBase(
[20](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:20) worker_module_name=worker_module_name,
[21](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:21) worker_class_name=worker_class_name,
[22](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:22) worker_class_fn=worker_class_fn,
[23](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:23) )
---> [24](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:24) wrapper.init_worker(**kwargs)
[25](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:25) return wrapper.worker
File d:\my\env\python3.10.10\lib\site-packages\vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg\vllm\worker\worker_base.py:449, in WorkerWrapperBase.init_worker(self, *args, **kwargs)
[446](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker_base.py:446) mod = importlib.import_module(self.worker_module_name)
[447](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker_base.py:447) worker_class = getattr(mod, self.worker_class_name)
--> [449](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker_base.py:449) self.worker = worker_class(*args, **kwargs)
[450](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker_base.py:450) assert self.worker is not None
File d:\my\env\python3.10.10\lib\site-packages\vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg\vllm\worker\worker.py:99, in Worker.__init__(self, model_config, parallel_config, scheduler_config, device_config, cache_config, load_config, local_rank, rank, distributed_init_method, lora_config, speculative_config, prompt_adapter_config, is_driver_worker, model_runner_cls, observability_config)
[97](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:97) elif self._is_encoder_decoder_model():
[98](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:98) ModelRunnerClass = EncoderDecoderModelRunner
---> [99](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:99) self.model_runner: GPUModelRunnerBase = ModelRunnerClass(
[100](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:100) model_config,
[101](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:101) parallel_config,
[102](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:102) scheduler_config,
[103](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:103) device_config,
[104](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:104) cache_config,
[105](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:105) load_config=load_config,
[106](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:106) lora_config=self.lora_config,
[107](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:107) kv_cache_dtype=self.cache_config.cache_dtype,
[108](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:108) is_driver_worker=is_driver_worker,
[109](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:109) prompt_adapter_config=prompt_adapter_config,
[110](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:110) observability_config=observability_config,
[111](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:111) **speculative_args,
[112](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:112) )
[113](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:113) # Uninitialized cache engine. Will be initialized by
[114](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:114) # initialize_cache.
[115](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:115) self.cache_engine: List[CacheEngine]
File d:\my\env\python3.10.10\lib\site-packages\vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg\vllm\worker\model_runner.py:1013, in GPUModelRunnerBase.__init__(self, model_config, parallel_config, scheduler_config, device_config, cache_config, load_config, lora_config, kv_cache_dtype, is_driver_worker, prompt_adapter_config, return_hidden_states, observability_config, input_registry, mm_registry)
[1008](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/model_runner.py:1008) num_attn_heads = self.model_config.get_num_attention_heads(
[1009](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/model_runner.py:1009) self.parallel_config)
[1010](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/model_runner.py:1010) needs_attn_backend = (num_attn_heads != 0
[1011](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/model_runner.py:1011) or self.model_config.is_attention_free)
-> [1013](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/model_runner.py:1013) self.attn_backend = get_attn_backend(
[1014](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/model_runner.py:1014) self.model_config.get_head_size(),
[1015](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/model_runner.py:1015) self.model_config.dtype,
[1016](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/model_runner.py:1016) self.kv_cache_dtype,
[1017](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/model_runner.py:1017) self.block_size,
[1018](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/model_runner.py:1018) self.model_config.is_attention_free,
[1019](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/model_runner.py:1019) ) if needs_attn_backend else None
[1020](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/model_runner.py:1020) if self.attn_backend:
[1021](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/model_runner.py:1021) self.attn_state = self.attn_backend.get_state_cls()(
[1022](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/model_runner.py:1022) weakref.proxy(self))
File d:\my\env\python3.10.10\lib\site-packages\vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg\vllm\attention\selector.py:120, in get_attn_backend(head_size, dtype, kv_cache_dtype, block_size, is_attention_free, is_blocksparse)
[118](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/attention/selector.py:118) if backend == _Backend.XFORMERS:
[119](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/attention/selector.py:119) logger.info("Using XFormers backend.")
--> [120](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/attention/selector.py:120) from vllm.attention.backends.xformers import ( # noqa: F401
[121](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/attention/selector.py:121) XFormersBackend)
[122](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/attention/selector.py:122) return XFormersBackend
[123](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/attention/selector.py:123) elif backend == _Backend.ROCM_FLASH:
File d:\my\env\python3.10.10\lib\site-packages\vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg\vllm\attention\backends\xformers.py:6
[3](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/attention/backends/xformers.py:3) from typing import Any, Dict, List, Optional, Tuple, Type
[5](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/attention/backends/xformers.py:5) import torch
----> [6](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/attention/backends/xformers.py:6) from xformers import ops as xops
[7](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/attention/backends/xformers.py:7) from xformers.ops.fmha.attn_bias import (AttentionBias,
[8](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/attention/backends/xformers.py:8) BlockDiagonalCausalMask,
[9](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/attention/backends/xformers.py:9) BlockDiagonalMask,
[10](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/attention/backends/xformers.py:10) LowerTriangularMaskWithTensorBias)
[12](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/attention/backends/xformers.py:12) from vllm.attention.backends.abstract import (AttentionBackend, AttentionImpl,
[13](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/attention/backends/xformers.py:13) AttentionMetadata, AttentionType)
File d:\my\env\python3.10.10\lib\site-packages\xformers\ops\__init__.py:8
[1](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:1) # Copyright (c) Facebook, Inc. and its affiliates. All rights reserved.
[2](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:2) #
[3](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:3) # This source code is licensed under the BSD license found in the
[4](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:4) # LICENSE file in the root directory of this source tree.
[6](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:6) import torch
----> [8](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:8) from .fmha import (
[9](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:9) AttentionBias,
[10](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:10) AttentionOp,
[11](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:11) AttentionOpBase,
[12](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:12) LowerTriangularMask,
[13](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:13) MemoryEfficientAttentionCkOp,
[14](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:14) MemoryEfficientAttentionCutlassFwdFlashBwOp,
[15](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:15) MemoryEfficientAttentionCutlassOp,
[16](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:16) MemoryEfficientAttentionFlashAttentionOp,
[17](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:17) MemoryEfficientAttentionSplitKCkOp,
[18](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:18) memory_efficient_attention,
[19](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:19) memory_efficient_attention_backward,
[20](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:20) memory_efficient_attention_forward,
[21](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:21) memory_efficient_attention_forward_requires_grad,
[22](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:22) )
[23](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:23) from .indexing import index_select_cat, scaled_index_add
[24](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:24) from .ipc import init_ipc
File d:\my\env\python3.10.10\lib\site-packages\xformers\ops\fmha\__init__.py:10
[6](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:6) from typing import Any, List, Optional, Sequence, Tuple, Type, Union, cast
[8](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:8) import torch
---> [10](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:10) from . import (
[11](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:11) attn_bias,
[12](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:12) ck,
[13](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:13) ck_decoder,
[14](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:14) ck_splitk,
[15](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:15) cutlass,
[16](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:16) flash,
[17](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:17) flash3,
[18](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:18) triton_splitk,
[19](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:19) )
[20](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:20) from .attn_bias import VARLEN_BIASES, AttentionBias, LowerTriangularMask
[21](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:21) from .common import (
[22](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:22) AttentionBwOpBase,
[23](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:23) AttentionFwOpBase,
(...)
[29](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:29) bmk2bmhk,
[30](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:30) )
File d:\my\env\python3.10.10\lib\site-packages\xformers\ops\fmha\triton_splitk.py:110
[94](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/triton_splitk.py:94) return (
[95](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/triton_splitk.py:95) super(InputsFp8, self).nbytes
[96](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/triton_splitk.py:96) + (
(...)
[105](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/triton_splitk.py:105) )
[106](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/triton_splitk.py:106) )
[109](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/triton_splitk.py:109) if TYPE_CHECKING or _is_triton_available():
--> [110](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/triton_splitk.py:110) from ._triton.splitk_kernels import _fwd_kernel_splitK, _splitK_reduce
[111](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/triton_splitk.py:111) else:
[112](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/triton_splitk.py:112) _fwd_kernel_splitK = None
File d:\my\env\python3.10.10\lib\site-packages\xformers\ops\fmha\_triton\splitk_kernels.py:632
[629](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:629) if sys.version_info >= (3, 9):
[630](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:630) # unroll_varargs requires Python 3.9+
[631](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:631) for num_groups in [1, 2, 4, 8]:
--> [632](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:632) _fwd_kernel_splitK_autotune[num_groups] = autotune_kernel(
[633](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:633) _get_splitk_kernel(num_groups)
[634](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:634) )
[636](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:636) def get_autotuner_cache(
[637](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:637) num_groups: int,
[638](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:638) ) -> Dict[Tuple[Union[int, str]], triton.Config]:
[639](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:639) """Returns a triton.runtime.autotuner.AutoTuner.cache object, which
[640](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:640) represents mappings from kernel autotune keys (tuples describing kernel inputs)
[641](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:641) to triton.Config
[642](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:642) """
File d:\my\env\python3.10.10\lib\site-packages\xformers\ops\fmha\_triton\splitk_kernels.py:614, in autotune_kernel(kernel)
[604](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:604) WARPS_VALUES = [1, 2, 4]
[606](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:606) TRITON_CONFIGS = [
[607](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:607) gen_config(block_m, block_n, stages, warps)
[608](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:608) for block_m in BLOCK_M_VALUES
(...)
[611](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:611) for warps in WARPS_VALUES
[612](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:612) ]
--> [614](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:614) kernel = triton.autotune(
[615](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:615) configs=TRITON_CONFIGS,
[616](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:616) key=AUTOTUNER_KEY,
[617](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:617) use_cuda_graph=True,
[618](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:618) )(kernel)
[619](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:619) return kernel
TypeError: autotune() got an unexpected keyword argument 'use_cuda_graph'
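The TypeError above shows this xformers build passing use_cuda_graph to triton.autotune while the installed triton does not accept that argument, which points to a triton/xformers version mismatch. A minimal way to see which versions are actually installed (plain version attributes, nothing vLLM-specific):

```python
import triton
import xformers

# Compare this pairing against what the installed xformers build expects;
# mismatched versions are a common cause of "unexpected keyword argument"
# errors raised during the xformers import, like the one above.
print("triton:", triton.__version__)
print("xformers:", xformers.__version__)
```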
[17](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:17) worker_class_fn: Optional[Callable[[], Type[WorkerBase]]], [18](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:18) **kwargs): [19](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:19) wrapper = WorkerWrapperBase( [20](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:20) worker_module_name=worker_module_name, [21](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:21) worker_class_name=worker_class_name, [22](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:22) worker_class_fn=worker_class_fn, [23](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:23) ) ---> [24](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:24) wrapper.init_worker(**kwargs) [25](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:25) return wrapper.worker File d:\my\env\python3.10.10\lib\site-packages\vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg\vllm\worker\worker_base.py:449, in WorkerWrapperBase.init_worker(self, *args, **kwargs) [446](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker_base.py:446) mod = importlib.import_module(self.worker_module_name) [447](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker_base.py:447) worker_class = getattr(mod, self.worker_class_name) --> [449](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker_base.py:449) self.worker = worker_class(*args, **kwargs) [450](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker_base.py:450) assert self.worker is not None File d:\my\env\python3.10.10\lib\site-packages\vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg\vllm\worker\worker.py:99, in Worker.__init__(self, model_config, parallel_config, scheduler_config, device_config, cache_config, load_config, local_rank, rank, distributed_init_method, lora_config, speculative_config, prompt_adapter_config, is_driver_worker, model_runner_cls, observability_config) [97](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:97) elif self._is_encoder_decoder_model(): [98](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:98) ModelRunnerClass = EncoderDecoderModelRunner ---> [99](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:99) self.model_runner: GPUModelRunnerBase = ModelRunnerClass( 
[100](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:100) model_config, [101](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:101) parallel_config, [102](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:102) scheduler_config, [103](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:103) device_config, [104](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:104) cache_config, [105](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:105) load_config=load_config, [106](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:106) lora_config=self.lora_config, [107](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:107) kv_cache_dtype=self.cache_config.cache_dtype, [108](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:108) is_driver_worker=is_driver_worker, [109](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:109) prompt_adapter_config=prompt_adapter_config, [110](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:110) observability_config=observability_config, [111](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:111) **speculative_args, [112](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:112) ) [113](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:113) # Uninitialized cache engine. Will be initialized by [114](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:114) # initialize_cache. 
[115](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:115) self.cache_engine: List[CacheEngine] File d:\my\env\python3.10.10\lib\site-packages\vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg\vllm\worker\model_runner.py:1013, in GPUModelRunnerBase.__init__(self, model_config, parallel_config, scheduler_config, device_config, cache_config, load_config, lora_config, kv_cache_dtype, is_driver_worker, prompt_adapter_config, return_hidden_states, observability_config, input_registry, mm_registry) [1008](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/model_runner.py:1008) num_attn_heads = self.model_config.get_num_attention_heads( [1009](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/model_runner.py:1009) self.parallel_config) [1010](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/model_runner.py:1010) needs_attn_backend = (num_attn_heads != 0 [1011](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/model_runner.py:1011) or self.model_config.is_attention_free) -> [1013](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/model_runner.py:1013) self.attn_backend = get_attn_backend( [1014](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/model_runner.py:1014) self.model_config.get_head_size(), [1015](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/model_runner.py:1015) self.model_config.dtype, [1016](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/model_runner.py:1016) self.kv_cache_dtype, [1017](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/model_runner.py:1017) self.block_size, [1018](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/model_runner.py:1018) self.model_config.is_attention_free, [1019](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/model_runner.py:1019) ) if needs_attn_backend else None [1020](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/model_runner.py:1020) if self.attn_backend: [1021](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/model_runner.py:1021) self.attn_state = self.attn_backend.get_state_cls()( [1022](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/model_runner.py:1022) weakref.proxy(self)) File d:\my\env\python3.10.10\lib\site-packages\vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg\vllm\attention\selector.py:120, in get_attn_backend(head_size, dtype, kv_cache_dtype, block_size, is_attention_free, is_blocksparse) [118](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/attention/selector.py:118) if backend == _Backend.XFORMERS: 
[119](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/attention/selector.py:119) logger.info("Using XFormers backend.") --> [120](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/attention/selector.py:120) from vllm.attention.backends.xformers import ( # noqa: F401 [121](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/attention/selector.py:121) XFormersBackend) [122](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/attention/selector.py:122) return XFormersBackend [123](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/attention/selector.py:123) elif backend == _Backend.ROCM_FLASH: File d:\my\env\python3.10.10\lib\site-packages\vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg\vllm\attention\backends\xformers.py:6 [3](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/attention/backends/xformers.py:3) from typing import Any, Dict, List, Optional, Tuple, Type [5](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/attention/backends/xformers.py:5) import torch ----> [6](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/attention/backends/xformers.py:6) from xformers import ops as xops [7](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/attention/backends/xformers.py:7) from xformers.ops.fmha.attn_bias import (AttentionBias, [8](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/attention/backends/xformers.py:8) BlockDiagonalCausalMask, [9](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/attention/backends/xformers.py:9) BlockDiagonalMask, [10](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/attention/backends/xformers.py:10) LowerTriangularMaskWithTensorBias) [12](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/attention/backends/xformers.py:12) from vllm.attention.backends.abstract import (AttentionBackend, AttentionImpl, [13](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/attention/backends/xformers.py:13) AttentionMetadata, AttentionType) File d:\my\env\python3.[1](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:1)0.10\lib\site-packages\xformers\ops\__init__.py:8 1 # Copyright (c) Facebook, Inc. and its affiliates. All rights reserved. [2](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:2) # [3](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:3) # This source code is licensed under the BSD license found in the [4](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:4) # LICENSE file in the root directory of this source tree. 
[6](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:6) import torch ----> [8](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:8) from .fmha import ( [9](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:9) AttentionBias, [10](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:10) AttentionOp, [11](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:11) AttentionOpBase, [12](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:12) LowerTriangularMask, [13](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:13) MemoryEfficientAttentionCkOp, [14](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:14) MemoryEfficientAttentionCutlassFwdFlashBwOp, [15](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:15) MemoryEfficientAttentionCutlassOp, [16](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:16) MemoryEfficientAttentionFlashAttentionOp, [17](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:17) MemoryEfficientAttentionSplitKCkOp, [18](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:18) memory_efficient_attention, [19](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:19) memory_efficient_attention_backward, [20](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:20) memory_efficient_attention_forward, [21](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:21) memory_efficient_attention_forward_requires_grad, [22](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:22) ) [23](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:23) from .indexing import index_select_cat, scaled_index_add [24](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:24) from .ipc import init_ipc File d:\my\env\python3.10.10\lib\site-packages\xformers\ops\fmha\__init__.py:10 [6](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:6) from typing import Any, List, Optional, Sequence, Tuple, Type, Union, cast [8](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:8) import torch ---> [10](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:10) from . 
import ( [11](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:11) attn_bias, [12](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:12) ck, [13](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:13) ck_decoder, [14](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:14) ck_splitk, [15](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:15) cutlass, [16](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:16) flash, [17](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:17) flash3, [18](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:18) triton_splitk, [19](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:19) ) [20](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:20) from .attn_bias import VARLEN_BIASES, AttentionBias, LowerTriangularMask [21](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:21) from .common import ( [22](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:22) AttentionBwOpBase, [23](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:23) AttentionFwOpBase, (...) [29](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:29) bmk2bmhk, [30](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:30) ) File d:\my\env\python3.10.10\lib\site-packages\xformers\ops\fmha\triton_splitk.py:110 [94](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/triton_splitk.py:94) return ( [95](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/triton_splitk.py:95) super(InputsFp8, self).nbytes [96](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/triton_splitk.py:96) + ( (...) 
[105](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/triton_splitk.py:105) ) [106](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/triton_splitk.py:106) ) [109](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/triton_splitk.py:109) if TYPE_CHECKING or _is_triton_available(): --> [110](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/triton_splitk.py:110) from ._triton.splitk_kernels import _fwd_kernel_splitK, _splitK_reduce [111](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/triton_splitk.py:111) else: [112](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/triton_splitk.py:112) _fwd_kernel_splitK = None File d:\my\env\python3.10.10\lib\site-packages\xformers\ops\fmha\_triton\splitk_kernels.py:632 [629](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:629) if sys.version_info >= (3, 9): [630](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:630) # unroll_varargs requires Python 3.9+ [631](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:631) for num_groups in [1, 2, 4, 8]: --> [632](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:632) _fwd_kernel_splitK_autotune[num_groups] = autotune_kernel( [633](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:633) _get_splitk_kernel(num_groups) [634](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:634) ) [636](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:636) def get_autotuner_cache( [637](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:637) num_groups: int, [638](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:638) ) -> Dict[Tuple[Union[int, str]], triton.Config]: [639](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:639) """Returns a triton.runtime.autotuner.AutoTuner.cache object, which [640](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:640) represents mappings from kernel autotune keys (tuples describing kernel inputs) [641](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:641) to triton.Config [642](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:642) """ File d:\my\env\python3.10.10\lib\site-packages\xformers\ops\fmha\_triton\splitk_kernels.py:614, in autotune_kernel(kernel) [604](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:604) WARPS_VALUES = [1, 2, 4] [606](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:606) TRITON_CONFIGS = [ [607](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:607) gen_config(block_m, block_n, stages, warps) [608](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:608) for block_m in BLOCK_M_VALUES (...) 
[611](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:611) for warps in WARPS_VALUES [612](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:612) ] --> [614](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:614) kernel = triton.autotune( [615](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:615) configs=TRITON_CONFIGS, [616](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:616) key=AUTOTUNER_KEY, [617](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:617) use_cuda_graph=True, [618](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:618) )(kernel) [619](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:619) return kernel TypeError: autotune() got an unexpected keyword argument 'use_cuda_graph'
This looks like a problem inside xformers. Maybe you should use other backends by setting VLLM_ATTENTION_BACKEND (a list of options can be found here).
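A minimal sketch of that suggestion, assuming the environment variable is read when the engine is created. The backend name and model_dir below are placeholders, not values from this thread; "FLASHINFER" additionally requires the separate flashinfer package to be installed.

import os

# Pick one of vLLM's documented backend names, e.g. "FLASH_ATTN", "XFORMERS", "FLASHINFER".
# Setting it before importing vLLM is the safest ordering.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

model_dir = "Qwen/Qwen2.5-7B-Instruct"  # placeholder: local path or model id
llm = LLM(model=model_dir, enforce_eager=True)
sampling_params = SamplingParams(top_p=0.9, top_k=10, max_tokens=512)
print(llm.generate(["1+1等于几"], sampling_params))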
@DarkLight1337 Like this? os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN" It doesn't work:
WARNING 10-30 15:04:01 _custom_ops.py:19] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
WARNING 10-30 15:04:08 config.py:438] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 10-30 15:04:08 llm_engine.py:243] Initializing an LLM engine (v0.6.3.post2.dev156+g04a3ae0a.d20241030) with config: model='C:\Users\Admin\.cache\modelscope\hub\Qwen\Qwen2_5-7B-Instruct', speculative_config=None, tokenizer='C:\Users\Admin\.cache\modelscope\hub\Qwen\Qwen25-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=C:\Users\Admin.cache\modelscope\hub\Qwen\Qwen2___5-7B-Instruct, num_scheduler_steps=1, chunked_prefill_enabled=False, multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=False, chat_template_text_format=string, mm_processor_kwargs=None)
INFO 10-30 15:04:09 selector.py:267] Cannot use FlashAttention-2 backend because the vllm.vllm_flash_attn package is not found. Make sure that vllm_flash_attn was built and installed (on by default).
INFO 10-30 15:04:09 selector.py:119] Using XFormers backend.
But I already installed flash-attention:
pip show flash-attn
Name: flash_attn
Version: 2.6.3
Summary: Flash Attention: Fast and Memory-Efficient Exact Attention
Home-page: https://github.com/Dao-AILab/flash-attention
Author: Tri Dao
Author-email: tri@tridao.me
License:
Location: d:\my\env\python3.10.10\lib\site-packages
Requires: einops, torch
Required-by:
PS D:\my\work\study\ai\kaggle_code\arc\kaggle_arc_2024>
Can you use pytorch SDPA?
What is PyTorch SDPA?
pip show torch
Name: torch
Version: 2.5.0+cu124
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3-Clause
Location: d:\my\env\python3.10.10\lib\site-packages
Requires: filelock, fsspec, jinja2, networkx, sympy, typing-extensions
Required-by: accelerate, auto_gptq, bitsandbytes, compressed-tensors, deepspeed, encodec, flash_attn, optimum, peft, stable-baselines3, timm, torchaudio, torchvision, trl, vector-quantize-pytorch, vocos, xformers
It is built into pytorch, so you should be able to use it as long as pytorch is installed.
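For reference, a minimal sketch of PyTorch's built-in scaled dot-product attention (torch.nn.functional.scaled_dot_product_attention, available since torch 2.0). The shapes and the CUDA device here are example assumptions only.

import torch
import torch.nn.functional as F

# (batch, num_heads, seq_len, head_dim); use device="cpu" if CUDA is unavailable
q = torch.randn(2, 4, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 4, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 4, 8, 64, device="cuda", dtype=torch.float16)

# PyTorch picks a flash / memory-efficient / math kernel automatically
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 4, 8, 64])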
Now the question is that it cannot find FlashAttention-2. I'm guessing it has nothing to do with SDPA.
Is there an SDPA sample? And about this message, "because the vllm.vllm_flash_attn package is not found": do I need to install another vLLM tool?
This is where I'm unable to really help you. I guess vLLM's flash attention package only works on Linux.
Maybe @dtrifiro can provide some insights here?
But flash-attention already supports Windows. So does vLLM's flash-attention need to be rebuilt? Or tell me how to build it from the source code.
I noticed a difference, and I found https://github.com/Dao-AILab/flash-attention/issues/1066. Did the new version remove it? How can I change this? @DarkLight1337 @dtrifiro
from flash_attn.flash_attn_interface import flash_attn_func
from flash_attn.flash_attn_interface import flash_attn_with_kvcache
import torch

def main():
    batch_size = 2
    seqlen_q = 1
    seqlen_k = 1
    nheads = 4
    n_kv_heads = 2
    d = 3
    device = "cuda"
    causal = True
    window_size = (-1, -1)
    dtype = torch.float16
    paged_kv_cache_size = None
    cache_seqlens = None
    rotary_cos = None
    rotary_sin = None
    cache_batch_idx = None
    block_table = None
    softmax_scale = None
    rotary_interleaved = False
    alibi_slopes = None
    num_splits = 0
    max_seq_len = 3
    if paged_kv_cache_size is None:
        # pre-allocate an empty KV cache that flash_attn_with_kvcache fills in-place
        k_cache = torch.zeros(batch_size, max_seq_len, n_kv_heads, d, device=device, dtype=dtype)
        v_cache = torch.zeros(batch_size, max_seq_len, n_kv_heads, d, device=device, dtype=dtype)
        block_table = None
    prev_q_vals = []
    prev_k_vals = []
    prev_v_vals = []
    torch.manual_seed(0)
    for i in range(0, 3):
        print(">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>")
        q = torch.randn(batch_size, seqlen_q, nheads, d, device=device, dtype=dtype)
        k = torch.randn(batch_size, seqlen_k, n_kv_heads, d, device=device, dtype=dtype)
        v = torch.randn(batch_size, seqlen_k, n_kv_heads, d, device=device, dtype=dtype)
        # KV-cache path: append this step's k/v into the cache at position i
        cache_seqlens = torch.tensor([i] * batch_size, dtype=torch.int32, device=device)
        output_kvcache = flash_attn_with_kvcache(
            q=q,
            k_cache=k_cache,
            v_cache=v_cache,
            k=k,
            v=v,
            rotary_cos=rotary_cos,
            rotary_sin=rotary_sin,
            cache_seqlens=cache_seqlens,
            cache_batch_idx=cache_batch_idx,
            cache_leftpad=None,
            block_table=block_table,
            softmax_scale=softmax_scale,
            causal=causal,
            window_size=window_size,
            softcap=0.0,
            rotary_interleaved=rotary_interleaved,
            alibi_slopes=alibi_slopes,
            num_splits=num_splits,
            return_softmax_lse=False)
        print(f"$$$ output KV CACHE MHA at {i} \n", output_kvcache)
        # non-KV-cache MHA over the accumulated history, for comparison
        prev_q_vals.append(q)
        prev_k_vals.append(k)
        prev_v_vals.append(v)
        output_2 = flash_attn_func(
            q=q,
            k=torch.concat(prev_k_vals, axis=1),
            v=torch.concat(prev_v_vals, axis=1),
            dropout_p=0.0,
            softmax_scale=None,
            causal=causal,
            window_size=window_size,
            softcap=0.0,
            alibi_slopes=None,
            deterministic=False,
            return_attn_probs=False)
        print(f"!!! output MHA NON KV CACHE at {i} \n", output_2)

main()
$$$ output KV CACHE MHA at 0 tensor([[[[ 2.5449, -0.7163, -0.4934], [ 2.5449, -0.7163, -0.4934], [ 0.1267, 0.1014, -0.4036], [ 0.1267, 0.1014, -0.4036]]],
[[[ 0.9023, 0.8101, -0.6885],
[ 0.9023, 0.8101, -0.6885],
[ 0.1372, 1.0381, 0.0925],
[ 0.1372, 1.0381, 0.0925]]]], device='cuda:0', dtype=torch.float16)
!!! output MHA NON KV CACHE at 0 tensor([[[[ 2.5449, -0.7163, -0.4934], [ 2.5449, -0.7163, -0.4934], [ 0.1267, 0.1014, -0.4036], [ 0.1267, 0.1014, -0.4036]]],
[[[ 0.9023, 0.8101, -0.6885],
[ 0.9023, 0.8101, -0.6885],
[ 0.1372, 1.0381, 0.0925],
[ 0.1372, 1.0381, 0.0925]]]], device='cuda:0', dtype=torch.float16)
$$$ output KV CACHE MHA at 1 tensor([[[[ 1.8740, -0.3555, -0.2308], [ 1.8223, -0.3279, -0.2108], [ 0.6812, -0.3042, 0.1327], [ 0.8237, -0.4082, 0.2703]]],
[[[ 0.0036, -0.6611, -1.3848],
[ 0.2605, -0.2406, -1.1865],
[ 0.1748, 0.3794, -0.1744],
[ 0.2352, -0.6782, -0.6030]]]], device='cuda:0', dtype=torch.float16)
!!! output MHA NON KV CACHE at 1 tensor([[[[ 1.8740, -0.3555, -0.2308], [ 1.8223, -0.3279, -0.2108], [ 0.6812, -0.3042, 0.1327], [ 0.8237, -0.4082, 0.2703]]],
[[[ 0.0036, -0.6611, -1.3848],
[ 0.2605, -0.2406, -1.1865],
[ 0.1748, 0.3794, -0.1744],
[ 0.2352, -0.6782, -0.6030]]]], device='cuda:0', dtype=torch.float16)
$$$ output KV CACHE MHA at 2 tensor([[[[-0.2815, 0.2520, -0.2242], [ 0.1653, 0.0293, -0.3726], [ 0.5005, -0.0624, -0.0492], [ 0.3440, 0.3044, -0.2172]]],
[[[ 0.2651, -0.1628, -1.2080],
[ 0.6064, 0.4153, -0.9517],
[ 0.7690, 0.0339, 0.0311],
[ 0.7075, -0.0425, -0.0394]]]], device='cuda:0', dtype=torch.float16)
!!! output MHA NON KV CACHE at 2 tensor([[[[-0.2815, 0.2520, -0.2242], [ 0.1653, 0.0293, -0.3726], [ 0.5005, -0.0624, -0.0492], [ 0.3440, 0.3044, -0.2172]]],
[[[ 0.2651, -0.1628, -1.2080],
[ 0.6064, 0.4153, -0.9517],
[ 0.7690, 0.0339, 0.0311],
[ 0.7075, -0.0425, -0.0394]]]], device='cuda:0', dtype=torch.float16)
But I can run this successfully.
vLLM uses a fork of the flash_attn repo, which can be found here.
ImportError("cannot import name 'flash_attn_varlen_func' from 'vllm.vllm_flash_attn' (unknown location)")
I can find it in flash-attention, but not in vllm_flash_attn. So is it a vLLM 0.6.3 error? @DarkLight1337 @dtrifiro
It's listed in this file: https://github.com/vllm-project/flash-attention/blob/5259c586c403a4e4d8bf69973c159b40cc346fb9/vllm_flash_attn/__init__.py
@DarkLight1337 Do you mean I need to replace all the vllm_flash_attn files? Why isn't this updated in vllm-project?
I am not sure what you mean. Those functions are defined inside vllm_flash_attn as well.
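A quick local check (a sketch, assuming vLLM itself is importable): try importing the functions from the bundled package; an ImportError like the one above usually means the compiled vllm_flash_attn binaries were never built or copied into the install.

try:
    # these names are re-exported by vllm_flash_attn's __init__.py linked above
    from vllm.vllm_flash_attn import flash_attn_varlen_func, flash_attn_with_kvcache
    print("vllm.vllm_flash_attn is usable")
except ImportError as exc:
    print(f"vllm.vllm_flash_attn is not usable: {exc}")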
Right now, the contents of the vllm_flash_attn directory in the vllm-project repo are different from the contents of https://github.com/vllm-project/flash-attention. They belong to different versions, which means the main source branch of vllm-project has not been updated with the latest vllm_flash_attn.
After you clone the vLLM repo, you should build from source using the provided instructions (in your case, it's better to perform a full build to make sure you have the latest version of the compiled binaries). It should download the files from the vLLM flash-attention fork and copy them into the main vLLM repo.
Right now I am installing from source. But after cloning the vLLM repo, their source code is different.
In the vLLM main repo, the vllm_flash_attn directory should be initially empty like this. If this isn't the case, you can try deleting those files and rebuilding vLLM to make sure you get the updated version.
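When an old egg install (like the vllm-0.6.3.post2...egg path in the tracebacks above) and a fresh source build coexist, it can also help to check which copy Python actually resolves. A small sketch:

import importlib.util

# shows the path of the vllm_flash_attn package that would be imported, if any
spec = importlib.util.find_spec("vllm.vllm_flash_attn")
print(spec.origin if spec is not None else "vllm.vllm_flash_attn not found")

# also worth confirming which vllm install is active
import vllm
print(vllm.__file__)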
It is empty here too. In other words, vllm_flash_attn is not included as a sub-project. Do I need to manually clone the vllm_flash_attn project (https://github.com/vllm-project/flash-attention) first and then reinstall?
How are you installing vLLM from source? Can you show the commands which you've used?
@dtrifiro The problem is in vllm-project/flash-attention. This project pins torch to version 2.4.0 and forcibly installs torch 2.4.0 (torch should not be installed here at all; it should raise an error and let the user install it themselves). It also defaults to the highest Python version it finds, while the Python that has my torch package is 3.10.10, which is not the highest version, and I don't know where to change this. As soon as I open the project, CMakeLists.txt is generated automatically and pins the Python version to my local 3.12.4. Since I really couldn't figure out how to change which Python is used for the build, I uninstalled the higher Python version and started compiling.
@DarkLight1337 When the build succeeds and I get vllm_flash_attn_c.pyd along with the .lib and .exp files, how can I use them?
This is outside of my domain as I'm not involved with the vLLM build process. @dtrifiro may be able to help you more.
The problem has not been resolved and needs to be reopened. Also, can you help me contact @dtrifiro (Daniele)? Only he or their project team can solve it, but when I @ him, he doesn't respond.
------------------ Original message ------------------ From: "Simon @.>; Date: Tuesday, October 29, 2024, 1:08 PM; To: @.>; Cc: @.>; @.>; Subject: Re: [vllm-project/vllm] [Installation] pip install vllm (0.6.3) will force a reinstallation of the CPU version torch and replace cuda torch on windows (Issue #9701)
Closed #9701 as completed via #9721.
The network environment in China is poor, and I don't want to keep fiddling with torch versions. After a successful pip install vllm, I found that torch had been replaced by the CPU build. After reinstalling the CUDA build of torch, I ran vLLM once and it errored out as follows (I'm not sure whether this error was later superseded):
WARNING 10-24 21:42:41 _custom_ops.py:19] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
ERROR 10-24 21:42:49 registry.py:267] Error in inspecting model architecture 'Qwen2ForCausalLM'
ERROR 10-24 21:42:49 registry.py:267] Traceback (most recent call last):
ERROR 10-24 21:42:49 registry.py:267]   File "d:\my\env\python3.10.10\lib\site-packages\vllm\model_executor\models\registry.py", line 429, in _run_in_subprocess
ERROR 10-24 21:42:49 registry.py:267]     returned.check_returncode()
ERROR 10-24 21:42:49 registry.py:267]   File "d:\my\env\python3.10.10\lib\subprocess.py", line 457, in check_returncode
ERROR 10-24 21:42:49 registry.py:267]     raise CalledProcessError(self.returncode, self.args, self.stdout,
ERROR 10-24 21:42:49 registry.py:267] subprocess.CalledProcessError: Command '['d:\my\env\python3.10.10\python.exe', '-m', 'vllm.model_executor.models.registry']' returned non-zero exit status 1.
ERROR 10-24 21:42:49 registry.py:267]
ERROR 10-24 21:42:49 registry.py:267] The above exception was the direct cause of the following exception:
ERROR 10-24 21:42:49 registry.py:267]
ERROR 10-24 21:42:49 registry.py:267] Traceback (most recent call last):
ERROR 10-24 21:42:49 registry.py:267]   File "d:\my\env\python3.10.10\lib\site-packages\vllm\model_executor\models\registry.py", line 265, in _try_inspect_model_cls
ERROR 10-24 21:42:49 registry.py:267]     return model.inspect_model_cls()
ERROR 10-24 21:42:49 registry.py:267]   File "d:\my\env\python3.10.10\lib\site-packages\vllm\model_executor\models\registry.py", line 227, in inspect_model_cls
ERROR 10-24 21:42:49 registry.py:267]     return _run_in_subprocess(
ERROR 10-24 21:42:49 registry.py:267]   File "d:\my\env\python3.10.10\lib\site-packages\vllm\model_executor\models\registry.py", line 432, in _run_in_subprocess
ERROR 10-24 21:42:49 registry.py:267]     raise RuntimeError(f"Error raised in subprocess:\n"
ERROR 10-24 21:42:49 registry.py:267] RuntimeError: Error raised in subprocess:
ERROR 10-24 21:42:49 registry.py:267] d:\my\env\python3.10.10\lib\runpy.py:126: RuntimeWarning: 'vllm.model_executor.models.registry' found in sys.modules after import of package 'vllm.model_executor.models', but prior to execution of 'vllm.model_executor.models.registry'; this may result in unpredictable behaviour
ERROR 10-24 21:42:49 registry.py:267]     warn(RuntimeWarning(msg))
ERROR 10-24 21:42:49 registry.py:267] Traceback (most recent call last):
ERROR 10-24 21:42:49 registry.py:267]   File "d:\my\env\python3.10.10\lib\runpy.py", line 196, in _run_module_as_main
ERROR 10-24 21:42:49 registry.py:267]     return _run_code(code, main_globals, None,
ERROR 10-24 21:42:49 registry.py:267]   File "d:\my\env\python3.10.10\lib\runpy.py", line 86, in _run_code
ERROR 10-24 21:42:49 registry.py:267]     exec(code, run_globals)
ERROR 10-24 21:42:49 registry.py:267]   File "d:\my\env\python3.10.10\lib\site-packages\vllm\model_executor\models\registry.py", line 453, in <module>
ERROR 10-24 21:42:49 registry.py:267]     _run()
ERROR 10-24 21:42:49 registry.py:267]   File "d:\my\env\python3.10.10\lib\site-packages\vllm\model_executor\models\registry.py", line 448, in _run
ERROR 10-24 21:42:49 registry.py:267]     with open(output_file, "wb") as f:
ERROR 10-24 21:42:49 registry.py:267] PermissionError: [Errno 13] Permission denied: 'C:\Users\Admin\AppData\Local\Temp\tmp6cx7k05c'
ERROR 10-24 21:42:49 registry.py:267]
ValueError Traceback (most recent call last)
Cell In[2], line 5 1 from vllm import LLM, SamplingParams 3 # model_dir='Qwen2.5-14B-Instruct-GPTQ-Int4' ----> 5 llm = LLM(model=model_dir,enforce_eager=True) 6 sampling_params = SamplingParams( top_p=0.9, max_tokens=512,top_k=10) 8 prompt = "1+1等于几"
File d:\my\env\python3.10.10\lib\site-packages\vllm\entrypoints\llm.py:177, in LLM.init(self, model, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, cpu_offload_gb, enforce_eager, max_context_len_to_capture, max_seq_len_to_capture, disable_custom_all_reduce, disable_async_output_proc, mm_processor_kwargs, kwargs) 152 kwargs["disable_log_stats"] = True 154 engine_args = EngineArgs( 155 model=model, 156 tokenizer=tokenizer, (...) 175 kwargs, 176 ) --> 177 self.llm_engine = LLMEngine.from_engine_args( 178 engine_args, usage_context=UsageContext.LLM_CLASS) 179 self.request_counter = Counter()
File d:\my\env\python3.10.10\lib\site-packages\vllm\engine\llm_engine.py:570, in LLMEngine.from_engine_args(cls, engine_args, usage_context, stat_loggers) 568 """Creates an LLM engine from the engine arguments.""" 569 # Create the engine configs. --> 570 engine_config = engine_args.create_engine_config() 571 executor_class = cls._get_executor_cls(engine_config) 572 # Create the LLM engine.
File d:\my\env\python3.10.10\lib\site-packages\vllm\engine\arg_utils.py:903, in EngineArgs.create_engine_config(self) 898 assert self.cpu_offload_gb >= 0, ( 899 "CPU offload space must be non-negative" 900 f", but got {self.cpu_offload_gb}") 902 device_config = DeviceConfig(device=self.device) --> 903 model_config = self.create_model_config() 905 if model_config.is_multimodal_model: 906 if self.enable_prefix_caching:
File d:\my\env\python3.10.10\lib\site-packages\vllm\engine\arg_utils.py:839, in EngineArgs.create_model_config(self) 838 def create_model_config(self) -> ModelConfig: --> 839 return ModelConfig( 840 model=self.model, 841 # We know this is not None because we set it in post_init 842 tokenizer=cast(str, self.tokenizer), 843 tokenizer_mode=self.tokenizer_mode, 844 trust_remote_code=self.trust_remote_code, 845 dtype=self.dtype, 846 seed=self.seed, 847 revision=self.revision, 848 code_revision=self.code_revision, 849 rope_scaling=self.rope_scaling, 850 rope_theta=self.rope_theta, 851 tokenizer_revision=self.tokenizer_revision, 852 max_model_len=self.max_model_len, 853 quantization=self.quantization, 854 quantization_param_path=self.quantization_param_path, 855 enforce_eager=self.enforce_eager, 856 max_context_len_to_capture=self.max_context_len_to_capture, 857 max_seq_len_to_capture=self.max_seq_len_to_capture, 858 max_logprobs=self.max_logprobs, 859 disable_sliding_window=self.disable_sliding_window, 860 skip_tokenizer_init=self.skip_tokenizer_init, 861 served_model_name=self.served_model_name, 862 limit_mm_per_prompt=self.limit_mm_per_prompt, 863 use_async_output_proc=not self.disable_async_output_proc, 864 override_neuron_config=self.override_neuron_config, 865 config_format=self.config_format, 866 mm_processor_kwargs=self.mm_processor_kwargs, 867 )
File d:\my\env\python3.10.10\lib\site-packages\vllm\config.py:200, in ModelConfig.init(self, model, tokenizer, tokenizer_mode, trust_remote_code, dtype, seed, revision, code_revision, rope_scaling, rope_theta, tokenizer_revision, max_model_len, spec_target_max_model_len, quantization, quantization_param_path, enforce_eager, max_context_len_to_capture, max_seq_len_to_capture, max_logprobs, disable_sliding_window, skip_tokenizer_init, served_model_name, limit_mm_per_prompt, use_async_output_proc, override_neuron_config, config_format, mm_processor_kwargs) 192 self.max_model_len = _get_and_verify_max_len( 193 hf_config=self.hf_text_config, 194 max_model_len=max_model_len, 195 disable_sliding_window=self.disable_sliding_window, 196 sliding_window_len=self.get_hf_config_sliding_window(), 197 spec_target_max_model_len=spec_target_max_model_len) 198 self.served_model_name = get_served_model_name(model, 199 served_model_name) --> 200 self.multimodal_config = self._init_multimodal_config( 201 limit_mm_per_prompt) 202 if not self.skip_tokenizer_init: 203 self._verify_tokenizer_mode()
File d:\my\env\python3.10.10\lib\site-packages\vllm\config.py:219, in ModelConfig._init_multimodal_config(self, limit_mm_per_prompt) 215 def _init_multimodal_config( 216 self, limit_mm_per_prompt: Optional[Mapping[str, int]] 217 ) -> Optional["MultiModalConfig"]: 218 architectures = getattr(self.hf_config, "architectures", []) --> 219 if ModelRegistry.is_multimodal_model(architectures): 220 return MultiModalConfig(limit_per_prompt=limit_mm_per_prompt or {}) 222 if limit_mm_per_prompt:
File d:\my\env\python3.10.10\lib\site-packages\vllm\model_executor\models\registry.py:387, in _ModelRegistry.is_multimodal_model(self, architectures) 383 def is_multimodal_model( 384 self, 385 architectures: Union[str, List[str]], 386 ) -> bool: --> 387 return self.inspect_model_cls(architectures).supports_multimodal
File d:\my\env\python3.10.10\lib\site-packages\vllm\model_executor\models\registry.py:356, in _ModelRegistry.inspect_model_cls(self, architectures) 353 if model_info is not None: 354 return model_info --> 356 return self._raise_for_unsupported(architectures)
File d:\my\env\python3.10.10\lib\site-packages\vllm\model_executor\models\registry.py:317, in _ModelRegistry._raise_for_unsupported(self, architectures) 314 def _raise_for_unsupported(self, architectures: List[str]): 315 all_supported_archs = self.get_supported_archs() --> 317 raise ValueError( 318 f"Model architectures {architectures} are not supported for now. " 319 f"Supported architectures: {all_supported_archs}")
ValueError: Model architectures ['Qwen2ForCausalLM'] are not supported for now. Supported architectures: ['AquilaModel', 'AquilaForCausalLM', 'ArcticForCausalLM', 'BaiChuanForCausalLM', 'BaichuanForCausalLM', 'BloomForCausalLM', 'CohereForCausalLM', 'DbrxForCausalLM', 'DeciLMForCausalLM', 'DeepseekForCausalLM', 'DeepseekV2ForCausalLM', 'ExaoneForCausalLM', 'FalconForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTJForCausalLM', 'GPTNeoXForCausalLM', 'GraniteForCausalLM', 'GraniteMoeForCausalLM', 'InternLMForCausalLM', 'InternLM2ForCausalLM', 'JAISLMHeadModel', 'JambaForCausalLM', 'LlamaForCausalLM', 'LLaMAForCausalLM', 'MambaForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'QuantMixtralForCausalLM', 'MptForCausalLM', 'MPTForCausalLM', 'MiniCPMForCausalLM', 'MiniCPM3ForCausalLM', 'NemotronForCausalLM', 'OlmoForCausalLM', 'OlmoeForCausalLM', 'OPTForCausalLM', 'OrionForCausalLM', 'PersimmonForCausalLM', 'PhiForCausalLM', 'Phi3ForCausalLM', 'Phi3SmallForCausalLM', 'PhiMoEForCausalLM', 'Qwen2ForCausalLM', 'Qwen2MoeForCausalLM', 'RWForCausalLM', 'StableLMEpochForCausalLM', 'StableLmForCausalLM', 'Starcoder2ForCausalLM', 'SolarForCausalLM', 'XverseForCausalLM', 'BartModel', 'BartForConditionalGeneration', 'Gemma2Model', 'MistralModel', 'Qwen2ForRewardModel', 'Phi3VForCausalLM', 'Blip2ForConditionalGeneration', 'ChameleonForConditionalGeneration', 'ChatGLMModel', 'ChatGLMForConditionalGeneration', 'FuyuForCausalLM', 'InternVLChatModel', 'LlavaForConditionalGeneration', 'LlavaNextForConditionalGeneration', 'LlavaNextVideoForConditionalGeneration', 'LlavaOnevisionForConditionalGeneration', 'MiniCPMV', 'MolmoForCausalLM', 'NVLM_D', 'PaliGemmaForConditionalGeneration', 'PixtralForConditionalGeneration', 'QWenLMHeadModel', 'Qwen2VLForConditionalGeneration', 'UltravoxModel', 'MllamaForConditionalGeneration', 'EAGLEModel', 'MedusaModel', 'MLPSpeculatorPreTrainedModel']
Same problem.
-- USE_CUDNN is set to 0. Compiling without cuDNN support
-- USE_CUSPARSELT is set to 0. Compiling without cuSPARSELt support
-- USE_CUDSS is set to 0. Compiling without cuDSS support
-- USE_CUFILE is set to 0. Compiling without cuFile support
-- Autodetected CUDA architecture(s): 8.9
-- Added CUDA NVCC flags for: -gencode;arch=compute_89,code=sm_89
-- CUDA supported arches: 8.0;8.6;8.9;9.0
-- CUDA target arches: 89-real
-- Configuring done (7.5s)
-- Generating done (0.1s)
-- Build files have been written to: D:/my/work/LLM/vllm/vllm-flash-attention/flash-attention
Error: could not load cache
Traceback (most recent call last):
File "D:\my\work\LLM\vllm\vllm-flash-attention\flash-attention\setup.py", line 325, in
It is not resolved for now.
I have torch 2.5.1+cu121, and vllm will still re-install torch 2.5.0, which breaks my CUDA-enabled torch.
pip install vllm (0.6.3) will force a reinstallation of the CPU version of torch and replace the CUDA torch on Windows.
What is your original version of pytorch?
Originally posted by @DarkLight1337 in https://github.com/vllm-project/vllm/issues/4194#issuecomment-2435665167
pip show torch
Name: torch
Version: 2.5.0+cu124
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3-Clause
Location: d:\my\env\python3.10.10\lib\site-packages
Requires: filelock, fsspec, jinja2, networkx, sympy, typing-extensions
Required-by: accelerate, auto_gptq, bitsandbytes, compressed-tensors, encodec, flash_attn, optimum, peft, stable-baselines3, timm, torchaudio, torchvision, trl, vector-quantize-pytorch, vocos