vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Installation] pip install vllm (0.6.3) will force a reinstallation of the CPU version torch and replace cuda torch on windows #9701

Open xiezhipeng-git opened 1 month ago

xiezhipeng-git commented 1 month ago

pip install vllm (0.6.3) will force a reinstallation of the CPU version of torch, replacing the CUDA torch, on Windows.

I don't quite get what you mean; how can you have different versions of torch for CPU and GPU at the same time?

Only the CUDA torch is installed.

 pip install vllm --no-deps
Collecting vllm
  Using cached vllm-0.6.3.post1.tar.gz (2.7 MB)
  Installing build dependencies ... error
  error: subprocess-exited-with-error

  × pip subprocess to install build dependencies did not run successfully.
  │ exit code: 2
  ╰─> [86 lines of output]
      Collecting cmake>=3.26
        Using cached cmake-3.30.5-py3-none-win_amd64.whl.metadata (6.4 kB)
      Collecting ninja
        Using cached ninja-1.11.1.1-py2.py3-none-win_amd64.whl.metadata (5.4 kB)

      Collecting packaging
        Using cached packaging-24.1-py3-none-any.whl.metadata (3.2 kB)
      Collecting setuptools>=61
        Using cached setuptools-75.2.0-py3-none-any.whl.metadata (6.9 kB)
      Collecting setuptools-scm>=8.0
        Using cached setuptools_scm-8.1.0-py3-none-any.whl.metadata (6.6 kB)
      Collecting torch==2.4.0
        Using cached torch-2.4.0-cp310-cp310-win_amd64.whl.metadata (27 kB)
      Collecting wheel
        Using cached wheel-0.44.0-py3-none-any.whl.metadata (2.3 kB)
      Collecting jinja2
        Using cached jinja2-3.1.4-py3-none-any.whl.metadata (2.6 kB)
      Collecting filelock (from torch==2.4.0)
        Using cached filelock-3.16.1-py3-none-any.whl.metadata (2.9 kB)
      Collecting typing-extensions>=4.8.0 (from torch==2.4.0)
        Using cached typing_extensions-4.12.2-py3-none-any.whl.metadata (3.0 kB)

      Collecting sympy (from torch==2.4.0)
        Using cached sympy-1.13.3-py3-none-any.whl.metadata (12 kB)
      Collecting networkx (from torch==2.4.0)
        Using cached networkx-3.4.2-py3-none-any.whl.metadata (6.3 kB)
      Collecting fsspec (from torch==2.4.0)
        Using cached fsspec-2024.10.0-py3-none-any.whl.metadata (11 kB)
      Collecting tomli>=1 (from setuptools-scm>=8.0)
        Using cached tomli-2.0.2-py3-none-any.whl.metadata (10.0 kB)
      Collecting MarkupSafe>=2.0 (from jinja2)
        Using cached MarkupSafe-3.0.2-cp310-cp310-win_amd64.whl.metadata (4.1 kB)
      Collecting mpmath<1.4,>=1.1.0 (from sympy->torch==2.4.0)
        Using cached mpmath-1.3.0-py3-none-any.whl.metadata (8.6 kB)
      Downloading torch-2.4.0-cp310-cp310-win_amd64.whl (197.9 MB)
                                                  3.9/197.9 MB 21.3 kB/s eta 2:31:31
      ERROR: Exception:
      Traceback (most recent call last):
        File "D:\my\env\python3.10.10\Lib\site-packages\pip\_vendor\urllib3\resp
onse.py", line 438, in _error_catcher
          yield
        File "D:\my\env\python3.10.10\Lib\site-packages\pip\_vendor\urllib3\resp
onse.py", line 561, in read
          data = self._fp_read(amt) if not fp_closed else b""
        File "D:\my\env\python3.10.10\Lib\site-packages\pip\_vendor\urllib3\resp
onse.py", line 527, in _fp_read
          return self._fp.read(amt) if amt is not None else self._fp.read()
        File "D:\my\env\python3.10.10\Lib\site-packages\pip\_vendor\cachecontrol
\filewrapper.py", line 98, in read
          data: bytes = self.__fp.read(amt)
        File "D:\my\env\python3.10.10\lib\http\client.py", line 465, in read
          s = self.fp.read(amt)
        File "D:\my\env\python3.10.10\lib\socket.py", line 705, in readinto
          return self._sock.recv_into(b)
        File "D:\my\env\python3.10.10\lib\ssl.py", line 1274, in recv_into
          return self.read(nbytes, buffer)
        File "D:\my\env\python3.10.10\lib\ssl.py", line 1130, in read
          return self._sslobj.read(len, buffer)
      TimeoutError: The read operation timed out

      During handling of the above exception, another exception occurred:

      Traceback (most recent call last):
        File "D:\my\env\python3.10.10\Lib\site-packages\pip\_internal\cli\base_c
ommand.py", line 105, in _run_wrapper
          status = _inner_run()
        File "D:\my\env\python3.10.10\Lib\site-packages\pip\_internal\cli\base_c
ommand.py", line 96, in _inner_run
          return self.run(options, args)
        File "D:\my\env\python3.10.10\Lib\site-packages\pip\_internal\cli\req_co
mmand.py", line 67, in wrapper
          return func(self, options, args)
        File "D:\my\env\python3.10.10\Lib\site-packages\pip\_internal\commands\i
nstall.py", line 379, in run
          requirement_set = resolver.resolve(
        File "D:\my\env\python3.10.10\Lib\site-packages\pip\_internal\resolution
\resolvelib\resolver.py", line 179, in resolve
          self.factory.preparer.prepare_linked_requirements_more(reqs)
        File "D:\my\env\python3.10.10\Lib\site-packages\pip\_internal\operations
\prepare.py", line 554, in prepare_linked_requirements_more
          self._complete_partial_requirements(
        File "D:\my\env\python3.10.10\Lib\site-packages\pip\_internal\operations
\prepare.py", line 469, in _complete_partial_requirements
          for link, (filepath, _) in batch_download:
        File "D:\my\env\python3.10.10\Lib\site-packages\pip\_internal\network\do
wnload.py", line 184, in __call__
          for chunk in chunks:
        File "D:\my\env\python3.10.10\Lib\site-packages\pip\_internal\cli\progre
ss_bars.py", line 55, in _rich_progress_bar
          for chunk in iterable:
        File "D:\my\env\python3.10.10\Lib\site-packages\pip\_internal\network\ut
ils.py", line 65, in response_chunks
          for chunk in response.raw.stream(
        File "D:\my\env\python3.10.10\Lib\site-packages\pip\_vendor\urllib3\resp
onse.py", line 622, in stream
          data = self.read(amt=amt, decode_content=decode_content)
        File "D:\my\env\python3.10.10\Lib\site-packages\pip\_vendor\urllib3\resp
onse.py", line 560, in read
          with self._error_catcher():
        File "D:\my\env\python3.10.10\lib\contextlib.py", line 153, in __exit__
          self.gen.throw(typ, value, traceback)
        File "D:\my\env\python3.10.10\Lib\site-packages\pip\_vendor\urllib3\resp
onse.py", line 443, in _error_catcher
          raise ReadTimeoutError(self._pool, None, "Read timed out.")
      pip._vendor.urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host=
'files.pythonhosted.org', port=443): Read timed out.
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem wit
h pip.
error: subprocess-exited-with-error

× pip subprocess to install build dependencies did not run successfully.
│ exit code: 2
╰─> See above for output.

If your internet connection is poor, you are actually lucky: the install fails partway through forcibly replacing the CUDA torch with the CPU one. If your connection is good, things get worse: your torch goes from a CUDA build to an older CPU-only build. And `pip install vllm --no-deps` has the same issue as `pip install vllm`.

What is your original version of pytorch?

Originally posted by @DarkLight1337 in https://github.com/vllm-project/vllm/issues/4194#issuecomment-2435665167

pip show torch
Name: torch
Version: 2.5.0+cu124
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3-Clause
Location: d:\my\env\python3.10.10\lib\site-packages
Requires: filelock, fsspec, jinja2, networkx, sympy, typing-extensions
Required-by: accelerate, auto_gptq, bitsandbytes, compressed-tensors, encodec, flash_attn, optimum, peft, stable-baselines3, timm, torchaudio, torchvision, trl, vector-quantize-pytorch, vocos
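A quick way to confirm which torch build pip has left behind is to ask torch itself; a minimal sketch, assuming a standard PyTorch install:

import torch

# A CUDA wheel reports a "+cuXXX" suffix and a non-None torch.version.cuda;
# a CPU-only wheel reports "+cpu" and torch.version.cuda is None.
print(torch.__version__)          # e.g. "2.5.0+cu124" vs. "2.4.0+cpu"
print(torch.version.cuda)         # CUDA toolkit the wheel was built against, or None
print(torch.cuda.is_available())  # True only when a CUDA build can see a working GPU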

xiezhipeng-git commented 1 month ago

@DarkLight1337

DarkLight1337 commented 4 weeks ago

Please follow these instructions on how to install for CPU.

xiezhipeng-git commented 4 weeks ago

Please follow these instructions on how to install for CPU.

Why install the CPU version? Are you saying vLLM is CPU-only? Can it even be used on Windows?
And will  pip install -v -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
replace my torch? That is the problem.
DarkLight1337 commented 4 weeks ago

Please follow these instructions on how to install for CPU.

Why install the CPU version? Are you saying vLLM is CPU-only? Can it even be used on Windows?
And will  pip install -v -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
replace my torch? That is the problem.

Based on your title, you originally have PyTorch for CPU installed and do not want the CUDA version to be installed, so I guess you want the CPU version of vLLM as well. Correct me if I'm wrong.

DarkLight1337 commented 4 weeks ago

On Windows, you should be able to use vLLM via WSL if I recall correctly.

xiezhipeng-git commented 4 weeks ago

Please follow these instructions on how to install for CPU.

Why install the CPU version? Are you saying vLLM is CPU-only? Can it even be used on Windows?
And will  pip install -v -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
replace my torch? That is the problem.

Based on your title, you originally have PyTorch for CPU installed and do not want the CUDA version to be installed, so I guess you want the CPU version of vLLM as well. Correct me if I'm wrong.

Of course not. From beginning to end I have been saying that the CUDA version was replaced by the CPU version; all I want is the CUDA version. If you can read Chinese, let's communicate in Chinese to avoid further misunderstanding. What I have said throughout is that my CUDA torch gets replaced by the CPU build, and that ending up with the CPU version is very bad: I then have to reinstall the CUDA version, which, given the network conditions in China, wastes a lot of time. I only want the CUDA version; I have never once installed the CPU version of vLLM. It is the vLLM install going wrong that causes the CUDA torch to be replaced with the CPU one, and that is what the title means. If possible, I would rather not use WSL anymore, because there is a high chance that the WSL virtual machine crashes under unknown circumstances and then cannot be started again, at which point the whole VM is unusable. The WSL development team could not resolve it either, and I had to delete more than 100 GB of data. A complete waste of time.

DarkLight1337 commented 4 weeks ago

Please follow these instructions on how to install for CPU.

Why install the CPU version? Are you saying vLLM is CPU-only? Can it even be used on Windows?
And will  pip install -v -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
replace my torch? That is the problem.

Based on your title, you originally have PyTorch for CPU installed and do not want the CUDA version to be installed, so I guess you want the CPU version of vLLM as well. Correct me if I'm wrong.

Of course not. From beginning to end I have been saying that the CUDA version was replaced by the CPU version; all I want is the CUDA version. If you can read Chinese, let's communicate in Chinese to avoid further misunderstanding. What I have said throughout is that my CUDA torch gets replaced by the CPU build, and that ending up with the CPU version is very bad: I then have to reinstall the CUDA version, which, given the network conditions in China, wastes a lot of time. I only want the CUDA version

Oh sorry, I somehow read it the other way round. vLLM only officially supports Linux OS so it might not be able to detect your CUDA from native Windows. I suggest using vLLM through WSL.

xiezhipeng-git commented 4 weeks ago

Please follow these instructions on how to install for CPU.

Why install the CPU version? Are you saying vLLM is CPU-only? Can it even be used on Windows?
And will  pip install -v -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
replace my torch? That is the problem.

Based on your title, you originally have PyTorch for CPU installed and do not want the CUDA version to be installed, so I guess you want the CPU version of vLLM as well. Correct me if I'm wrong.

Of course not. From beginning to end I have been saying that the CUDA version was replaced by the CPU version; all I want is the CUDA version. If you can read Chinese, let's communicate in Chinese to avoid further misunderstanding. What I have said throughout is that my CUDA torch gets replaced by the CPU build, and that ending up with the CPU version is very bad: I then have to reinstall the CUDA version, which, given the network conditions in China, wastes a lot of time. I only want the CUDA version

Oh sorry, I somehow read it the other way round. vLLM only officially supports Linux OS so it might not be able to detect your CUDA from native Windows. I suggest using vLLM through WSL.

WSL has a fatal flaw, and if at all possible I no longer want to use it. The real problem here is that your build cannot detect the CUDA torch, which sounds like the pip and cmake steps are not set up properly. You could look at how flash-attention does it: it can now be installed on Windows, and by not isolating the build environment it finds the existing torch and compiles against it. If that is what is going on here, it should not be hard to solve.

DarkLight1337 commented 4 weeks ago

@dtrifiro what's your opinion on supporting Windows? Is it feasible at this stage?

xiezhipeng-git commented 4 weeks ago

@DarkLight1337 @dtrifiro Also, the reason the torch version gets replaced with the CPU build is probably that you pin the torch requirement to ==2.4.0. Change it to >=2.4.0 and the automatic reinstall problem should go away. Then we can look into whether Windows can be officially supported.
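To make the pinning point concrete, a small sketch using the packaging library (an illustration for this thread, not code from vLLM): an existing 2.5.0+cu124 install cannot satisfy an exact ==2.4.0 requirement, so pip downloads torch 2.4.0 again and, as reported above, the wheel it picks on Windows is the CPU-only one, whereas a >=2.4.0 specifier would already be satisfied.

from packaging.specifiers import SpecifierSet
from packaging.version import Version

installed = Version("2.5.0+cu124")           # the reporter's existing CUDA build

print(installed in SpecifierSet("==2.4.0"))  # False -> pip has to fetch torch 2.4.0 again
print(installed in SpecifierSet(">=2.4.0"))  # True  -> the existing install already satisfies it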

DarkLight1337 commented 4 weeks ago

@DarkLight1337 @dtrifiro Also, the reason the torch version gets replaced with the CPU build is probably that you pin the torch requirement to ==2.4.0. Change it to >=2.4.0 and the automatic reinstall problem should go away. Then we can look into whether Windows can be officially supported.

From my understanding, PyTorch installation should be able to automatically choose CPU/CUDA based on your machine. What happens if you just install torch==2.4.0 directly?

xiezhipeng-git commented 4 weeks ago

The network conditions in China are poor and I don't want to keep fiddling with torch versions. Earlier, after `pip install vllm` succeeded, I discovered this problem of torch being replaced with the CPU build. However, after reinstalling the CUDA torch I did run vLLM once, and it failed with the error below (I am not sure whether this error was later superseded):

WARNING 10-24 21:42:41 _custom_ops.py:19] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
ERROR 10-24 21:42:49 registry.py:267] Error in inspecting model architecture 'Qwen2ForCausalLM'
ERROR 10-24 21:42:49 registry.py:267] Traceback (most recent call last):
ERROR 10-24 21:42:49 registry.py:267]   File "d:\my\env\python3.10.10\lib\site-packages\vllm\model_executor\models\registry.py", line 429, in _run_in_subprocess
ERROR 10-24 21:42:49 registry.py:267]     returned.check_returncode()
ERROR 10-24 21:42:49 registry.py:267]   File "d:\my\env\python3.10.10\lib\subprocess.py", line 457, in check_returncode
ERROR 10-24 21:42:49 registry.py:267]     raise CalledProcessError(self.returncode, self.args, self.stdout,
ERROR 10-24 21:42:49 registry.py:267] subprocess.CalledProcessError: Command '['d:\my\env\python3.10.10\python.exe', '-m', 'vllm.model_executor.models.registry']' returned non-zero exit status 1.
ERROR 10-24 21:42:49 registry.py:267]
ERROR 10-24 21:42:49 registry.py:267] The above exception was the direct cause of the following exception:
ERROR 10-24 21:42:49 registry.py:267]
ERROR 10-24 21:42:49 registry.py:267] Traceback (most recent call last):
ERROR 10-24 21:42:49 registry.py:267]   File "d:\my\env\python3.10.10\lib\site-packages\vllm\model_executor\models\registry.py", line 265, in _try_inspect_model_cls
ERROR 10-24 21:42:49 registry.py:267]     return model.inspect_model_cls()
ERROR 10-24 21:42:49 registry.py:267]   File "d:\my\env\python3.10.10\lib\site-packages\vllm\model_executor\models\registry.py", line 227, in inspect_model_cls
ERROR 10-24 21:42:49 registry.py:267]     return _run_in_subprocess(
ERROR 10-24 21:42:49 registry.py:267]   File "d:\my\env\python3.10.10\lib\site-packages\vllm\model_executor\models\registry.py", line 432, in _run_in_subprocess
ERROR 10-24 21:42:49 registry.py:267]     raise RuntimeError(f"Error raised in subprocess:\n"
ERROR 10-24 21:42:49 registry.py:267] RuntimeError: Error raised in subprocess:
ERROR 10-24 21:42:49 registry.py:267] d:\my\env\python3.10.10\lib\runpy.py:126: RuntimeWarning: 'vllm.model_executor.models.registry' found in sys.modules after import of package 'vllm.model_executor.models', but prior to execution of 'vllm.model_executor.models.registry'; this may result in unpredictable behaviour
ERROR 10-24 21:42:49 registry.py:267]   warn(RuntimeWarning(msg))
ERROR 10-24 21:42:49 registry.py:267] Traceback (most recent call last):
ERROR 10-24 21:42:49 registry.py:267]   File "d:\my\env\python3.10.10\lib\runpy.py", line 196, in _run_module_as_main
ERROR 10-24 21:42:49 registry.py:267]     return _run_code(code, main_globals, None,
ERROR 10-24 21:42:49 registry.py:267]   File "d:\my\env\python3.10.10\lib\runpy.py", line 86, in _run_code
ERROR 10-24 21:42:49 registry.py:267]     exec(code, run_globals)
ERROR 10-24 21:42:49 registry.py:267]   File "d:\my\env\python3.10.10\lib\site-packages\vllm\model_executor\models\registry.py", line 453, in <module>
ERROR 10-24 21:42:49 registry.py:267]     _run()
ERROR 10-24 21:42:49 registry.py:267]   File "d:\my\env\python3.10.10\lib\site-packages\vllm\model_executor\models\registry.py", line 448, in _run
ERROR 10-24 21:42:49 registry.py:267]     with open(output_file, "wb") as f:
ERROR 10-24 21:42:49 registry.py:267] PermissionError: [Errno 13] Permission denied: 'C:\Users\Admin\AppData\Local\Temp\tmp6cx7k05c'
ERROR 10-24 21:42:49 registry.py:267]

ValueError                                Traceback (most recent call last)
Cell In[2], line 5
      1 from vllm import LLM, SamplingParams
      3 # model_dir='Qwen2.5-14B-Instruct-GPTQ-Int4'
----> 5 llm = LLM(model=model_dir,enforce_eager=True)
      6 sampling_params = SamplingParams( top_p=0.9,  max_tokens=512,top_k=10)
      8 prompt = "1+1等于几"

File d:\my\env\python3.10.10\lib\site-packages\vllm\entrypoints\llm.py:177, in LLM.init(self, model, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, cpu_offload_gb, enforce_eager, max_context_len_to_capture, max_seq_len_to_capture, disable_custom_all_reduce, disable_async_output_proc, mm_processor_kwargs, kwargs) 152 kwargs["disable_log_stats"] = True 154 engine_args = EngineArgs( 155 model=model, 156 tokenizer=tokenizer, (...) 175 kwargs, 176 ) --> 177 self.llm_engine = LLMEngine.from_engine_args( 178 engine_args, usage_context=UsageContext.LLM_CLASS) 179 self.request_counter = Counter()

File d:\my\env\python3.10.10\lib\site-packages\vllm\engine\llm_engine.py:570, in LLMEngine.from_engine_args(cls, engine_args, usage_context, stat_loggers) 568 """Creates an LLM engine from the engine arguments.""" 569 # Create the engine configs. --> 570 engine_config = engine_args.create_engine_config() 571 executor_class = cls._get_executor_cls(engine_config) 572 # Create the LLM engine.

File d:\my\env\python3.10.10\lib\site-packages\vllm\engine\arg_utils.py:903, in EngineArgs.create_engine_config(self) 898 assert self.cpu_offload_gb >= 0, ( 899 "CPU offload space must be non-negative" 900 f", but got {self.cpu_offload_gb}") 902 device_config = DeviceConfig(device=self.device) --> 903 model_config = self.create_model_config() 905 if model_config.is_multimodal_model: 906 if self.enable_prefix_caching:

File d:\my\env\python3.10.10\lib\site-packages\vllm\engine\arg_utils.py:839, in EngineArgs.create_model_config(self) 838 def create_model_config(self) -> ModelConfig: --> 839 return ModelConfig( 840 model=self.model, 841 # We know this is not None because we set it in __post_init__ 842 tokenizer=cast(str, self.tokenizer), 843 tokenizer_mode=self.tokenizer_mode, 844 trust_remote_code=self.trust_remote_code, 845 dtype=self.dtype, 846 seed=self.seed, 847 revision=self.revision, 848 code_revision=self.code_revision, 849 rope_scaling=self.rope_scaling, 850 rope_theta=self.rope_theta, 851 tokenizer_revision=self.tokenizer_revision, 852 max_model_len=self.max_model_len, 853 quantization=self.quantization, 854 quantization_param_path=self.quantization_param_path, 855 enforce_eager=self.enforce_eager, 856 max_context_len_to_capture=self.max_context_len_to_capture, 857 max_seq_len_to_capture=self.max_seq_len_to_capture, 858 max_logprobs=self.max_logprobs, 859 disable_sliding_window=self.disable_sliding_window, 860 skip_tokenizer_init=self.skip_tokenizer_init, 861 served_model_name=self.served_model_name, 862 limit_mm_per_prompt=self.limit_mm_per_prompt, 863 use_async_output_proc=not self.disable_async_output_proc, 864 override_neuron_config=self.override_neuron_config, 865 config_format=self.config_format, 866 mm_processor_kwargs=self.mm_processor_kwargs, 867 )

File d:\my\env\python3.10.10\lib\site-packages\vllm\config.py:200, in ModelConfig.init(self, model, tokenizer, tokenizer_mode, trust_remote_code, dtype, seed, revision, code_revision, rope_scaling, rope_theta, tokenizer_revision, max_model_len, spec_target_max_model_len, quantization, quantization_param_path, enforce_eager, max_context_len_to_capture, max_seq_len_to_capture, max_logprobs, disable_sliding_window, skip_tokenizer_init, served_model_name, limit_mm_per_prompt, use_async_output_proc, override_neuron_config, config_format, mm_processor_kwargs) 192 self.max_model_len = _get_and_verify_max_len( 193 hf_config=self.hf_text_config, 194 max_model_len=max_model_len, 195 disable_sliding_window=self.disable_sliding_window, 196 sliding_window_len=self.get_hf_config_sliding_window(), 197 spec_target_max_model_len=spec_target_max_model_len) 198 self.served_model_name = get_served_model_name(model, 199 served_model_name) --> 200 self.multimodal_config = self._init_multimodal_config( 201 limit_mm_per_prompt) 202 if not self.skip_tokenizer_init: 203 self._verify_tokenizer_mode()

File d:\my\env\python3.10.10\lib\site-packages\vllm\config.py:219, in ModelConfig._init_multimodal_config(self, limit_mm_per_prompt) 215 def _init_multimodal_config( 216 self, limit_mm_per_prompt: Optional[Mapping[str, int]] 217 ) -> Optional["MultiModalConfig"]: 218 architectures = getattr(self.hf_config, "architectures", []) --> 219 if ModelRegistry.is_multimodal_model(architectures): 220 return MultiModalConfig(limit_per_prompt=limit_mm_per_prompt or {}) 222 if limit_mm_per_prompt:

File d:\my\env\python3.10.10\lib\site-packages\vllm\model_executor\models\registry.py:387, in _ModelRegistry.is_multimodal_model(self, architectures) 383 def is_multimodal_model( 384 self, 385 architectures: Union[str, List[str]], 386 ) -> bool: --> 387 return self.inspect_model_cls(architectures).supports_multimodal

File d:\my\env\python3.10.10\lib\site-packages\vllm\model_executor\models\registry.py:356, in _ModelRegistry.inspect_model_cls(self, architectures) 353 if model_info is not None: 354 return model_info --> 356 return self._raise_for_unsupported(architectures)

File d:\my\env\python3.10.10\lib\site-packages\vllm\model_executor\models\registry.py:317, in _ModelRegistry._raise_for_unsupported(self, architectures) 314 def _raise_for_unsupported(self, architectures: List[str]): 315 all_supported_archs = self.get_supported_archs() --> 317 raise ValueError( 318 f"Model architectures {architectures} are not supported for now. " 319 f"Supported architectures: {all_supported_archs}")

ValueError: Model architectures ['Qwen2ForCausalLM'] are not supported for now. Supported architectures: ['AquilaModel', 'AquilaForCausalLM', 'ArcticForCausalLM', 'BaiChuanForCausalLM', 'BaichuanForCausalLM', 'BloomForCausalLM', 'CohereForCausalLM', 'DbrxForCausalLM', 'DeciLMForCausalLM', 'DeepseekForCausalLM', 'DeepseekV2ForCausalLM', 'ExaoneForCausalLM', 'FalconForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTJForCausalLM', 'GPTNeoXForCausalLM', 'GraniteForCausalLM', 'GraniteMoeForCausalLM', 'InternLMForCausalLM', 'InternLM2ForCausalLM', 'JAISLMHeadModel', 'JambaForCausalLM', 'LlamaForCausalLM', 'LLaMAForCausalLM', 'MambaForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'QuantMixtralForCausalLM', 'MptForCausalLM', 'MPTForCausalLM', 'MiniCPMForCausalLM', 'MiniCPM3ForCausalLM', 'NemotronForCausalLM', 'OlmoForCausalLM', 'OlmoeForCausalLM', 'OPTForCausalLM', 'OrionForCausalLM', 'PersimmonForCausalLM', 'PhiForCausalLM', 'Phi3ForCausalLM', 'Phi3SmallForCausalLM', 'PhiMoEForCausalLM', 'Qwen2ForCausalLM', 'Qwen2MoeForCausalLM', 'RWForCausalLM', 'StableLMEpochForCausalLM', 'StableLmForCausalLM', 'Starcoder2ForCausalLM', 'SolarForCausalLM', 'XverseForCausalLM', 'BartModel', 'BartForConditionalGeneration', 'Gemma2Model', 'MistralModel', 'Qwen2ForRewardModel', 'Phi3VForCausalLM', 'Blip2ForConditionalGeneration', 'ChameleonForConditionalGeneration', 'ChatGLMModel', 'ChatGLMForConditionalGeneration', 'FuyuForCausalLM', 'InternVLChatModel', 'LlavaForConditionalGeneration', 'LlavaNextForConditionalGeneration', 'LlavaNextVideoForConditionalGeneration', 'LlavaOnevisionForConditionalGeneration', 'MiniCPMV', 'MolmoForCausalLM', 'NVLM_D', 'PaliGemmaForConditionalGeneration', 'PixtralForConditionalGeneration', 'QWenLMHeadModel', 'Qwen2VLForConditionalGeneration', 'UltravoxModel', 'MllamaForConditionalGeneration', 'EAGLEModel', 'MedusaModel', 'MLPSpeculatorPreTrainedModel']

DarkLight1337 commented 4 weeks ago

ERROR 10-24 21:42:49 registry.py:267]   File "d:\my\env\python3.10.10\lib\site-packages\vllm\model_executor\models\registry.py", line 448, in _run
ERROR 10-24 21:42:49 registry.py:267]     with open(output_file, "wb") as f:
ERROR 10-24 21:42:49 registry.py:267] PermissionError: [Errno 13] Permission denied: 'C:\Users\Admin\AppData\Local\Temp\tmp6cx7k05c'

This looks like an error I've encountered before: https://stackoverflow.com/questions/23212435/permission-denied-to-write-to-my-temporary-file. It can be solved by writing into a temporary directory instead; let me see if I can fix this real quick.
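A minimal sketch of the workaround being described, with assumed details (the file name and subprocess command are made up; this is not the actual vLLM patch): on Windows a temporary file that is still held open cannot be reopened by a subprocess, so the child writes its result into a file inside a temporary directory instead.

import subprocess
import sys
import tempfile
from pathlib import Path

# Use a file inside a TemporaryDirectory rather than an open NamedTemporaryFile,
# which Windows refuses to let a second process open while it is still held open.
with tempfile.TemporaryDirectory() as tmpdir:
    output_file = Path(tmpdir) / "registry_output.bin"  # hypothetical file name
    subprocess.run(
        [sys.executable, "-c", f"open(r'{output_file}', 'wb').write(b'ok')"],
        check=True,
    )
    print(output_file.read_bytes())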

shaoyuyoung commented 3 weeks ago

Hi @DarkLight1337, can you take a look at this installation issue: #9180? Thanks in advance.

DarkLight1337 commented 3 weeks ago

ERROR 10-24 21:42:49 registry.py:267]   File "d:\my\env\python3.10.10\lib\site-packages\vllm\model_executor\models\registry.py", line 448, in _run
ERROR 10-24 21:42:49 registry.py:267]     with open(output_file, "wb") as f:
ERROR 10-24 21:42:49 registry.py:267] PermissionError: [Errno 13] Permission denied: 'C:\Users\Admin\AppData\Local\Temp\tmp6cx7k05c'

This looks like an error I've encountered before: https://stackoverflow.com/questions/23212435/permission-denied-to-write-to-my-temporary-file. It can be solved by writing into a temporary directory instead; let me see if I can fix this real quick.

Fixed. Feel free to reopen if you still encounter issues.

xiezhipeng-git commented 3 weeks ago

@DarkLight1337 There is a new error:

TypeError                                 Traceback (most recent call last)
Cell In[2], [line 5](vscode-notebook-cell:?execution_count=2&line=5)
      [1](vscode-notebook-cell:?execution_count=2&line=1) from vllm import LLM, SamplingParams
      [3](vscode-notebook-cell:?execution_count=2&line=3) # model_dir='Qwen2.5-14B-Instruct-GPTQ-Int4'
----> [5](vscode-notebook-cell:?execution_count=2&line=5) llm = LLM(model=model_dir,enforce_eager=True)
      [6](vscode-notebook-cell:?execution_count=2&line=6) sampling_params = SamplingParams( top_p=0.9,  max_tokens=512,top_k=10)
      [8](vscode-notebook-cell:?execution_count=2&line=8) prompt = "1+1等于几"

File d:\my\env\python3.10.10\lib\site-packages\vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg\vllm\utils.py:1023, in deprecate_args.<locals>.wrapper.<locals>.inner(*args, **kwargs)
   [1016](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/utils.py:1016)             msg += f" {additional_message}"
   [1018](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/utils.py:1018)         warnings.warn(
   [1019](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/utils.py:1019)             DeprecationWarning(msg),
   [1020](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/utils.py:1020)             stacklevel=3,  # The inner function takes up one level
   [1021](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/utils.py:1021)         )
-> [1023](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/utils.py:1023) return fn(*args, **kwargs)

File d:\my\env\python3.10.10\lib\site-packages\vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg\vllm\entrypoints\llm.py:198, in LLM.__init__(self, model, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, cpu_offload_gb, enforce_eager, max_context_len_to_capture, max_seq_len_to_capture, disable_custom_all_reduce, disable_async_output_proc, mm_processor_kwargs, task, **kwargs)
    [172](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/entrypoints/llm.py:172)     kwargs["disable_log_stats"] = True
    [174](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/entrypoints/llm.py:174) engine_args = EngineArgs(
    [175](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/entrypoints/llm.py:175)     model=model,
    [176](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/entrypoints/llm.py:176)     task=task,
   (...)
    [196](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/entrypoints/llm.py:196)     **kwargs,
    [197](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/entrypoints/llm.py:197) )
--> [198](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/entrypoints/llm.py:198) self.llm_engine = LLMEngine.from_engine_args(
    [199](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/entrypoints/llm.py:199)     engine_args, usage_context=UsageContext.LLM_CLASS)
    [200](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/entrypoints/llm.py:200) self.request_counter = Counter()

File d:\my\env\python3.10.10\lib\site-packages\vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg\vllm\engine\llm_engine.py:582, in LLMEngine.from_engine_args(cls, engine_args, usage_context, stat_loggers)
    [580](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:580) executor_class = cls._get_executor_cls(engine_config)
    [581](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:581) # Create the LLM engine.
--> [582](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:582) engine = cls(
    [583](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:583)     **engine_config.to_dict(),
    [584](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:584)     executor_class=executor_class,
    [585](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:585)     log_stats=not engine_args.disable_log_stats,
    [586](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:586)     usage_context=usage_context,
    [587](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:587)     stat_loggers=stat_loggers,
    [588](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:588) )
    [590](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:590) return engine

File d:\my\env\python3.10.10\lib\site-packages\vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg\vllm\engine\llm_engine.py:341, in LLMEngine.__init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, speculative_config, decoding_config, observability_config, prompt_adapter_config, executor_class, log_stats, usage_context, stat_loggers, input_registry, use_cached_outputs)
    [337](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:337) self.input_registry = input_registry
    [338](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:338) self.input_processor = input_registry.create_input_processor(
    [339](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:339)     model_config)
--> [341](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:341) self.model_executor = executor_class(
    [342](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:342)     model_config=model_config,
    [343](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:343)     cache_config=cache_config,
    [344](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:344)     parallel_config=parallel_config,
    [345](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:345)     scheduler_config=scheduler_config,
    [346](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:346)     device_config=device_config,
    [347](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:347)     lora_config=lora_config,
    [348](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:348)     speculative_config=speculative_config,
    [349](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:349)     load_config=load_config,
    [350](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:350)     prompt_adapter_config=prompt_adapter_config,
    [351](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:351)     observability_config=self.observability_config,
    [352](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:352) )
    [354](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:354) if self.model_config.task != "embedding":
    [355](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/engine/llm_engine.py:355)     self._initialize_kv_caches()

File d:\my\env\python3.10.10\lib\site-packages\vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg\vllm\executor\executor_base.py:47, in ExecutorBase.__init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, speculative_config, prompt_adapter_config, observability_config)
     [45](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/executor_base.py:45) self.prompt_adapter_config = prompt_adapter_config
     [46](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/executor_base.py:46) self.observability_config = observability_config
---> [47](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/executor_base.py:47) self._init_executor()

File d:\my\env\python3.10.10\lib\site-packages\vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg\vllm\executor\gpu_executor.py:38, in GPUExecutor._init_executor(self)
     [33](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:33) """Initialize the worker and load the model.
     [34](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:34) """
     [35](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:35) assert self.parallel_config.world_size == 1, (
     [36](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:36)     "GPUExecutor only supports single GPU.")
---> [38](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:38) self.driver_worker = self._create_worker()
     [39](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:39) self.driver_worker.init_device()
     [40](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:40) self.driver_worker.load_model()

File d:\my\env\python3.10.10\lib\site-packages\vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg\vllm\executor\gpu_executor.py:105, in GPUExecutor._create_worker(self, local_rank, rank, distributed_init_method)
    [101](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:101) def _create_worker(self,
    [102](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:102)                    local_rank: int = 0,
    [103](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:103)                    rank: int = 0,
    [104](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:104)                    distributed_init_method: Optional[str] = None):
--> [105](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:105)     return create_worker(**self._get_create_worker_kwargs(
    [106](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:106)         local_rank=local_rank,
    [107](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:107)         rank=rank,
    [108](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:108)         distributed_init_method=distributed_init_method))

File d:\my\env\python3.10.10\lib\site-packages\vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg\vllm\executor\gpu_executor.py:24, in create_worker(worker_module_name, worker_class_name, worker_class_fn, **kwargs)
     [16](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:16) def create_worker(worker_module_name: str, worker_class_name: str,
     [17](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:17)                   worker_class_fn: Optional[Callable[[], Type[WorkerBase]]],
     [18](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:18)                   **kwargs):
     [19](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:19)     wrapper = WorkerWrapperBase(
     [20](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:20)         worker_module_name=worker_module_name,
     [21](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:21)         worker_class_name=worker_class_name,
     [22](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:22)         worker_class_fn=worker_class_fn,
     [23](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:23)     )
---> [24](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:24)     wrapper.init_worker(**kwargs)
     [25](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/executor/gpu_executor.py:25)     return wrapper.worker

File d:\my\env\python3.10.10\lib\site-packages\vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg\vllm\worker\worker_base.py:449, in WorkerWrapperBase.init_worker(self, *args, **kwargs)
    [446](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker_base.py:446)     mod = importlib.import_module(self.worker_module_name)
    [447](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker_base.py:447)     worker_class = getattr(mod, self.worker_class_name)
--> [449](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker_base.py:449) self.worker = worker_class(*args, **kwargs)
    [450](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker_base.py:450) assert self.worker is not None

File d:\my\env\python3.10.10\lib\site-packages\vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg\vllm\worker\worker.py:99, in Worker.__init__(self, model_config, parallel_config, scheduler_config, device_config, cache_config, load_config, local_rank, rank, distributed_init_method, lora_config, speculative_config, prompt_adapter_config, is_driver_worker, model_runner_cls, observability_config)
     [97](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:97) elif self._is_encoder_decoder_model():
     [98](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:98)     ModelRunnerClass = EncoderDecoderModelRunner
---> [99](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:99) self.model_runner: GPUModelRunnerBase = ModelRunnerClass(
    [100](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:100)     model_config,
    [101](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:101)     parallel_config,
    [102](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:102)     scheduler_config,
    [103](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:103)     device_config,
    [104](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:104)     cache_config,
    [105](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:105)     load_config=load_config,
    [106](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:106)     lora_config=self.lora_config,
    [107](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:107)     kv_cache_dtype=self.cache_config.cache_dtype,
    [108](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:108)     is_driver_worker=is_driver_worker,
    [109](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:109)     prompt_adapter_config=prompt_adapter_config,
    [110](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:110)     observability_config=observability_config,
    [111](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:111)     **speculative_args,
    [112](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:112) )
    [113](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:113) # Uninitialized cache engine. Will be initialized by
    [114](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:114) # initialize_cache.
    [115](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/worker.py:115) self.cache_engine: List[CacheEngine]

File d:\my\env\python3.10.10\lib\site-packages\vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg\vllm\worker\model_runner.py:1013, in GPUModelRunnerBase.__init__(self, model_config, parallel_config, scheduler_config, device_config, cache_config, load_config, lora_config, kv_cache_dtype, is_driver_worker, prompt_adapter_config, return_hidden_states, observability_config, input_registry, mm_registry)
   [1008](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/model_runner.py:1008) num_attn_heads = self.model_config.get_num_attention_heads(
   [1009](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/model_runner.py:1009)     self.parallel_config)
   [1010](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/model_runner.py:1010) needs_attn_backend = (num_attn_heads != 0
   [1011](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/model_runner.py:1011)                       or self.model_config.is_attention_free)
-> [1013](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/model_runner.py:1013) self.attn_backend = get_attn_backend(
   [1014](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/model_runner.py:1014)     self.model_config.get_head_size(),
   [1015](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/model_runner.py:1015)     self.model_config.dtype,
   [1016](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/model_runner.py:1016)     self.kv_cache_dtype,
   [1017](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/model_runner.py:1017)     self.block_size,
   [1018](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/model_runner.py:1018)     self.model_config.is_attention_free,
   [1019](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/model_runner.py:1019) ) if needs_attn_backend else None
   [1020](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/model_runner.py:1020) if self.attn_backend:
   [1021](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/model_runner.py:1021)     self.attn_state = self.attn_backend.get_state_cls()(
   [1022](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/worker/model_runner.py:1022)         weakref.proxy(self))

File d:\my\env\python3.10.10\lib\site-packages\vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg\vllm\attention\selector.py:120, in get_attn_backend(head_size, dtype, kv_cache_dtype, block_size, is_attention_free, is_blocksparse)
    [118](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/attention/selector.py:118) if backend == _Backend.XFORMERS:
    [119](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/attention/selector.py:119)     logger.info("Using XFormers backend.")
--> [120](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/attention/selector.py:120)     from vllm.attention.backends.xformers import (  # noqa: F401
    [121](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/attention/selector.py:121)         XFormersBackend)
    [122](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/attention/selector.py:122)     return XFormersBackend
    [123](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/attention/selector.py:123) elif backend == _Backend.ROCM_FLASH:

File d:\my\env\python3.10.10\lib\site-packages\vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg\vllm\attention\backends\xformers.py:6
      [3](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/attention/backends/xformers.py:3) from typing import Any, Dict, List, Optional, Tuple, Type
      [5](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/attention/backends/xformers.py:5) import torch
----> [6](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/attention/backends/xformers.py:6) from xformers import ops as xops
      [7](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/attention/backends/xformers.py:7) from xformers.ops.fmha.attn_bias import (AttentionBias,
      [8](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/attention/backends/xformers.py:8)                                          BlockDiagonalCausalMask,
      [9](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/attention/backends/xformers.py:9)                                          BlockDiagonalMask,
     [10](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/attention/backends/xformers.py:10)                                          LowerTriangularMaskWithTensorBias)
     [12](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/attention/backends/xformers.py:12) from vllm.attention.backends.abstract import (AttentionBackend, AttentionImpl,
     [13](file:///D:/my/env/python3.10.10/lib/site-packages/vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg/vllm/attention/backends/xformers.py:13)                                               AttentionMetadata, AttentionType)

File d:\my\env\python3.10.10\lib\site-packages\xformers\ops\__init__.py:8
      1 # Copyright (c) Facebook, Inc. and its affiliates. All rights reserved.
      [2](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:2) #
      [3](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:3) # This source code is licensed under the BSD license found in the
      [4](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:4) # LICENSE file in the root directory of this source tree.
      [6](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:6) import torch
----> [8](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:8) from .fmha import (
      [9](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:9)     AttentionBias,
     [10](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:10)     AttentionOp,
     [11](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:11)     AttentionOpBase,
     [12](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:12)     LowerTriangularMask,
     [13](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:13)     MemoryEfficientAttentionCkOp,
     [14](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:14)     MemoryEfficientAttentionCutlassFwdFlashBwOp,
     [15](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:15)     MemoryEfficientAttentionCutlassOp,
     [16](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:16)     MemoryEfficientAttentionFlashAttentionOp,
     [17](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:17)     MemoryEfficientAttentionSplitKCkOp,
     [18](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:18)     memory_efficient_attention,
     [19](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:19)     memory_efficient_attention_backward,
     [20](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:20)     memory_efficient_attention_forward,
     [21](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:21)     memory_efficient_attention_forward_requires_grad,
     [22](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:22) )
     [23](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:23) from .indexing import index_select_cat, scaled_index_add
     [24](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/__init__.py:24) from .ipc import init_ipc

File d:\my\env\python3.10.10\lib\site-packages\xformers\ops\fmha\__init__.py:10
      [6](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:6) from typing import Any, List, Optional, Sequence, Tuple, Type, Union, cast
      [8](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:8) import torch
---> [10](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:10) from . import (
     [11](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:11)     attn_bias,
     [12](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:12)     ck,
     [13](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:13)     ck_decoder,
     [14](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:14)     ck_splitk,
     [15](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:15)     cutlass,
     [16](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:16)     flash,
     [17](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:17)     flash3,
     [18](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:18)     triton_splitk,
     [19](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:19) )
     [20](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:20) from .attn_bias import VARLEN_BIASES, AttentionBias, LowerTriangularMask
     [21](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:21) from .common import (
     [22](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:22)     AttentionBwOpBase,
     [23](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:23)     AttentionFwOpBase,
   (...)
     [29](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:29)     bmk2bmhk,
     [30](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/__init__.py:30) )

File d:\my\env\python3.10.10\lib\site-packages\xformers\ops\fmha\triton_splitk.py:110
     [94](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/triton_splitk.py:94)         return (
     [95](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/triton_splitk.py:95)             super(InputsFp8, self).nbytes
     [96](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/triton_splitk.py:96)             + (
   (...)
    [105](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/triton_splitk.py:105)             )
    [106](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/triton_splitk.py:106)         )
    [109](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/triton_splitk.py:109) if TYPE_CHECKING or _is_triton_available():
--> [110](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/triton_splitk.py:110)     from ._triton.splitk_kernels import _fwd_kernel_splitK, _splitK_reduce
    [111](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/triton_splitk.py:111) else:
    [112](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/triton_splitk.py:112)     _fwd_kernel_splitK = None

File d:\my\env\python3.10.10\lib\site-packages\xformers\ops\fmha\_triton\splitk_kernels.py:632
    [629](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:629) if sys.version_info >= (3, 9):
    [630](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:630)     # unroll_varargs requires Python 3.9+
    [631](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:631)     for num_groups in [1, 2, 4, 8]:
--> [632](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:632)         _fwd_kernel_splitK_autotune[num_groups] = autotune_kernel(
    [633](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:633)             _get_splitk_kernel(num_groups)
    [634](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:634)         )
    [636](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:636)     def get_autotuner_cache(
    [637](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:637)         num_groups: int,
    [638](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:638)     ) -> Dict[Tuple[Union[int, str]], triton.Config]:
    [639](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:639)         """Returns a triton.runtime.autotuner.AutoTuner.cache object, which
    [640](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:640)         represents mappings from kernel autotune keys (tuples describing kernel inputs)
    [641](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:641)         to triton.Config
    [642](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:642)         """

File d:\my\env\python3.10.10\lib\site-packages\xformers\ops\fmha\_triton\splitk_kernels.py:614, in autotune_kernel(kernel)
    [604](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:604) WARPS_VALUES = [1, 2, 4]
    [606](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:606) TRITON_CONFIGS = [
    [607](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:607)     gen_config(block_m, block_n, stages, warps)
    [608](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:608)     for block_m in BLOCK_M_VALUES
   (...)
    [611](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:611)     for warps in WARPS_VALUES
    [612](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:612) ]
--> [614](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:614) kernel = triton.autotune(
    [615](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:615)     configs=TRITON_CONFIGS,
    [616](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:616)     key=AUTOTUNER_KEY,
    [617](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:617)     use_cuda_graph=True,
    [618](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:618) )(kernel)
    [619](file:///D:/my/env/python3.10.10/lib/site-packages/xformers/ops/fmha/_triton/splitk_kernels.py:619) return kernel

TypeError: autotune() got an unexpected keyword argument 'use_cuda_graph'
DarkLight1337 commented 3 weeks ago

@DarkLight1337 has a new error:

TypeError                                 Traceback (most recent call last)
Cell In[2], line 5
      1 from vllm import LLM, SamplingParams
      3 # model_dir='Qwen2.5-14B-Instruct-GPTQ-Int4'
----> 5 llm = LLM(model=model_dir,enforce_eager=True)
      6 sampling_params = SamplingParams( top_p=0.9,  max_tokens=512,top_k=10)
      8 prompt = "1+1等于几"

File d:\my\env\python3.10.10\lib\site-packages\vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg\vllm\utils.py:1023, in deprecate_args.<locals>.wrapper.<locals>.inner(*args, **kwargs)
   1018         warnings.warn(
   1019             DeprecationWarning(msg),
   1020             stacklevel=3,  # The inner function takes up one level
   1021         )
-> 1023 return fn(*args, **kwargs)

File d:\my\env\python3.10.10\lib\site-packages\vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg\vllm\entrypoints\llm.py:198, in LLM.__init__(self, model, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, cpu_offload_gb, enforce_eager, max_context_len_to_capture, max_seq_len_to_capture, disable_custom_all_reduce, disable_async_output_proc, mm_processor_kwargs, task, **kwargs)
    172     kwargs["disable_log_stats"] = True
    174 engine_args = EngineArgs(
    175     model=model,
    176     task=task,
   (...)
    196     **kwargs,
    197 )
--> 198 self.llm_engine = LLMEngine.from_engine_args(
    199     engine_args, usage_context=UsageContext.LLM_CLASS)
    200 self.request_counter = Counter()

File d:\my\env\python3.10.10\lib\site-packages\vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg\vllm\engine\llm_engine.py:582, in LLMEngine.from_engine_args(cls, engine_args, usage_context, stat_loggers)
    580 executor_class = cls._get_executor_cls(engine_config)
    581 # Create the LLM engine.
--> 582 engine = cls(
    583     **engine_config.to_dict(),
    584     executor_class=executor_class,
    585     log_stats=not engine_args.disable_log_stats,
    586     usage_context=usage_context,
    587     stat_loggers=stat_loggers,
    588 )
    590 return engine

File d:\my\env\python3.10.10\lib\site-packages\vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg\vllm\engine\llm_engine.py:341, in LLMEngine.__init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, speculative_config, decoding_config, observability_config, prompt_adapter_config, executor_class, log_stats, usage_context, stat_loggers, input_registry, use_cached_outputs)
    337 self.input_registry = input_registry
    338 self.input_processor = input_registry.create_input_processor(
    339     model_config)
--> 341 self.model_executor = executor_class(
    342     model_config=model_config,
   (...)
    352 )
    354 if self.model_config.task != "embedding":
    355     self._initialize_kv_caches()

File d:\my\env\python3.10.10\lib\site-packages\vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg\vllm\executor\executor_base.py:47, in ExecutorBase.__init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, speculative_config, prompt_adapter_config, observability_config)
     45 self.prompt_adapter_config = prompt_adapter_config
     46 self.observability_config = observability_config
---> 47 self._init_executor()

File d:\my\env\python3.10.10\lib\site-packages\vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg\vllm\executor\gpu_executor.py:38, in GPUExecutor._init_executor(self)
     33 """Initialize the worker and load the model.
     34 """
     35 assert self.parallel_config.world_size == 1, (
     36     "GPUExecutor only supports single GPU.")
---> 38 self.driver_worker = self._create_worker()
     39 self.driver_worker.init_device()
     40 self.driver_worker.load_model()

File d:\my\env\python3.10.10\lib\site-packages\vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg\vllm\executor\gpu_executor.py:105, in GPUExecutor._create_worker(self, local_rank, rank, distributed_init_method)
    101 def _create_worker(self,
    102                    local_rank: int = 0,
    103                    rank: int = 0,
    104                    distributed_init_method: Optional[str] = None):
--> 105     return create_worker(**self._get_create_worker_kwargs(
    106         local_rank=local_rank,
    107         rank=rank,
    108         distributed_init_method=distributed_init_method))

File d:\my\env\python3.10.10\lib\site-packages\vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg\vllm\executor\gpu_executor.py:24, in create_worker(worker_module_name, worker_class_name, worker_class_fn, **kwargs)
     19     wrapper = WorkerWrapperBase(
     20         worker_module_name=worker_module_name,
     21         worker_class_name=worker_class_name,
     22         worker_class_fn=worker_class_fn,
     23     )
---> 24     wrapper.init_worker(**kwargs)
     25     return wrapper.worker

File d:\my\env\python3.10.10\lib\site-packages\vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg\vllm\worker\worker_base.py:449, in WorkerWrapperBase.init_worker(self, *args, **kwargs)
    446     mod = importlib.import_module(self.worker_module_name)
    447     worker_class = getattr(mod, self.worker_class_name)
--> 449 self.worker = worker_class(*args, **kwargs)
    450 assert self.worker is not None

File d:\my\env\python3.10.10\lib\site-packages\vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg\vllm\worker\worker.py:99, in Worker.__init__(self, model_config, parallel_config, scheduler_config, device_config, cache_config, load_config, local_rank, rank, distributed_init_method, lora_config, speculative_config, prompt_adapter_config, is_driver_worker, model_runner_cls, observability_config)
     97 elif self._is_encoder_decoder_model():
     98     ModelRunnerClass = EncoderDecoderModelRunner
---> 99 self.model_runner: GPUModelRunnerBase = ModelRunnerClass(
    100     model_config,
   (...)
    112 )
    113 # Uninitialized cache engine. Will be initialized by
    114 # initialize_cache.
    115 self.cache_engine: List[CacheEngine]

File d:\my\env\python3.10.10\lib\site-packages\vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg\vllm\worker\model_runner.py:1013, in GPUModelRunnerBase.__init__(self, model_config, parallel_config, scheduler_config, device_config, cache_config, load_config, lora_config, kv_cache_dtype, is_driver_worker, prompt_adapter_config, return_hidden_states, observability_config, input_registry, mm_registry)
   1008 num_attn_heads = self.model_config.get_num_attention_heads(
   1009     self.parallel_config)
   1010 needs_attn_backend = (num_attn_heads != 0
   1011                       or self.model_config.is_attention_free)
-> 1013 self.attn_backend = get_attn_backend(
   1014     self.model_config.get_head_size(),
   (...)
   1019 ) if needs_attn_backend else None
   1020 if self.attn_backend:
   1021     self.attn_state = self.attn_backend.get_state_cls()(
   1022         weakref.proxy(self))

File d:\my\env\python3.10.10\lib\site-packages\vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg\vllm\attention\selector.py:120, in get_attn_backend(head_size, dtype, kv_cache_dtype, block_size, is_attention_free, is_blocksparse)
    118 if backend == _Backend.XFORMERS:
    119     logger.info("Using XFormers backend.")
--> 120     from vllm.attention.backends.xformers import (  # noqa: F401
    121         XFormersBackend)
    122     return XFormersBackend
    123 elif backend == _Backend.ROCM_FLASH:

File d:\my\env\python3.10.10\lib\site-packages\vllm-0.6.3.post2.dev156+g04a3ae0a.d20241030-py3.10.egg\vllm\attention\backends\xformers.py:6
      3 from typing import Any, Dict, List, Optional, Tuple, Type
      5 import torch
----> 6 from xformers import ops as xops
      7 from xformers.ops.fmha.attn_bias import (AttentionBias,
   (...)
     13                                               AttentionMetadata, AttentionType)

File d:\my\env\python3.10.10\lib\site-packages\xformers\ops\__init__.py:8
      6 import torch
----> 8 from .fmha import (
      9     AttentionBias,
   (...)
     22 )
     23 from .indexing import index_select_cat, scaled_index_add
     24 from .ipc import init_ipc

File d:\my\env\python3.10.10\lib\site-packages\xformers\ops\fmha\__init__.py:10
      6 from typing import Any, List, Optional, Sequence, Tuple, Type, Union, cast
      8 import torch
---> 10 from . import (
     11     attn_bias,
   (...)
     19 )

File d:\my\env\python3.10.10\lib\site-packages\xformers\ops\fmha\triton_splitk.py:110
    109 if TYPE_CHECKING or _is_triton_available():
--> 110     from ._triton.splitk_kernels import _fwd_kernel_splitK, _splitK_reduce
    111 else:
    112     _fwd_kernel_splitK = None

File d:\my\env\python3.10.10\lib\site-packages\xformers\ops\fmha\_triton\splitk_kernels.py:632
    629 if sys.version_info >= (3, 9):
    630     # unroll_varargs requires Python 3.9+
    631     for num_groups in [1, 2, 4, 8]:
--> 632         _fwd_kernel_splitK_autotune[num_groups] = autotune_kernel(
    633             _get_splitk_kernel(num_groups)
    634         )

File d:\my\env\python3.10.10\lib\site-packages\xformers\ops\fmha\_triton\splitk_kernels.py:614, in autotune_kernel(kernel)
--> 614 kernel = triton.autotune(
    615     configs=TRITON_CONFIGS,
    616     key=AUTOTUNER_KEY,
    617     use_cuda_graph=True,
    618 )(kernel)
    619 return kernel

TypeError: autotune() got an unexpected keyword argument 'use_cuda_graph'

This looks like a problem inside xformers. Maybe you should use a different backend by setting VLLM_ATTENTION_BACKEND (a list of options can be found here).
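
For reference, a minimal sketch of selecting a backend this way; the model path is a placeholder, and the backend names shown are the ones that appear elsewhere in this thread (check vllm/attention/selector.py in your version for the full list):

import os

# Set the backend before the engine is constructed (setting it before importing
# vllm at all is the safest), otherwise vLLM may already have picked a backend.
os.environ["VLLM_ATTENTION_BACKEND"] = "XFORMERS"  # or "FLASH_ATTN", etc.

from vllm import LLM

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enforce_eager=True)  # placeholder model path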

xiezhipeng-git commented 3 weeks ago

@DarkLight1337 Like this? os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN" — it doesn't work:

WARNING 10-30 15:04:01 _custom_ops.py:19] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
WARNING 10-30 15:04:08 config.py:438] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 10-30 15:04:08 llm_engine.py:243] Initializing an LLM engine (v0.6.3.post2.dev156+g04a3ae0a.d20241030) with config: model='C:\Users\Admin\.cache\modelscope\hub\Qwen\Qwen2___5-7B-Instruct', speculative_config=None, tokenizer='C:\Users\Admin\.cache\modelscope\hub\Qwen\Qwen2___5-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=C:\Users\Admin\.cache\modelscope\hub\Qwen\Qwen2___5-7B-Instruct, num_scheduler_steps=1, chunked_prefill_enabled=False, multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=False, chat_template_text_format=string, mm_processor_kwargs=None)

INFO 10-30 15:04:09 selector.py:267] Cannot use FlashAttention-2 backend because the vllm.vllm_flash_attn package is not found. Make sure that vllm_flash_attn was built and installed (on by default).
INFO 10-30 15:04:09 selector.py:119] Using XFormers backend.

But I have already installed flash-attention:

pip show flash-attn
Name: flash_attn
Version: 2.6.3
Summary: Flash Attention: Fast and Memory-Efficient Exact Attention
Home-page: https://github.com/Dao-AILab/flash-attention
Author: Tri Dao
Author-email: tri@tridao.me
License:
Location: d:\my\env\python3.10.10\lib\site-packages
Requires: einops, torch
Required-by:
PS D:\my\work\study\ai\kaggle_code\arc\kaggle_arc_2024>
DarkLight1337 commented 3 weeks ago

Can you use pytorch SDPA?

xiezhipeng-git commented 3 weeks ago

What is PyTorch SDPA?

pip show torch
Name: torch
Version: 2.5.0+cu124
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3-Clause
Location: d:\my\env\python3.10.10\lib\site-packages
Requires: filelock, fsspec, jinja2, networkx, sympy, typing-extensions
Required-by: accelerate, auto_gptq, bitsandbytes, compressed-tensors, deepspeed, encodec, flash_attn, optimum, peft, stable-baselines3, timm, torchaudio, torchvision, trl, vector-quantize-pytorch, vocos, xformers

DarkLight1337 commented 3 weeks ago

It is built into pytorch, so you should be able to use it as long as pytorch is installed.
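
SDPA here refers to torch.nn.functional.scaled_dot_product_attention, which ships with PyTorch 2.x. A minimal, self-contained sketch with arbitrary example shapes (unrelated to vLLM internals):

import torch
import torch.nn.functional as F

# batch=2, heads=4, seq_len=8, head_dim=16 -- arbitrary example shapes
q = torch.randn(2, 4, 8, 16, device="cuda", dtype=torch.float16)
k = torch.randn(2, 4, 8, 16, device="cuda", dtype=torch.float16)
v = torch.randn(2, 4, 8, 16, device="cuda", dtype=torch.float16)

# PyTorch dispatches to a fused implementation (flash / memory-efficient / math) when one is available.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 4, 8, 16])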

xiezhipeng-git commented 3 weeks ago

Now the problem is that FlashAttention-2 cannot be found. I'm guessing that has nothing to do with SDPA.

xiezhipeng-git commented 3 weeks ago

Is there an SDPA sample? And regarding the message "because the vllm.vllm_flash_attn package is not found" — do I need to install some other vLLM component?

DarkLight1337 commented 3 weeks ago

This is where I'm unable to really help you. I guess vLLM's flash attention package only works on Linux.

DarkLight1337 commented 3 weeks ago

Maybe @dtrifiro can provide some insights here?

xiezhipeng-git commented 3 weeks ago

But flash-attention already supports Windows. So does vLLM's flash-attention need to be rebuilt? Or tell me how to build it from the source code.

xiezhipeng-git commented 3 weeks ago

(screenshot) I noticed a difference (highlighted in the screenshot) and found https://github.com/Dao-AILab/flash-attention/issues/1066. Did the new version remove it? How do I change this? @DarkLight1337 @dtrifiro

xiezhipeng-git commented 3 weeks ago
from flash_attn.flash_attn_interface import flash_attn_func
from flash_attn.flash_attn_interface import flash_attn_with_kvcache
import torch

def main():
    batch_size = 2
    seqlen_q = 1
    seqlen_k = 1
    nheads = 4
    n_kv_heads = 2
    d = 3
    device = "cuda"
    causal = True
    window_size = (-1, -1)
    dtype = torch.float16
    paged_kv_cache_size = None
    cache_seqlens = None
    rotary_cos = None
    rotary_sin = None
    cache_batch_idx = None
    block_table = None
    softmax_scale = None
    rotary_interleaved = False
    alibi_slopes = None
    num_splits = 0
    max_seq_len = 3
    if paged_kv_cache_size is None:
        k_cache = torch.zeros(batch_size, max_seq_len, n_kv_heads, d, device=device, dtype=dtype)
        v_cache = torch.zeros(batch_size, max_seq_len, n_kv_heads, d, device=device, dtype=dtype)
        block_table = None

    prev_q_vals = []
    prev_k_vals = []
    prev_v_vals = []
    torch.manual_seed(0)
    for i in range(0,3):

        print(">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>")
        q = torch.randn(batch_size, seqlen_q, nheads, d, device=device, dtype=dtype)
        k = torch.randn(batch_size, seqlen_k, n_kv_heads, d, device=device, dtype=dtype)
        v = torch.randn(batch_size, seqlen_k, n_kv_heads, d, device=device, dtype=dtype)

        # kv cache
        cache_seqlens = torch.tensor([i] * batch_size, dtype=torch.int32, device=device)
        output_kvcache = flash_attn_with_kvcache(
            q=q,
            k_cache=k_cache,
            v_cache=v_cache,
            k=k,
            v=v,
            rotary_cos=rotary_cos,
            rotary_sin=rotary_sin,
            cache_seqlens=cache_seqlens,
            cache_batch_idx=cache_batch_idx,
            cache_leftpad=None,
            block_table=block_table,
            softmax_scale=softmax_scale,
            causal=causal,
            window_size=window_size,
            softcap=0.0,
            rotary_interleaved=rotary_interleaved,
            alibi_slopes=alibi_slopes,
            num_splits=num_splits,
            return_softmax_lse=False,
        )

        print(f"$$$ output KV CACHE MHA at {i} \n", output_kvcache)

        # non kv cache MHA
        prev_q_vals.append(q)
        prev_k_vals.append(k)
        prev_v_vals.append(v)
        output_2 = flash_attn_func(
            q=q, 
            k=torch.concat(prev_k_vals, axis=1), 
            v=torch.concat(prev_v_vals, axis=1), 
            dropout_p=0.0,
            softmax_scale=None, 
            causal=causal, 
            window_size=window_size,
            softcap=0.0,
            alibi_slopes=None,
            deterministic=False,
            return_attn_probs=False)

        print(f"!!! output MHA NON KV CACHE at {i} \n", output_2)

main()

$$$ output KV CACHE MHA at 0 
 tensor([[[[ 2.5449, -0.7163, -0.4934],
           [ 2.5449, -0.7163, -0.4934],
           [ 0.1267,  0.1014, -0.4036],
           [ 0.1267,  0.1014, -0.4036]]],

         [[[ 0.9023,  0.8101, -0.6885],
           [ 0.9023,  0.8101, -0.6885],
           [ 0.1372,  1.0381,  0.0925],
           [ 0.1372,  1.0381,  0.0925]]]], device='cuda:0', dtype=torch.float16)

!!! output MHA NON KV CACHE at 0 
 tensor([[[[ 2.5449, -0.7163, -0.4934],
           [ 2.5449, -0.7163, -0.4934],
           [ 0.1267,  0.1014, -0.4036],
           [ 0.1267,  0.1014, -0.4036]]],

         [[[ 0.9023,  0.8101, -0.6885],
           [ 0.9023,  0.8101, -0.6885],
           [ 0.1372,  1.0381,  0.0925],
           [ 0.1372,  1.0381,  0.0925]]]], device='cuda:0', dtype=torch.float16)

$$$ output KV CACHE MHA at 1 
 tensor([[[[ 1.8740, -0.3555, -0.2308],
           [ 1.8223, -0.3279, -0.2108],
           [ 0.6812, -0.3042,  0.1327],
           [ 0.8237, -0.4082,  0.2703]]],

         [[[ 0.0036, -0.6611, -1.3848],
           [ 0.2605, -0.2406, -1.1865],
           [ 0.1748,  0.3794, -0.1744],
           [ 0.2352, -0.6782, -0.6030]]]], device='cuda:0', dtype=torch.float16)

!!! output MHA NON KV CACHE at 1 
 tensor([[[[ 1.8740, -0.3555, -0.2308],
           [ 1.8223, -0.3279, -0.2108],
           [ 0.6812, -0.3042,  0.1327],
           [ 0.8237, -0.4082,  0.2703]]],

         [[[ 0.0036, -0.6611, -1.3848],
           [ 0.2605, -0.2406, -1.1865],
           [ 0.1748,  0.3794, -0.1744],
           [ 0.2352, -0.6782, -0.6030]]]], device='cuda:0', dtype=torch.float16)

$$$ output KV CACHE MHA at 2 
 tensor([[[[-0.2815,  0.2520, -0.2242],
           [ 0.1653,  0.0293, -0.3726],
           [ 0.5005, -0.0624, -0.0492],
           [ 0.3440,  0.3044, -0.2172]]],

         [[[ 0.2651, -0.1628, -1.2080],
           [ 0.6064,  0.4153, -0.9517],
           [ 0.7690,  0.0339,  0.0311],
           [ 0.7075, -0.0425, -0.0394]]]], device='cuda:0', dtype=torch.float16)

!!! output MHA NON KV CACHE at 2 
 tensor([[[[-0.2815,  0.2520, -0.2242],
           [ 0.1653,  0.0293, -0.3726],
           [ 0.5005, -0.0624, -0.0492],
           [ 0.3440,  0.3044, -0.2172]]],

         [[[ 0.2651, -0.1628, -1.2080],
           [ 0.6064,  0.4153, -0.9517],
           [ 0.7690,  0.0339,  0.0311],
           [ 0.7075, -0.0425, -0.0394]]]], device='cuda:0', dtype=torch.float16)

But I can run this successfully, and the KV-cache and non-KV-cache outputs match.

DarkLight1337 commented 3 weeks ago

vLLM uses a fork of the flash_attn repo which can be found here

xiezhipeng-git commented 3 weeks ago

ImportError("cannot import name 'flash_attn_varlen_func' from 'vllm.vllm_flash_attn' (unknown location)")

xiezhipeng-git commented 3 weeks ago

(screenshot) I can find it in flash-attention, but not in vllm_flash_attn. So is this a vLLM 0.6.3 error? @DarkLight1337 @dtrifiro

DarkLight1337 commented 3 weeks ago

I can find it in flash-attention, but not in vllm_flash_attn. So is this a vLLM 0.6.3 error?

It's listed in this file: https://github.com/vllm-project/flash-attention/blob/5259c586c403a4e4d8bf69973c159b40cc346fb9/vllm_flash_attn/__init__.py
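
As a quick check (a minimal sketch, assuming the 0.6.x package layout), importing the function directly shows whether the compiled vllm_flash_attn extension is actually usable; if it is missing, this raises the same ImportError quoted above:

try:
    # Only succeeds if the vllm_flash_attn extension was built and copied into the install.
    from vllm.vllm_flash_attn import flash_attn_varlen_func
    print("vllm_flash_attn is available:", flash_attn_varlen_func.__module__)
except ImportError as e:
    print("vllm_flash_attn is not usable:", e)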

xiezhipeng-git commented 3 weeks ago

@DarkLight1337 Do you mean I need to replace all the vllm_flash_attn files? Why isn't this updated in vllm-project?

DarkLight1337 commented 3 weeks ago

I am not sure what you mean. Those functions are defined inside vllm_flash_attn as well.

xiezhipeng-git commented 3 weeks ago

The contents of the vllm_flash_attn directory in the vllm-project repo are now different from those in https://github.com/vllm-project/flash-attention. They belong to different versions; in other words, the main source tree of vllm-project has not been updated with the latest vllm_flash_attn.

DarkLight1337 commented 3 weeks ago

After you clone the vLLM repo, you should build from source using the provided instructions (in your case, it is better to perform a full build to make sure you have the latest version of the compiled binaries). The build should download the files from the vLLM flash-attention fork and copy them into the main vLLM repo.
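
For reference, a rough sketch of the usual full source-build steps (the generic vLLM build-from-source flow; Windows is not officially supported, so the exact toolchain steps on your machine may differ):

git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .   # full build: compiles the CUDA extensions and pulls in the vllm-flash-attn sources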

xiezhipeng-git commented 3 weeks ago

Right now I am installing from source. But after cloning the vLLM repo, the source code is different.

DarkLight1337 commented 3 weeks ago

Right now I am installing from source. But after cloning the vLLM repo, the source code is different.

In the main vLLM repo, the vllm_flash_attn directory should initially be empty like this. If this isn't the case, you can try deleting those files and rebuilding vLLM to make sure you get the updated version.

xiezhipeng-git commented 3 weeks ago

It's empty on my side too. So vllm_flash_attn is not included as a subproject? Do I need to manually clone the vllm_flash_attn project (https://github.com/vllm-project/flash-attention) first and then reinstall?

DarkLight1337 commented 3 weeks ago

It's empty on my side too. So vllm_flash_attn is not included as a subproject? Do I need to manually clone the vllm_flash_attn project (https://github.com/vllm-project/flash-attention) first and then reinstall?

How are you installing vLLM from source? Can you show the commands which you've used?

xiezhipeng-git commented 3 weeks ago

@dtrifiro The problem is in vllm-project/flash-attention. That project pins torch to 2.4.0 and force-installs torch 2.4.0 (torch should not be installed there at all; it should raise an error and let the user install it themselves). It also picks the highest Python version it can find by default, but the Python that has my torch package is 3.10.10, which is not the highest version, and I don't know where to change that. As soon as I open the project, a CMakeLists.txt is generated automatically that pins the Python version to my local 3.12.4, and I couldn't figure out how to change which Python it builds against, so I uninstalled the newer Python and started compiling.

xiezhipeng-git commented 3 weeks ago

@DarkLight1337 When the build succeeds and I get vllm_flash_attn_c.pyd (plus the .lib and .exp files), how can I use them? (screenshot)

DarkLight1337 commented 3 weeks ago

This is outside of my domain as I'm not involved with the vLLM build process. @dtrifiro may be able to help you more.

xiezhipeng-git commented 3 weeks ago

The problem has not been resolved and needs to be reopened. Also, can you help me contact @dtrifiro (Daniele)? Only he or his project team can solve it, but he doesn't respond when I @ him.

(Replying to the GitHub notification email:)

Closed #9701 as completed via #9721.

YiFraternity commented 3 weeks ago

The network environment in China is poor, and I don't want to keep fighting with torch versions. After a previously successful pip install vllm, I also ran into the problem of torch being replaced by the CPU build. After reinstalling the CUDA build of torch I ran vLLM once, and it failed with the error below (I'm not sure whether this error was later superseded):

WARNING 10-24 21:42:41 _custom_ops.py:19] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
ERROR 10-24 21:42:49 registry.py:267] Error in inspecting model architecture 'Qwen2ForCausalLM'
ERROR 10-24 21:42:49 registry.py:267] Traceback (most recent call last):
ERROR 10-24 21:42:49 registry.py:267]   File "d:\my\env\python3.10.10\lib\site-packages\vllm\model_executor\models\registry.py", line 429, in _run_in_subprocess
ERROR 10-24 21:42:49 registry.py:267]     returned.check_returncode()
ERROR 10-24 21:42:49 registry.py:267]   File "d:\my\env\python3.10.10\lib\subprocess.py", line 457, in check_returncode
ERROR 10-24 21:42:49 registry.py:267]     raise CalledProcessError(self.returncode, self.args, self.stdout,
ERROR 10-24 21:42:49 registry.py:267] subprocess.CalledProcessError: Command '['d:\my\env\python3.10.10\python.exe', '-m', 'vllm.model_executor.models.registry']' returned non-zero exit status 1.
ERROR 10-24 21:42:49 registry.py:267]
ERROR 10-24 21:42:49 registry.py:267] The above exception was the direct cause of the following exception:
ERROR 10-24 21:42:49 registry.py:267]
ERROR 10-24 21:42:49 registry.py:267] Traceback (most recent call last):
ERROR 10-24 21:42:49 registry.py:267]   File "d:\my\env\python3.10.10\lib\site-packages\vllm\model_executor\models\registry.py", line 265, in _try_inspect_model_cls
ERROR 10-24 21:42:49 registry.py:267]     return model.inspect_model_cls()
ERROR 10-24 21:42:49 registry.py:267]   File "d:\my\env\python3.10.10\lib\site-packages\vllm\model_executor\models\registry.py", line 227, in inspect_model_cls
ERROR 10-24 21:42:49 registry.py:267]     return _run_in_subprocess(
ERROR 10-24 21:42:49 registry.py:267]   File "d:\my\env\python3.10.10\lib\site-packages\vllm\model_executor\models\registry.py", line 432, in _run_in_subprocess
ERROR 10-24 21:42:49 registry.py:267]     raise RuntimeError(f"Error raised in subprocess:\n"
ERROR 10-24 21:42:49 registry.py:267] RuntimeError: Error raised in subprocess:
ERROR 10-24 21:42:49 registry.py:267] d:\my\env\python3.10.10\lib\runpy.py:126: RuntimeWarning: 'vllm.model_executor.models.registry' found in sys.modules after import of package 'vllm.model_executor.models', but prior to execution of 'vllm.model_executor.models.registry'; this may result in unpredictable behaviour
ERROR 10-24 21:42:49 registry.py:267]   warn(RuntimeWarning(msg))
ERROR 10-24 21:42:49 registry.py:267] Traceback (most recent call last):
ERROR 10-24 21:42:49 registry.py:267]   File "d:\my\env\python3.10.10\lib\runpy.py", line 196, in _run_module_as_main
ERROR 10-24 21:42:49 registry.py:267]     return _run_code(code, main_globals, None,
ERROR 10-24 21:42:49 registry.py:267]   File "d:\my\env\python3.10.10\lib\runpy.py", line 86, in _run_code
ERROR 10-24 21:42:49 registry.py:267]     exec(code, run_globals)
ERROR 10-24 21:42:49 registry.py:267]   File "d:\my\env\python3.10.10\lib\site-packages\vllm\model_executor\models\registry.py", line 453, in <module>
ERROR 10-24 21:42:49 registry.py:267]     _run()
ERROR 10-24 21:42:49 registry.py:267]   File "d:\my\env\python3.10.10\lib\site-packages\vllm\model_executor\models\registry.py", line 448, in _run
ERROR 10-24 21:42:49 registry.py:267]     with open(output_file, "wb") as f:
ERROR 10-24 21:42:49 registry.py:267] PermissionError: [Errno 13] Permission denied: 'C:\Users\Admin\AppData\Local\Temp\tmp6cx7k05c'
ERROR 10-24 21:42:49 registry.py:267]

ValueError                                Traceback (most recent call last)
Cell In[2], line 5
      1 from vllm import LLM, SamplingParams
      3 # model_dir='Qwen2.5-14B-Instruct-GPTQ-Int4'
----> 5 llm = LLM(model=model_dir,enforce_eager=True)
      6 sampling_params = SamplingParams( top_p=0.9, max_tokens=512,top_k=10)
      8 prompt = "1+1等于几"

File d:\my\env\python3.10.10\lib\site-packages\vllm\entrypoints\llm.py:177, in LLM.__init__(self, model, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, cpu_offload_gb, enforce_eager, max_context_len_to_capture, max_seq_len_to_capture, disable_custom_all_reduce, disable_async_output_proc, mm_processor_kwargs, **kwargs)
    152 kwargs["disable_log_stats"] = True
    154 engine_args = EngineArgs(
    155     model=model,
    156     tokenizer=tokenizer,
   (...)
    175     **kwargs,
    176 )
--> 177 self.llm_engine = LLMEngine.from_engine_args(
    178     engine_args, usage_context=UsageContext.LLM_CLASS)
    179 self.request_counter = Counter()

File d:\my\env\python3.10.10\lib\site-packages\vllm\engine\llm_engine.py:570, in LLMEngine.from_engine_args(cls, engine_args, usage_context, stat_loggers)
    568 """Creates an LLM engine from the engine arguments."""
    569 # Create the engine configs.
--> 570 engine_config = engine_args.create_engine_config()
    571 executor_class = cls._get_executor_cls(engine_config)
    572 # Create the LLM engine.

File d:\my\env\python3.10.10\lib\site-packages\vllm\engine\arg_utils.py:903, in EngineArgs.create_engine_config(self)
    898 assert self.cpu_offload_gb >= 0, (
    899     "CPU offload space must be non-negative"
    900     f", but got {self.cpu_offload_gb}")
    902 device_config = DeviceConfig(device=self.device)
--> 903 model_config = self.create_model_config()
    905 if model_config.is_multimodal_model:
    906     if self.enable_prefix_caching:

File d:\my\env\python3.10.10\lib\site-packages\vllm\engine\arg_utils.py:839, in EngineArgs.create_model_config(self)
    838 def create_model_config(self) -> ModelConfig:
--> 839     return ModelConfig(
    840         model=self.model,
    841         # We know this is not None because we set it in __post_init__
    842         tokenizer=cast(str, self.tokenizer),
    843         tokenizer_mode=self.tokenizer_mode,
    844         trust_remote_code=self.trust_remote_code,
    845         dtype=self.dtype,
    846         seed=self.seed,
    847         revision=self.revision,
    848         code_revision=self.code_revision,
    849         rope_scaling=self.rope_scaling,
    850         rope_theta=self.rope_theta,
    851         tokenizer_revision=self.tokenizer_revision,
    852         max_model_len=self.max_model_len,
    853         quantization=self.quantization,
    854         quantization_param_path=self.quantization_param_path,
    855         enforce_eager=self.enforce_eager,
    856         max_context_len_to_capture=self.max_context_len_to_capture,
    857         max_seq_len_to_capture=self.max_seq_len_to_capture,
    858         max_logprobs=self.max_logprobs,
    859         disable_sliding_window=self.disable_sliding_window,
    860         skip_tokenizer_init=self.skip_tokenizer_init,
    861         served_model_name=self.served_model_name,
    862         limit_mm_per_prompt=self.limit_mm_per_prompt,
    863         use_async_output_proc=not self.disable_async_output_proc,
    864         override_neuron_config=self.override_neuron_config,
    865         config_format=self.config_format,
    866         mm_processor_kwargs=self.mm_processor_kwargs,
    867     )

File d:\my\env\python3.10.10\lib\site-packages\vllm\config.py:200, in ModelConfig.__init__(self, model, tokenizer, tokenizer_mode, trust_remote_code, dtype, seed, revision, code_revision, rope_scaling, rope_theta, tokenizer_revision, max_model_len, spec_target_max_model_len, quantization, quantization_param_path, enforce_eager, max_context_len_to_capture, max_seq_len_to_capture, max_logprobs, disable_sliding_window, skip_tokenizer_init, served_model_name, limit_mm_per_prompt, use_async_output_proc, override_neuron_config, config_format, mm_processor_kwargs)
    192 self.max_model_len = _get_and_verify_max_len(
    193     hf_config=self.hf_text_config,
    194     max_model_len=max_model_len,
    195     disable_sliding_window=self.disable_sliding_window,
    196     sliding_window_len=self.get_hf_config_sliding_window(),
    197     spec_target_max_model_len=spec_target_max_model_len)
    198 self.served_model_name = get_served_model_name(model,
    199     served_model_name)
--> 200 self.multimodal_config = self._init_multimodal_config(
    201     limit_mm_per_prompt)
    202 if not self.skip_tokenizer_init:
    203     self._verify_tokenizer_mode()

File d:\my\env\python3.10.10\lib\site-packages\vllm\config.py:219, in ModelConfig._init_multimodal_config(self, limit_mm_per_prompt)
    215 def _init_multimodal_config(
    216     self, limit_mm_per_prompt: Optional[Mapping[str, int]]
    217 ) -> Optional["MultiModalConfig"]:
    218     architectures = getattr(self.hf_config, "architectures", [])
--> 219     if ModelRegistry.is_multimodal_model(architectures):
    220         return MultiModalConfig(limit_per_prompt=limit_mm_per_prompt or {})
    222     if limit_mm_per_prompt:

File d:\my\env\python3.10.10\lib\site-packages\vllm\model_executor\models\registry.py:387, in _ModelRegistry.is_multimodal_model(self, architectures)
    383 def is_multimodal_model(
    384     self,
    385     architectures: Union[str, List[str]],
    386 ) -> bool:
--> 387     return self.inspect_model_cls(architectures).supports_multimodal

File d:\my\env\python3.10.10\lib\site-packages\vllm\model_executor\models\registry.py:356, in _ModelRegistry.inspect_model_cls(self, architectures)
    353 if model_info is not None:
    354     return model_info
--> 356 return self._raise_for_unsupported(architectures)

File d:\my\env\python3.10.10\lib\site-packages\vllm\model_executor\models\registry.py:317, in _ModelRegistry._raise_for_unsupported(self, architectures)
    314 def _raise_for_unsupported(self, architectures: List[str]):
    315     all_supported_archs = self.get_supported_archs()
--> 317     raise ValueError(
    318         f"Model architectures {architectures} are not supported for now. "
    319         f"Supported architectures: {all_supported_archs}")

ValueError: Model architectures ['Qwen2ForCausalLM'] are not supported for now. Supported architectures: ['AquilaModel', 'AquilaForCausalLM', 'ArcticForCausalLM', 'BaiChuanForCausalLM', 'BaichuanForCausalLM', 'BloomForCausalLM', 'CohereForCausalLM', 'DbrxForCausalLM', 'DeciLMForCausalLM', 'DeepseekForCausalLM', 'DeepseekV2ForCausalLM', 'ExaoneForCausalLM', 'FalconForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTJForCausalLM', 'GPTNeoXForCausalLM', 'GraniteForCausalLM', 'GraniteMoeForCausalLM', 'InternLMForCausalLM', 'InternLM2ForCausalLM', 'JAISLMHeadModel', 'JambaForCausalLM', 'LlamaForCausalLM', 'LLaMAForCausalLM', 'MambaForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'QuantMixtralForCausalLM', 'MptForCausalLM', 'MPTForCausalLM', 'MiniCPMForCausalLM', 'MiniCPM3ForCausalLM', 'NemotronForCausalLM', 'OlmoForCausalLM', 'OlmoeForCausalLM', 'OPTForCausalLM', 'OrionForCausalLM', 'PersimmonForCausalLM', 'PhiForCausalLM', 'Phi3ForCausalLM', 'Phi3SmallForCausalLM', 'PhiMoEForCausalLM', 'Qwen2ForCausalLM', 'Qwen2MoeForCausalLM', 'RWForCausalLM', 'StableLMEpochForCausalLM', 'StableLmForCausalLM', 'Starcoder2ForCausalLM', 'SolarForCausalLM', 'XverseForCausalLM', 'BartModel', 'BartForConditionalGeneration', 'Gemma2Model', 'MistralModel', 'Qwen2ForRewardModel', 'Phi3VForCausalLM', 'Blip2ForConditionalGeneration', 'ChameleonForConditionalGeneration', 'ChatGLMModel', 'ChatGLMForConditionalGeneration', 'FuyuForCausalLM', 'InternVLChatModel', 'LlavaForConditionalGeneration', 'LlavaNextForConditionalGeneration', 'LlavaNextVideoForConditionalGeneration', 'LlavaOnevisionForConditionalGeneration', 'MiniCPMV', 'MolmoForCausalLM', 'NVLM_D', 'PaliGemmaForConditionalGeneration', 'PixtralForConditionalGeneration', 'QWenLMHeadModel', 'Qwen2VLForConditionalGeneration', 'UltravoxModel', 'MllamaForConditionalGeneration', 'EAGLEModel', 'MedusaModel', 'MLPSpeculatorPreTrainedModel']

Same problem here.
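
For what it's worth, a quick way to confirm which torch build the installation left behind is to query torch directly (a minimal check, nothing vllm-specific; the versions printed will differ per environment):

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"

If the version ends in +cpu, torch.version.cuda is None, or cuda.is_available() prints False, the CPU wheel has replaced the CUDA build and needs to be reinstalled before vllm can use the GPU.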

xiezhipeng-git commented 3 weeks ago

-- USE_CUDNN is set to 0. Compiling without cuDNN support
-- USE_CUSPARSELT is set to 0. Compiling without cuSPARSELt support
-- USE_CUDSS is set to 0. Compiling without cuDSS support
-- USE_CUFILE is set to 0. Compiling without cuFile support
-- Autodetected CUDA architecture(s): 8.9
-- Added CUDA NVCC flags for: -gencode;arch=compute_89,code=sm_89
-- CUDA supported arches: 8.0;8.6;8.9;9.0
-- CUDA target arches: 89-real
-- Configuring done (7.5s)
-- Generating done (0.1s)
-- Build files have been written to: D:/my/work/LLM/vllm/vllm-flash-attention/flash-attention
Error: could not load cache
Traceback (most recent call last):
  File "D:\my\work\LLM\vllm\vllm-flash-attention\flash-attention\setup.py", line 325, in <module>
    setup(
  File "D:\my\env\python3.10.10\lib\site-packages\setuptools\__init__.py", line 117, in setup
    return distutils.core.setup(**attrs)
  File "D:\my\env\python3.10.10\lib\site-packages\setuptools\_distutils\core.py", line 183, in setup
    return run_commands(dist)
  File "D:\my\env\python3.10.10\lib\site-packages\setuptools\_distutils\core.py", line 199, in run_commands
    dist.run_commands()
  File "D:\my\env\python3.10.10\lib\site-packages\setuptools\_distutils\dist.py", line 954, in run_commands
    self.run_command(cmd)
  File "D:\my\env\python3.10.10\lib\site-packages\setuptools\dist.py", line 999, in run_command
    super().run_command(command)
  File "D:\my\env\python3.10.10\lib\site-packages\setuptools\_distutils\dist.py", line 973, in run_command
    cmd_obj.run()
  File "D:\my\env\python3.10.10\lib\site-packages\setuptools\command\bdist_wheel.py", line 410, in run
    self.run_command("build")
  File "D:\my\env\python3.10.10\lib\site-packages\setuptools\_distutils\cmd.py", line 316, in run_command
    self.distribution.run_command(command)
  File "D:\my\env\python3.10.10\lib\site-packages\setuptools\dist.py", line 999, in run_command
    super().run_command(command)
  File "D:\my\env\python3.10.10\lib\site-packages\setuptools\_distutils\dist.py", line 973, in run_command
    cmd_obj.run()
  File "D:\my\env\python3.10.10\lib\site-packages\setuptools\_distutils\command\build.py", line 135, in run
    self.run_command(cmd_name)
  File "D:\my\env\python3.10.10\lib\site-packages\setuptools\_distutils\cmd.py", line 316, in run_command
    self.distribution.run_command(command)
  File "D:\my\env\python3.10.10\lib\site-packages\setuptools\dist.py", line 999, in run_command
    super().run_command(command)
  File "D:\my\env\python3.10.10\lib\site-packages\setuptools\_distutils\dist.py", line 973, in run_command
    cmd_obj.run()
  File "D:\my\env\python3.10.10\lib\site-packages\setuptools\command\build_ext.py", line 98, in run
    _build_ext.run(self)
  File "D:\my\env\python3.10.10\lib\site-packages\Cython\Distutils\old_build_ext.py", line 186, in run
    _build_ext.build_ext.run(self)
  File "D:\my\env\python3.10.10\lib\site-packages\setuptools\_distutils\command\build_ext.py", line 359, in run
    self.build_extensions()
  File "D:\my\work\LLM\vllm\vllm-flash-attention\flash-attention\setup.py", line 257, in build_extensions
    subprocess.check_call(["cmake", *build_args], cwd=self.build_temp)
  File "D:\my\env\python3.10.10\lib\subprocess.py", line 369, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cmake', '--build', '.', '-j=32', '--target=vllm_flash_attn_c']' returned non-zero exit status 1.

@dtrifiro I can't resolve it. What should I do?
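
The log above only shows that the cmake child process exited with status 1; the actual compiler error is not printed. One way to surface it (a sketch, assuming the temporary build directory that setup.py created still exists; the directory placeholder below is not a real path) is to re-run the failing target serially with verbose output:

cd <the build_temp directory created by setup.py>
cmake --build . --target vllm_flash_attn_c -j 1 --verbose

Building with a single job keeps the first real error from being interleaved with output from other compiler processes.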

zhezhan commented 2 weeks ago

It is still not resolved.
I have torch 2.5.1+cu121, but installing vllm still re-installs torch 2.5.0, which breaks my CUDA-enabled torch.
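
Until the torch pin is relaxed, one possible workaround (a sketch, not an official fix; the version and the cu121 index URL are assumptions that must match your local CUDA setup) is to let pip install vllm and then force-reinstall the CUDA build of torch on top of it, without touching any other dependencies:

pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cu121 --force-reinstall --no-deps

pip may afterwards warn that the installed torch does not match the version vllm pinned, but the CUDA build stays in place.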