neviaumi opened this issue 3 weeks ago
I also tested with the official Phi-3.5 model; it produces the same error.

Stack trace:
(experimental-vllm-3.12) ➜ experimental-vllm git:(main) pdm run ./src/experimental_vllm/main.py
WARNING 10-29 12:25:22 _custom_ops.py:19] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
INFO 10-29 12:25:22 importing.py:10] Triton not installed; certain GPU-related functions will not be available.
INFO 10-29 12:25:34 config.py:1664] Downcasting torch.float32 to torch.float16.
WARNING 10-29 12:25:43 config.py:321] gguf quantization is not fully optimized yet. The speed can be slower than non-quantized models.
WARNING 10-29 12:25:43 config.py:380] Async output processing is only supported for CUDA, TPU, XPU. Disabling it for other platforms.
INFO 10-29 12:25:43 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post1) with config: model='/Users/davidng/WebstormProjects/experimental-vllm/.model/phi-3.5-gguf/Phi-3.5-mini-instruct-Q4_K_M.gguf', speculative_config=None, tokenizer='microsoft/Phi-3.5-mini-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.GGUF, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gguf, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/Users/davidng/WebstormProjects/experimental-vllm/.model/phi-3.5-gguf/Phi-3.5-mini-instruct-Q4_K_M.gguf, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=False, mm_processor_kwargs=None)
WARNING 10-29 12:25:50 cpu_executor.py:327] float16 is not supported on CPU, casting to bfloat16.
WARNING 10-29 12:25:50 cpu_executor.py:332] CUDA graph is not supported on CPU, fallback to the eager mode.
WARNING 10-29 12:25:50 cpu_executor.py:362] Environment variable VLLM_CPU_KVCACHE_SPACE (GB) for CPU backend is not set, using 4 by default.
INFO 10-29 12:25:50 selector.py:224] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 10-29 12:25:50 selector.py:115] Using XFormers backend.
Traceback (most recent call last):
File "/Users/davidng/WebstormProjects/experimental-vllm/./src/experimental_vllm/main.py", line 6, in <module>
llm = LLM(model=model, device="cpu", tokenizer="microsoft/Phi-3.5-mini-instruct", max_model_len=8192)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/davidng/WebstormProjects/experimental-vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 177, in __init__
self.llm_engine = LLMEngine.from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/davidng/WebstormProjects/experimental-vllm/.venv/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 573, in from_engine_args
engine = cls(
^^^^
File "/Users/davidng/WebstormProjects/experimental-vllm/.venv/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 334, in __init__
self.model_executor = executor_class(
^^^^^^^^^^^^^^^
File "/Users/davidng/WebstormProjects/experimental-vllm/.venv/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 47, in __init__
self._init_executor()
File "/Users/davidng/WebstormProjects/experimental-vllm/.venv/lib/python3.12/site-packages/vllm/executor/cpu_executor.py", line 97, in _init_executor
self.driver_worker = self._create_worker()
^^^^^^^^^^^^^^^^^^^^^
File "/Users/davidng/WebstormProjects/experimental-vllm/.venv/lib/python3.12/site-packages/vllm/executor/cpu_executor.py", line 155, in _create_worker
wrapper.init_worker(**kwargs)
File "/Users/davidng/WebstormProjects/experimental-vllm/.venv/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 449, in init_worker
self.worker = worker_class(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/davidng/WebstormProjects/experimental-vllm/.venv/lib/python3.12/site-packages/vllm/worker/cpu_worker.py", line 169, in __init__
self.model_runner: CPUModelRunner = ModelRunnerClass(
^^^^^^^^^^^^^^^^^
File "/Users/davidng/WebstormProjects/experimental-vllm/.venv/lib/python3.12/site-packages/vllm/worker/cpu_model_runner.py", line 421, in __init__
self.attn_backend = get_attn_backend(
^^^^^^^^^^^^^^^^^
File "/Users/davidng/WebstormProjects/experimental-vllm/.venv/lib/python3.12/site-packages/vllm/attention/selector.py", line 116, in get_attn_backend
from vllm.attention.backends.xformers import ( # noqa: F401
File "/Users/davidng/WebstormProjects/experimental-vllm/.venv/lib/python3.12/site-packages/vllm/attention/backends/xformers.py", line 6, in <module>
from xformers import ops as xops
ModuleNotFoundError: No module named 'xformers'
Your current environment
How would you like to use vllm
I want to run inference of a [Phi-3.5 GGUF](https://huggingface.co/bartowski/Phi-3.5-mini-instruct-GGUF) model, but I don't know how to integrate it with vLLM.
Here is my attempt:
Source here
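For context, here is a minimal sketch of what my main.py does, reconstructed from the traceback above: the model path and the LLM(...) call are taken directly from the log, while the prompt and the generation step afterwards are assumptions about the rest of the script.

```python
from vllm import LLM, SamplingParams

# Local GGUF weights (path taken from the engine config in the log);
# the tokenizer is loaded from the Hugging Face repo.
model = "/Users/davidng/WebstormProjects/experimental-vllm/.model/phi-3.5-gguf/Phi-3.5-mini-instruct-Q4_K_M.gguf"

# This is the call that fails in the traceback (main.py, line 6).
llm = LLM(
    model=model,
    device="cpu",
    tokenizer="microsoft/Phi-3.5-mini-instruct",
    max_model_len=8192,
)

# Assumed follow-up: a simple generation once the engine initializes.
outputs = llm.generate(["Hello, who are you?"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```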