vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: Running Phi3.5 on Intel x86 MacBook Pro? #9795

Open neviaumi opened 3 weeks ago

neviaumi commented 3 weeks ago

Your current environment

Collecting environment information...
WARNING 10-29 12:20:54 _custom_ops.py:19] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
INFO 10-29 12:20:54 importing.py:10] Triton not installed; certain GPU-related functions will not be available.
PyTorch version: 2.2.2
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 12.7.6 (x86_64)
GCC version: Could not collect
Clang version: 12.0.0 (clang-1200.0.26.2)
CMake version: Could not collect
Libc version: N/A

Python version: 3.12.4 (main, Oct 26 2024, 19:58:57) [Clang 12.0.0 (clang-1200.0.26.2)] (64-bit runtime)
Python platform: macOS-12.7.6-x86_64-i386-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz

Versions of relevant libraries:
[pdm] numpy==1.26.4
[pdm] nvidia-cublas-cu12==12.1.3.1; platform_system == "Linux" and platform_machine == "x86_64"
[pdm] nvidia-cuda-cupti-cu12==12.1.105; platform_system == "Linux" and platform_machine == "x86_64"
[pdm] nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == "Linux" and platform_machine == "x86_64"
[pdm] nvidia-cuda-runtime-cu12==12.1.105; platform_system == "Linux" and platform_machine == "x86_64"
[pdm] nvidia-cudnn-cu12==8.9.2.26; platform_system == "Linux" and platform_machine == "x86_64"
[pdm] nvidia-cufft-cu12==11.0.2.54; platform_system == "Linux" and platform_machine == "x86_64"
[pdm] nvidia-curand-cu12==10.3.2.106; platform_system == "Linux" and platform_machine == "x86_64"
[pdm] nvidia-cusolver-cu12==11.4.5.107; platform_system == "Linux" and platform_machine == "x86_64"
[pdm] nvidia-cusparse-cu12==12.1.0.106; platform_system == "Linux" and platform_machine == "x86_64"
[pdm] nvidia-ml-py==12.560.30
[pdm] nvidia-nccl-cu12==2.19.3; platform_system == "Linux" and platform_machine == "x86_64"
[pdm] nvidia-nvjitlink-cu12==12.6.77; platform_system == "Linux" and platform_machine == "x86_64"
[pdm] nvidia-nvtx-cu12==12.1.105; platform_system == "Linux" and platform_machine == "x86_64"
[pdm] pyzmq==26.2.0
[pdm] torch==2.2.2
[pdm] torchvision==0.17.2
[pdm] transformers==4.46.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.3.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect

How would you like to use vllm

I want to run inference with a [Phi-3.5 GGUF](https://huggingface.co/bartowski/Phi-3.5-mini-instruct-GGUF), but I don't know how to integrate it with vLLM.

Here is my attempt:

import os

from vllm import LLM

model = os.path.join(os.getcwd(), ".model", "phi-3.5-gguf", "Phi-3.5-mini-instruct-Q4_K_M.gguf")

llm = LLM(model=model, device="cpu", tokenizer="microsoft/Phi-3.5-mini-instruct", max_model_len=8192)

outputs = llm.generate("""<|system|>
You are AI system that able to extract country from address and understand the boundaries of Country.<|end|>
<|user|>
Extract country from given address and report do the country extracted within United Kingdom? Generated response in JSON Object format with 2 key, 'extractedCountry' (string) and 'withInUK' (boolean) Given Address: Heineken UK Limited,3-4 Broadway Park,Edinburgh,EH12 9JZ.
<|end|>
<|assistant|>
{"extractedCountry":"Scotland","withInUK":true}<|end|>
<|user|>
Given Address: Brewed at:Sharp's Brewery Ltd.,Rock,Cornwall,PL27 6NU,UK.MCBC (Ireland) DAC,Block J1 Unit Centre,Maynooth Business Campus,Straffan Road,Republic of Ireland.<|end|>
<|assistant|>
{"extractedCountry":"Republic of Ireland","withInUK":false}<|end|>
<|user|>
Given Address: Brewed and Canned by:Birra Peroni S.r.l.,Via Birolli,8 - Roma,Italy.For:Asahi UK Ltd,Asahi House,88-100 Chertsey Road,Woking,GU21 5BJ,UK.<|end|>
<|assistant|>
{"extractedCountry":"Italy","withInUK":false}<|end|>
<|user|>
Given Address: Brewed & canned by:Camden Town Brewery,55-59 Wilkin Street,Mews,NW5 3NN,London,UK.<|end|>
<|assistant|>
{"extractedCountry":"England","withInUK":true}<|end|>
<|user|>
Given Address: Jubel Ltd,170 Kennington Lane,London,SE11 5DP.<|end|>
<|assistant|>
{"extractedCountry":"England","withInUK":true}<|end|>
<|user|>
Given Address: Brewed by:Heineken UK Limited,3-4 Broadway Park,Edinburgh,EH12 9JZ.HBBV.,Tweede Weteringplantsoen 21,1017 ZD Amsterdam,NL.<|end|>
<|assistant|>
{"extractedCountry":"Netherlands","withInUK":false}<|end|>
<|user|>
Given Address: Brewed & canned by:Camden Town Brewery,55-59 Wilkin Street,Mews,NW5 3NN,London,UK.<|end|>
<|assistant|>
{"extractedCountry":"England","withInUK":true}<|end|>
<|user|>
Given Address: Brewed and bottled by: Birra Peroni S.r.l., Via Birolli, 8, Roma. Asahi UK Ltd, Asahi House, 88-100 Chertsey Road, Woking, GU21 5BJ, UK.<|end|>
<|assistant|>
{"extractedCountry":"Italy","withInUK":false}<|end|>
<|user|>
Given Address: Specially manufactured for:Empire Bespoke Foods Ltd.,UK: Middlesex,UB5 6AG.ROI: Cork,T12 H1XY.<|end|>
<|assistant|>
""")

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)

Source here
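
Since the tokenizer comes from microsoft/Phi-3.5-mini-instruct, I assume the hand-written <|user|>/<|assistant|> markers could also be replaced by the tokenizer's chat template via LLM.chat. A minimal sketch of what I'd expect to work (untested here, since the engine fails before generation; the LLM.chat signature is assumed from recent vLLM docs, and the single example message is illustrative only):

import os

from vllm import LLM, SamplingParams

# Same local GGUF file as above (hypothetical sketch).
model = os.path.join(os.getcwd(), ".model", "phi-3.5-gguf",
                     "Phi-3.5-mini-instruct-Q4_K_M.gguf")

llm = LLM(model=model, device="cpu",
          tokenizer="microsoft/Phi-3.5-mini-instruct", max_model_len=8192)

# Let vLLM apply the tokenizer's chat template instead of hand-rolled tags.
messages = [
    {"role": "system",
     "content": "Extract the country from the given address and report whether it is "
                "within the United Kingdom, as a JSON object with keys "
                "'extractedCountry' (string) and 'withInUK' (boolean)."},
    {"role": "user",
     "content": "Given Address: Jubel Ltd,170 Kennington Lane,London,SE11 5DP."},
]

# Deterministic, short completions suit the structured JSON output.
params = SamplingParams(temperature=0.0, max_tokens=64)

outputs = llm.chat(messages, params)
for o in outputs:
    print(o.outputs[0].text)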


neviaumi commented 3 weeks ago

I also tested with the official Phi-3.5 model; it produces the same error.

StackTrace

(experimental-vllm-3.12) ➜  experimental-vllm git:(main) pdm run ./src/experimental_vllm/main.py                                     
WARNING 10-29 12:25:22 _custom_ops.py:19] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
INFO 10-29 12:25:22 importing.py:10] Triton not installed; certain GPU-related functions will not be available.
INFO 10-29 12:25:34 config.py:1664] Downcasting torch.float32 to torch.float16.
WARNING 10-29 12:25:43 config.py:321] gguf quantization is not fully optimized yet. The speed can be slower than non-quantized models.
WARNING 10-29 12:25:43 config.py:380] Async output processing is only supported for CUDA, TPU, XPU. Disabling it for other platforms.
INFO 10-29 12:25:43 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post1) with config: model='/Users/davidng/WebstormProjects/experimental-vllm/.model/phi-3.5-gguf/Phi-3.5-mini-instruct-Q4_K_M.gguf', speculative_config=None, tokenizer='microsoft/Phi-3.5-mini-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.GGUF, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gguf, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/Users/davidng/WebstormProjects/experimental-vllm/.model/phi-3.5-gguf/Phi-3.5-mini-instruct-Q4_K_M.gguf, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=False, mm_processor_kwargs=None)
WARNING 10-29 12:25:50 cpu_executor.py:327] float16 is not supported on CPU, casting to bfloat16.
WARNING 10-29 12:25:50 cpu_executor.py:332] CUDA graph is not supported on CPU, fallback to the eager mode.
WARNING 10-29 12:25:50 cpu_executor.py:362] Environment variable VLLM_CPU_KVCACHE_SPACE (GB) for CPU backend is not set, using 4 by default.
INFO 10-29 12:25:50 selector.py:224] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 10-29 12:25:50 selector.py:115] Using XFormers backend.
Traceback (most recent call last):
  File "/Users/davidng/WebstormProjects/experimental-vllm/./src/experimental_vllm/main.py", line 6, in <module>
    llm = LLM(model=model, device="cpu", tokenizer="microsoft/Phi-3.5-mini-instruct", max_model_len=8192)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/davidng/WebstormProjects/experimental-vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 177, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/davidng/WebstormProjects/experimental-vllm/.venv/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 573, in from_engine_args
    engine = cls(
             ^^^^
  File "/Users/davidng/WebstormProjects/experimental-vllm/.venv/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 334, in __init__
    self.model_executor = executor_class(
                          ^^^^^^^^^^^^^^^
  File "/Users/davidng/WebstormProjects/experimental-vllm/.venv/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/Users/davidng/WebstormProjects/experimental-vllm/.venv/lib/python3.12/site-packages/vllm/executor/cpu_executor.py", line 97, in _init_executor
    self.driver_worker = self._create_worker()
                         ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/davidng/WebstormProjects/experimental-vllm/.venv/lib/python3.12/site-packages/vllm/executor/cpu_executor.py", line 155, in _create_worker
    wrapper.init_worker(**kwargs)
  File "/Users/davidng/WebstormProjects/experimental-vllm/.venv/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 449, in init_worker
    self.worker = worker_class(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/davidng/WebstormProjects/experimental-vllm/.venv/lib/python3.12/site-packages/vllm/worker/cpu_worker.py", line 169, in __init__
    self.model_runner: CPUModelRunner = ModelRunnerClass(
                                        ^^^^^^^^^^^^^^^^^
  File "/Users/davidng/WebstormProjects/experimental-vllm/.venv/lib/python3.12/site-packages/vllm/worker/cpu_model_runner.py", line 421, in __init__
    self.attn_backend = get_attn_backend(
                        ^^^^^^^^^^^^^^^^^
  File "/Users/davidng/WebstormProjects/experimental-vllm/.venv/lib/python3.12/site-packages/vllm/attention/selector.py", line 116, in get_attn_backend
    from vllm.attention.backends.xformers import (  # noqa: F401
  File "/Users/davidng/WebstormProjects/experimental-vllm/.venv/lib/python3.12/site-packages/vllm/attention/backends/xformers.py", line 6, in <module>
    from xformers import ops as xops
ModuleNotFoundError: No module named 'xformers'