vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: Running Phi3.5 on Intel x86 MacBook Pro? #9795

Open neviaumi opened 3 weeks ago

neviaumi commented 3 weeks ago

Your current environment

Collecting environment information...
WARNING 10-29 12:20:54 _custom_ops.py:19] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
INFO 10-29 12:20:54 importing.py:10] Triton not installed; certain GPU-related functions will not be available.
PyTorch version: 2.2.2
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 12.7.6 (x86_64)
GCC version: Could not collect
Clang version: 12.0.0 (clang-1200.0.26.2)
CMake version: Could not collect
Libc version: N/A

Python version: 3.12.4 (main, Oct 26 2024, 19:58:57) [Clang 12.0.0 (clang-1200.0.26.2)] (64-bit runtime)
Python platform: macOS-12.7.6-x86_64-i386-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz

Versions of relevant libraries:
[pdm] numpy==1.26.4
[pdm] nvidia-cublas-cu12==12.1.3.1; platform_system == "Linux" and platform_machine == "x86_64"
[pdm] nvidia-cuda-cupti-cu12==12.1.105; platform_system == "Linux" and platform_machine == "x86_64"
[pdm] nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == "Linux" and platform_machine == "x86_64"
[pdm] nvidia-cuda-runtime-cu12==12.1.105; platform_system == "Linux" and platform_machine == "x86_64"
[pdm] nvidia-cudnn-cu12==8.9.2.26; platform_system == "Linux" and platform_machine == "x86_64"
[pdm] nvidia-cufft-cu12==11.0.2.54; platform_system == "Linux" and platform_machine == "x86_64"
[pdm] nvidia-curand-cu12==10.3.2.106; platform_system == "Linux" and platform_machine == "x86_64"
[pdm] nvidia-cusolver-cu12==11.4.5.107; platform_system == "Linux" and platform_machine == "x86_64"
[pdm] nvidia-cusparse-cu12==12.1.0.106; platform_system == "Linux" and platform_machine == "x86_64"
[pdm] nvidia-ml-py==12.560.30
[pdm] nvidia-nccl-cu12==2.19.3; platform_system == "Linux" and platform_machine == "x86_64"
[pdm] nvidia-nvjitlink-cu12==12.6.77; platform_system == "Linux" and platform_machine == "x86_64"
[pdm] nvidia-nvtx-cu12==12.1.105; platform_system == "Linux" and platform_machine == "x86_64"
[pdm] pyzmq==26.2.0
[pdm] torch==2.2.2
[pdm] torchvision==0.17.2
[pdm] transformers==4.46.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.3.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect

How would you like to use vllm

I want to run inference with a [Phi-3.5 GGUF](https://huggingface.co/bartowski/Phi-3.5-mini-instruct-GGUF), but I don't know how to integrate it with vLLM.

Here is my attempt:

import os

from vllm import LLM

model = os.path.join(os.getcwd(), ".model", "phi-3.5-gguf", "Phi-3.5-mini-instruct-Q4_K_M.gguf")

llm = LLM(model=model, device="cpu", tokenizer="microsoft/Phi-3.5-mini-instruct", max_model_len=8192)

outputs = llm.generate("""<|system|>
You are AI system that able to extract country from address and understand the boundaries of Country.<|end|>
<|user|>
Extract country from given address and report do the country extracted within United Kingdom? Generated response in JSON Object format with 2 key, 'extractedCountry' (string) and 'withInUK' (boolean) Given Address: Heineken UK Limited,3-4 Broadway Park,Edinburgh,EH12 9JZ.
<|end|>
<|assistant|>
{"extractedCountry":"Scotland","withInUK":true}<|end|>
<|user|>
Given Address: Brewed at:Sharp's Brewery Ltd.,Rock,Cornwall,PL27 6NU,UK.MCBC (Ireland) DAC,Block J1 Unit Centre,Maynooth Business Campus,Straffan Road,Republic of Ireland.<|end|>
<|assistant|>
{"extractedCountry":"Republic of Ireland","withInUK":false}<|end|>
<|user|>
Given Address: Brewed and Canned by:Birra Peroni S.r.l.,Via Birolli,8 - Roma,Italy.For:Asahi UK Ltd,Asahi House,88-100 Chertsey Road,Woking,GU21 5BJ,UK.<|end|>
<|assistant|>
{"extractedCountry":"Italy","withInUK":false}<|end|>
<|user|>
Given Address: Brewed & canned by:Camden Town Brewery,55-59 Wilkin Street,Mews,NW5 3NN,London,UK.<|end|>
<|assistant|>
{"extractedCountry":"England","withInUK":true}<|end|>
<|user|>
Given Address: Jubel Ltd,170 Kennington Lane,London,SE11 5DP.<|end|>
<|assistant|>
{"extractedCountry":"England","withInUK":true}<|end|>
<|user|>
Given Address: Brewed by:Heineken UK Limited,3-4 Broadway Park,Edinburgh,EH12 9JZ.HBBV.,Tweede Weteringplantsoen 21,1017 ZD Amsterdam,NL.<|end|>
<|assistant|>
{"extractedCountry":"Netherlands","withInUK":false}<|end|>
<|user|>
Given Address: Brewed & canned by:Camden Town Brewery,55-59 Wilkin Street,Mews,NW5 3NN,London,UK.<|end|>
<|assistant|>
{"extractedCountry":"England","withInUK":true}<|end|>
<|user|>
Given Address: Brewed and bottled by: Birra Peroni S.r.l., Via Birolli, 8, Roma. Asahi UK Ltd, Asahi House, 88-100 Chertsey Road, Woking, GU21 5BJ, UK.<|end|>
<|assistant|>
{"extractedCountry":"Italy","withInUK":false}<|end|>
<|user|>
Given Address: Specially manufactured for:Empire Bespoke Foods Ltd.,UK: Middlesex,UB5 6AG.ROI: Cork,T12 H1XY.<|end|>
<|assistant|>
""")

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)

Source here
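
Since the tokenizer comes from microsoft/Phi-3.5-mini-instruct, I assume the hand-written <|user|>/<|assistant|> markers could also be replaced by the tokenizer's chat template via LLM.chat. A minimal sketch of what I'd expect to work (untested here, since the engine fails before generation; the LLM.chat signature is assumed from recent vLLM docs, and the single example message is illustrative only):

import os

from vllm import LLM, SamplingParams

# Same local GGUF file as above (hypothetical sketch).
model = os.path.join(os.getcwd(), ".model", "phi-3.5-gguf",
                     "Phi-3.5-mini-instruct-Q4_K_M.gguf")

llm = LLM(model=model, device="cpu",
          tokenizer="microsoft/Phi-3.5-mini-instruct", max_model_len=8192)

# Let vLLM apply the tokenizer's chat template instead of hand-rolled tags.
messages = [
    {"role": "system",
     "content": "Extract the country from the given address and report whether it is "
                "within the United Kingdom, as a JSON object with keys "
                "'extractedCountry' (string) and 'withInUK' (boolean)."},
    {"role": "user",
     "content": "Given Address: Jubel Ltd,170 Kennington Lane,London,SE11 5DP."},
]

# Deterministic, short completions suit the structured JSON output.
params = SamplingParams(temperature=0.0, max_tokens=64)

outputs = llm.chat(messages, params)
for o in outputs:
    print(o.outputs[0].text)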


neviaumi commented 3 weeks ago

I also tested with the official Phi-3.5 model; it produces the same error.

StackTrace

(experimental-vllm-3.12) ➜  experimental-vllm git:(main) pdm run ./src/experimental_vllm/main.py                                     
WARNING 10-29 12:25:22 _custom_ops.py:19] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
INFO 10-29 12:25:22 importing.py:10] Triton not installed; certain GPU-related functions will not be available.
INFO 10-29 12:25:34 config.py:1664] Downcasting torch.float32 to torch.float16.
WARNING 10-29 12:25:43 config.py:321] gguf quantization is not fully optimized yet. The speed can be slower than non-quantized models.
WARNING 10-29 12:25:43 config.py:380] Async output processing is only supported for CUDA, TPU, XPU. Disabling it for other platforms.
INFO 10-29 12:25:43 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post1) with config: model='/Users/davidng/WebstormProjects/experimental-vllm/.model/phi-3.5-gguf/Phi-3.5-mini-instruct-Q4_K_M.gguf', speculative_config=None, tokenizer='microsoft/Phi-3.5-mini-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.GGUF, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gguf, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/Users/davidng/WebstormProjects/experimental-vllm/.model/phi-3.5-gguf/Phi-3.5-mini-instruct-Q4_K_M.gguf, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=False, mm_processor_kwargs=None)
WARNING 10-29 12:25:50 cpu_executor.py:327] float16 is not supported on CPU, casting to bfloat16.
WARNING 10-29 12:25:50 cpu_executor.py:332] CUDA graph is not supported on CPU, fallback to the eager mode.
WARNING 10-29 12:25:50 cpu_executor.py:362] Environment variable VLLM_CPU_KVCACHE_SPACE (GB) for CPU backend is not set, using 4 by default.
INFO 10-29 12:25:50 selector.py:224] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 10-29 12:25:50 selector.py:115] Using XFormers backend.
Traceback (most recent call last):
  File "/Users/davidng/WebstormProjects/experimental-vllm/./src/experimental_vllm/main.py", line 6, in <module>
    llm = LLM(model=model, device="cpu", tokenizer="microsoft/Phi-3.5-mini-instruct", max_model_len=8192)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/davidng/WebstormProjects/experimental-vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 177, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/davidng/WebstormProjects/experimental-vllm/.venv/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 573, in from_engine_args
    engine = cls(
             ^^^^
  File "/Users/davidng/WebstormProjects/experimental-vllm/.venv/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 334, in __init__
    self.model_executor = executor_class(
                          ^^^^^^^^^^^^^^^
  File "/Users/davidng/WebstormProjects/experimental-vllm/.venv/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/Users/davidng/WebstormProjects/experimental-vllm/.venv/lib/python3.12/site-packages/vllm/executor/cpu_executor.py", line 97, in _init_executor
    self.driver_worker = self._create_worker()
                         ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/davidng/WebstormProjects/experimental-vllm/.venv/lib/python3.12/site-packages/vllm/executor/cpu_executor.py", line 155, in _create_worker
    wrapper.init_worker(**kwargs)
  File "/Users/davidng/WebstormProjects/experimental-vllm/.venv/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 449, in init_worker
    self.worker = worker_class(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/davidng/WebstormProjects/experimental-vllm/.venv/lib/python3.12/site-packages/vllm/worker/cpu_worker.py", line 169, in __init__
    self.model_runner: CPUModelRunner = ModelRunnerClass(
                                        ^^^^^^^^^^^^^^^^^
  File "/Users/davidng/WebstormProjects/experimental-vllm/.venv/lib/python3.12/site-packages/vllm/worker/cpu_model_runner.py", line 421, in __init__
    self.attn_backend = get_attn_backend(
                        ^^^^^^^^^^^^^^^^^
  File "/Users/davidng/WebstormProjects/experimental-vllm/.venv/lib/python3.12/site-packages/vllm/attention/selector.py", line 116, in get_attn_backend
    from vllm.attention.backends.xformers import (  # noqa: F401
  File "/Users/davidng/WebstormProjects/experimental-vllm/.venv/lib/python3.12/site-packages/vllm/attention/backends/xformers.py", line 6, in <module>
    from xformers import ops as xops
ModuleNotFoundError: No module named 'xformers'