vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Pixtral-12B not supported on CPU #8693

Closed · joelimgu closed 1 month ago

joelimgu commented 2 months ago

Your current environment

The output of `python collect_env.py`:

```text
Collecting environment information...
WARNING 09-21 15:29:13 _custom_ops.py:18] Failed to import from vllm._C with ImportError('libcuda.so.1: cannot open shared object file: No such file or directory')
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: version 3.30.3
Libc version: glibc-2.35

Python version: 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.8.0-40-generic-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 7 5800X 8-Core Processor
CPU family: 25
Model: 33
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
Stepping: 0
Frequency boost: enabled
CPU max MHz: 4850.1948
CPU min MHz: 2200.0000
BogoMIPS: 7585.94
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap
Virtualization: AMD-V
L1d cache: 256 KiB (8 instances)
L1i cache: 256 KiB (8 instances)
L2 cache: 4 MiB (8 instances)
L3 cache: 32 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-15
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Vulnerable: Safe RET, no microcode
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.68
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.44.2
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.1.post2@9ba0817ff1eb514f51cc6de9cb8e16c98d6ee44f
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect
```

Model Input Dumps

No response

🐛 Describe the bug

I am running into an error when trying to run Pixtral-12B on CPU. Here is the sample code I am using:

```python
from vllm import LLM
from vllm.sampling_params import SamplingParams

model_name = "mistralai/Pixtral-12B-2409"

sampling_params = SamplingParams(max_tokens=8192)

llm = LLM(model=model_name, tokenizer_mode="mistral")

prompt = "Describe this image in one sentence."
image_url = "https://picsum.photos/id/237/200/300"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    },
]

outputs = llm.chat(messages, sampling_params=sampling_params)

print(outputs[0].outputs[0].text)
```

And here is the output of the program:

```text
python3 main.py
WARNING 09-21 15:31:18 _custom_ops.py:18] Failed to import from vllm._C with ImportError('libcuda.so.1: cannot open shared object file: No such file or directory')
INFO 09-21 15:31:19 config.py:1653] Downcasting torch.float32 to torch.float16.
WARNING 09-21 15:31:19 arg_utils.py:910] The model has a long context length (128000). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value.
INFO 09-21 15:31:19 llm_engine.py:223] Initializing an LLM engine (v0.6.1.post2) with config: model='mistralai/Pixtral-12B-2409', speculative_config=None, tokenizer='mistralai/Pixtral-12B-2409', skip_tokenizer_init=False, tokenizer_mode=mistral, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=128000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=mistralai/Pixtral-12B-2409, use_v2_block_manager=False, num_scheduler_steps=1, enable_prefix_caching=False, use_async_output_proc=True)
Traceback (most recent call last):
  File "/home/joel/Documents/Code/Personal/pixtral/main.py", line 8, in <module>
    llm = LLM(model=model_name, tokenizer_mode="mistral")
  File "/home/joel/Documents/Code/Personal/pixtral/.venv/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 178, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/home/joel/Documents/Code/Personal/pixtral/.venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 550, in from_engine_args
    engine = cls(
  File "/home/joel/Documents/Code/Personal/pixtral/.venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 317, in __init__
    self.model_executor = executor_class(
  File "/home/joel/Documents/Code/Personal/pixtral/.venv/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/home/joel/Documents/Code/Personal/pixtral/.venv/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 38, in _init_executor
    self.driver_worker = self._create_worker()
  File "/home/joel/Documents/Code/Personal/pixtral/.venv/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 105, in _create_worker
    return create_worker(**self._get_create_worker_kwargs(
  File "/home/joel/Documents/Code/Personal/pixtral/.venv/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 24, in create_worker
    wrapper.init_worker(**kwargs)
  File "/home/joel/Documents/Code/Personal/pixtral/.venv/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 449, in init_worker
    self.worker = worker_class(*args, **kwargs)
  File "/home/joel/Documents/Code/Personal/pixtral/.venv/lib/python3.10/site-packages/vllm/worker/worker.py", line 99, in __init__
    self.model_runner: GPUModelRunnerBase = ModelRunnerClass(
  File "/home/joel/Documents/Code/Personal/pixtral/.venv/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 960, in __init__
    self.attn_backend = get_attn_backend(
  File "/home/joel/Documents/Code/Personal/pixtral/.venv/lib/python3.10/site-packages/vllm/attention/selector.py", line 108, in get_attn_backend
    backend = which_attn_to_use(num_heads, head_size, num_kv_heads,
  File "/home/joel/Documents/Code/Personal/pixtral/.venv/lib/python3.10/site-packages/vllm/attention/selector.py", line 215, in which_attn_to_use
    if current_platform.get_device_capability()[0] < 8:
TypeError: 'NoneType' object is not subscriptable
```


DarkLight1337 commented 2 months ago

I think you need to install the ROCm version of PyTorch, e.g.:

```
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.1
```

joelimgu commented 2 months ago

Thanks for the answer. I've tried installing the ROCm version of PyTorch, but I got the same result. However, I am actually trying to run this on CPU: I have an AMD GPU, but it doesn't support ROCm.

DarkLight1337 commented 2 months ago

I see, sorry, I missed the part about wanting to run this on CPU. Let me update the title to reflect this.

DarkLight1337 commented 2 months ago

Have you followed the installation instructions for CPU shown here?

DarkLight1337 commented 2 months ago

@youkaichao how can I tell from the collect_env output whether vLLM was compiled for CPU or GPU?

joelimgu commented 2 months ago

Yes, I've tried following the CPU tutorial. I get the same problem if I run it natively. If I run it in Docker, I hit a memory leak: it takes 100 GB of RAM, but there is already a discussion about that (https://github.com/vllm-project/vllm/discussions/309).

DarkLight1337 commented 2 months ago

I think #8534 should fix the particular error you're running into by handling the case where device_capability is None; however, I am not sure whether the model can actually run on CPU. Can you try installing vLLM from source using the latest main branch?
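
For reference, the failing check in `vllm/attention/selector.py` indexes the device capability unconditionally, and on a CPU-only host `get_device_capability()` returns `None`, which is what produces the `TypeError` above. Here is a minimal, standalone sketch of the kind of None-aware guard involved (my own illustration, not the actual #8534 patch):

```python
from typing import Optional, Tuple

def get_device_capability() -> Optional[Tuple[int, int]]:
    """Stand-in for current_platform.get_device_capability();
    assumed to return None on hosts without a CUDA/ROCm device."""
    return None  # simulate a CPU-only machine

def device_is_pre_ampere() -> bool:
    capability = get_device_capability()
    # Treat an unknown capability (CPU-only or unsupported platform) as
    # "not applicable" instead of crashing on capability[0].
    if capability is None:
        return False
    return capability[0] < 8

print(device_is_pre_ampere())  # False on a CPU-only host, no TypeError
```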

youkaichao commented 2 months ago

This section shows the vLLM build flags:

```
vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
```
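
If it helps, here is a quick way to check what the installed build detects at runtime (a rough sketch, assuming the vLLM 0.6.x `vllm.platforms` layout):

```python
# Prints which platform class vLLM picked and its device capability;
# on a CPU-only host/build the capability is expected to be None,
# which matches the TypeError seen in the traceback above.
from vllm.platforms import current_platform

print(type(current_platform).__name__)
print(current_platform.get_device_capability())
```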