vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: DeepSeek-Coder-V2-Lite-Instruct with CPU : Torch not compiled with CUDA enabled #6655

Closed · papipsycho closed this issue 3 months ago

papipsycho commented 3 months ago

Your current environment

python collect_env.py
Collecting environment information...
WARNING 07-22 17:54:45 _custom_ops.py:14] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
PyTorch version: 2.3.1+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.1
Libc version: glibc-2.35

Python version: 3.10.12 (main, Mar 22 2024, 16:50:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.5.0-1022-aws-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      46 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             4
On-line CPU(s) list:                0-3
Vendor ID:                          GenuineIntel
Model name:                         Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
CPU family:                         6
Model:                              85
Thread(s) per core:                 2
Core(s) per socket:                 2
Socket(s):                          1
Stepping:                           7
BogoMIPS:                           5000.01
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
Hypervisor vendor:                  KVM
Virtualization type:                full
L1d cache:                          64 KiB (2 instances)
L1i cache:                          64 KiB (2 instances)
L2 cache:                           2 MiB (2 instances)
L3 cache:                           35.8 MiB (1 instance)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-3
Vulnerability Gather data sampling: Unknown: Dependent on hypervisor status
Vulnerability Itlb multihit:        KVM: Mitigation: VMX unsupported
Vulnerability L1tf:                 Mitigation; PTE Inversion
Vulnerability Mds:                  Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown:             Mitigation; PTI
Vulnerability Mmio stale data:      Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed:             Vulnerable
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Vulnerable
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Retpoline
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.3.1+cpu
[pip3] torchvision==0.18.1+cpu
[pip3] transformers==4.42.4
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.2
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect

πŸ› Describe the bug

Hello,

I'm testing vLLM with DeepSeek-Coder-V2-Lite-Instruct on the CPU backend, but it tries to use CUDA:

vllm serve deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct --dtype auto --api-key token-abc123 --trust-remote-code
INFO 07-22 17:52:26 api_server.py:219] vLLM API server version 0.5.2
INFO 07-22 17:52:26 api_server.py:220] args: Namespace(model_tag='deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key='token-abc123', lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None, dispatch_function=<function serve at 0x7ca18110f1c0>)
configuration_deepseek.py: 100%|██████████████████████████████| 10.3k/10.3k [00:00<00:00, 38.5MB/s]
INFO 07-22 17:52:26 llm_engine.py:176] Initializing an LLM engine (v0.5.2) with config: model='deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct', speculative_config=None, tokenizer='deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=163840, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct, use_v2_block_manager=False, enable_prefix_caching=False)
tokenizer_config.json: 100%|██████████████████████████████| 1.28k/1.28k [00:00<00:00, 8.27MB/s]
tokenizer.json: 100%|██████████████████████████████| 4.61M/4.61M [00:00<00:00, 97.5MB/s]
generation_config.json: 100%|██████████████████████████████| 181/181 [00:00<00:00, 1.13MB/s]
WARNING 07-22 17:52:27 cpu_executor.py:136] CUDA graph is not supported on CPU, fallback to the eager mode.
WARNING 07-22 17:52:27 cpu_executor.py:163] Environment variable VLLM_CPU_KVCACHE_SPACE (GB) for CPU backend is not set, using 4 by default.
INFO 07-22 17:52:27 selector.py:117] Cannot use _Backend.FLASH_ATTN backend on CPU.
INFO 07-22 17:52:27 selector.py:66] Using Torch SDPA backend.
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/ubuntu/vllm-1/venv/bin/vllm", line 33, in <module>
[rank0]:     sys.exit(load_entry_point('vllm==0.5.2+cpu', 'console_scripts', 'vllm')())
[rank0]:   File "/home/ubuntu/vllm-1/venv/lib/python3.10/site-packages/vllm-0.5.2+cpu-py3.10-linux-x86_64.egg/vllm/scripts.py", line 148, in main
[rank0]:     args.dispatch_function(args)
[rank0]:   File "/home/ubuntu/vllm-1/venv/lib/python3.10/site-packages/vllm-0.5.2+cpu-py3.10-linux-x86_64.egg/vllm/scripts.py", line 28, in serve
[rank0]:     run_server(args)
[rank0]:   File "/home/ubuntu/vllm-1/venv/lib/python3.10/site-packages/vllm-0.5.2+cpu-py3.10-linux-x86_64.egg/vllm/entrypoints/openai/api_server.py", line 231, in run_server
[rank0]:     if llm_engine is not None else AsyncLLMEngine.from_engine_args(
[rank0]:   File "/home/ubuntu/vllm-1/venv/lib/python3.10/site-packages/vllm-0.5.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 466, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/home/ubuntu/vllm-1/venv/lib/python3.10/site-packages/vllm-0.5.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 380, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/home/ubuntu/vllm-1/venv/lib/python3.10/site-packages/vllm-0.5.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 547, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/home/ubuntu/vllm-1/venv/lib/python3.10/site-packages/vllm-0.5.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/llm_engine.py", line 251, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:   File "/home/ubuntu/vllm-1/venv/lib/python3.10/site-packages/vllm-0.5.2+cpu-py3.10-linux-x86_64.egg/vllm/executor/executor_base.py", line 47, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/home/ubuntu/vllm-1/venv/lib/python3.10/site-packages/vllm-0.5.2+cpu-py3.10-linux-x86_64.egg/vllm/executor/cpu_executor.py", line 31, in _init_executor
[rank0]:     self._init_worker()
[rank0]:   File "/home/ubuntu/vllm-1/venv/lib/python3.10/site-packages/vllm-0.5.2+cpu-py3.10-linux-x86_64.egg/vllm/executor/cpu_executor.py", line 58, in _init_worker
[rank0]:     self.driver_worker.load_model()
[rank0]:   File "/home/ubuntu/vllm-1/venv/lib/python3.10/site-packages/vllm-0.5.2+cpu-py3.10-linux-x86_64.egg/vllm/worker/cpu_worker.py", line 185, in load_model
[rank0]:     self.model_runner.load_model()
[rank0]:   File "/home/ubuntu/vllm-1/venv/lib/python3.10/site-packages/vllm-0.5.2+cpu-py3.10-linux-x86_64.egg/vllm/worker/cpu_model_runner.py", line 125, in load_model
[rank0]:     self.model = get_model(model_config=self.model_config,
[rank0]:   File "/home/ubuntu/vllm-1/venv/lib/python3.10/site-packages/vllm-0.5.2+cpu-py3.10-linux-x86_64.egg/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
[rank0]:     return loader.load_model(model_config=model_config,
[rank0]:   File "/home/ubuntu/vllm-1/venv/lib/python3.10/site-packages/vllm-0.5.2+cpu-py3.10-linux-x86_64.egg/vllm/model_executor/model_loader/loader.py", line 275, in load_model
[rank0]:     model = _initialize_model(model_config, self.load_config,
[rank0]:   File "/home/ubuntu/vllm-1/venv/lib/python3.10/site-packages/vllm-0.5.2+cpu-py3.10-linux-x86_64.egg/vllm/model_executor/model_loader/loader.py", line 111, in _initialize_model
[rank0]:     return model_class(config=model_config.hf_config,
[rank0]:   File "/home/ubuntu/vllm-1/venv/lib/python3.10/site-packages/vllm-0.5.2+cpu-py3.10-linux-x86_64.egg/vllm/model_executor/models/deepseek_v2.py", line 439, in __init__
[rank0]:     self.model = DeepseekV2Model(config, cache_config, quant_config)
[rank0]:   File "/home/ubuntu/vllm-1/venv/lib/python3.10/site-packages/vllm-0.5.2+cpu-py3.10-linux-x86_64.egg/vllm/model_executor/models/deepseek_v2.py", line 401, in __init__
[rank0]:     self.layers = nn.ModuleList([
[rank0]:   File "/home/ubuntu/vllm-1/venv/lib/python3.10/site-packages/vllm-0.5.2+cpu-py3.10-linux-x86_64.egg/vllm/model_executor/models/deepseek_v2.py", line 402, in <listcomp>
[rank0]:     DeepseekV2DecoderLayer(config,
[rank0]:   File "/home/ubuntu/vllm-1/venv/lib/python3.10/site-packages/vllm-0.5.2+cpu-py3.10-linux-x86_64.egg/vllm/model_executor/models/deepseek_v2.py", line 321, in __init__
[rank0]:     self.self_attn = DeepseekV2Attention(
[rank0]:   File "/home/ubuntu/vllm-1/venv/lib/python3.10/site-packages/vllm-0.5.2+cpu-py3.10-linux-x86_64.egg/vllm/model_executor/models/deepseek_v2.py", line 231, in __init__
[rank0]:     self.rotary_emb = get_rope(qk_rope_head_dim,
[rank0]:   File "/home/ubuntu/vllm-1/venv/lib/python3.10/site-packages/vllm-0.5.2+cpu-py3.10-linux-x86_64.egg/vllm/model_executor/layers/rotary_embedding.py", line 839, in get_rope
[rank0]:     rotary_emb = DeepseekScalingRotaryEmbedding(
[rank0]:   File "/home/ubuntu/vllm-1/venv/lib/python3.10/site-packages/vllm-0.5.2+cpu-py3.10-linux-x86_64.egg/vllm/model_executor/layers/rotary_embedding.py", line 652, in __init__
[rank0]:     super().__init__(head_size, rotary_dim, max_position_embeddings, base,
[rank0]:   File "/home/ubuntu/vllm-1/venv/lib/python3.10/site-packages/vllm-0.5.2+cpu-py3.10-linux-x86_64.egg/vllm/model_executor/layers/rotary_embedding.py", line 80, in __init__
[rank0]:     cache = self._compute_cos_sin_cache()
[rank0]:   File "/home/ubuntu/vllm-1/venv/lib/python3.10/site-packages/vllm-0.5.2+cpu-py3.10-linux-x86_64.egg/vllm/model_executor/layers/rotary_embedding.py", line 674, in _compute_cos_sin_cache
[rank0]:     inv_freq = self._compute_inv_freq(self.scaling_factor)
[rank0]:   File "/home/ubuntu/vllm-1/venv/lib/python3.10/site-packages/vllm-0.5.2+cpu-py3.10-linux-x86_64.egg/vllm/model_executor/layers/rotary_embedding.py", line 656, in _compute_inv_freq
[rank0]:     pos_freqs = self.base**(torch.arange(
[rank0]:   File "/home/ubuntu/vllm-1/venv/lib/python3.10/site-packages/torch/utils/_device.py", line 78, in __torch_function__
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/ubuntu/vllm-1/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 284, in _lazy_init
[rank0]:     raise AssertionError("Torch not compiled with CUDA enabled")
[rank0]: AssertionError: Torch not compiled with CUDA enabled
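For context, on a CPU-only PyTorch build (2.3.1+cpu here), any operation that explicitly targets the CUDA device triggers torch.cuda._lazy_init(), which raises exactly this AssertionError. A minimal reproduction sketch, assuming only a CPU-only torch install:

import torch

# On a CPU-only build (e.g. 2.3.1+cpu), requesting a CUDA tensor fails during
# lazy CUDA initialization with "Torch not compiled with CUDA enabled".
try:
    torch.arange(0, 64, 2, dtype=torch.float32, device="cuda")
except AssertionError as e:
    print(e)  # -> Torch not compiled with CUDA enabled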
papipsycho commented 3 months ago

So I did some research, and the issue seems to come from these two lines (see the sketch after the links):

https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/rotary_embedding.py#L657

and

https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/rotary_embedding.py#L676
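The traceback above points at DeepseekScalingRotaryEmbedding._compute_inv_freq. A sketch of the suspected pattern at those lines, assuming the device="cuda" argument is hard-coded there (which is what would force CUDA initialization on the CPU backend), plus a device-agnostic variant:

# Suspected pattern (sketch, not the exact source): the hard-coded device="cuda"
# forces CUDA initialization even when vLLM runs with the CPU backend.
pos_freqs = self.base**(torch.arange(
    0, self.rotary_dim, 2, dtype=torch.float32, device="cuda") / self.rotary_dim)

# A device-agnostic alternative drops the explicit device (or derives it from the
# runtime), so the rotary cache tensors are created on whatever device is in use:
pos_freqs = self.base**(torch.arange(
    0, self.rotary_dim, 2, dtype=torch.float32) / self.rotary_dim)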

bigPYJ1151 commented 3 months ago

Hi @papipsycho, unfortunately MoE-related models are not supported in the CPU backend for now.

papipsycho commented 3 months ago

Yes, I just realized that. I was finally able to launch it, and I got this message: The CPU backend currently does not support MoE.

@bigPYJ1151