vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Gemma2 model not working with vLLM 0.6.0 CPU backend #8660

Open · jerin-scalers-ai opened this issue 2 months ago

jerin-scalers-ai commented 2 months ago

Your current environment

🐛 Describe the bug

vLLM v0.6.0 (CPU backend) throws the error below when loading the Gemma2 model.

Run vLLM:

docker run -p 8000:8000 -e HUGGING_FACE_HUB_TOKEN=<token>  -e VLLM_CPU_KVCACHE_SPACE=40 -v /mnt/models:/root/.cache/huggingface:rw cpu/vllm:v0.6.0 --model=google/gemma-2-2b --dtype=float32 --max-model-len=2048
INFO 09-20 08:05:38 importing.py:10] Triton not installed; certain GPU-related functions will not be available.
INFO 09-20 08:05:40 api_server.py:495] vLLM API server version 0.6.0
INFO 09-20 08:05:40 api_server.py:496] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='google/gemma-2-2b', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='float32', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=2048, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 09-20 08:05:40 config.py:1625] For Gemma 2, we downcast float32 to bfloat16 instead of float16 by default. Please specify `dtype` if you want to use float16.
INFO 09-20 08:05:40 config.py:1653] Downcasting torch.float32 to torch.bfloat16.
WARNING 09-20 08:05:40 utils.py:723] Gemma 2 uses sliding window attention for every odd layer, which is currently not supported by vLLM. Disabling sliding window and capping the max length to the sliding window size (4096).
INFO 09-20 08:05:40 api_server.py:162] Multiprocessing frontend to use ipc:///tmp/e9fcdd7b-d17a-49c2-854d-f41c9b30b9ba for RPC Path.
INFO 09-20 08:05:40 api_server.py:178] Started engine process with PID 76
INFO 09-20 08:05:42 importing.py:10] Triton not installed; certain GPU-related functions will not be available.
WARNING 09-20 08:05:44 utils.py:723] Gemma 2 uses sliding window attention for every odd layer, which is currently not supported by vLLM. Disabling sliding window and capping the max length to the sliding window size (4096).
WARNING 09-20 08:05:44 config.py:370] Async output processing is only supported for CUDA or TPU. Disabling it for other platforms.
INFO 09-20 08:05:44 llm_engine.py:232] Initializing an LLM engine (v0.6.0) with config: model='google/gemma-2-2b', speculative_config=None, tokenizer='google/gemma-2-2b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float32, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=google/gemma-2-2b, use_v2_block_manager=False, num_scheduler_steps=1, enable_prefix_caching=False, use_async_output_proc=False)
WARNING 09-20 08:05:46 cpu_executor.py:324] CUDA graph is not supported on CPU, fallback to the eager mode.
(VllmWorkerProcess pid=143) INFO 09-20 08:05:46 selector.py:183] Cannot use _Backend.FLASH_ATTN backend on CPU.
(VllmWorkerProcess pid=143) INFO 09-20 08:05:46 selector.py:128] Using Torch SDPA backend.
(VllmWorkerProcess pid=143) INFO 09-20 08:05:47 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=143) INFO 09-20 08:05:48 selector.py:183] Cannot use _Backend.FLASH_ATTN backend on CPU.
(VllmWorkerProcess pid=143) INFO 09-20 08:05:48 selector.py:128] Using Torch SDPA backend.
(VllmWorkerProcess pid=143) ERROR 09-20 08:05:48 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method load_model: Torch SPDA does not support logits soft cap., Traceback (most recent call last):
(VllmWorkerProcess pid=143) ERROR 09-20 08:05:48 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=143) ERROR 09-20 08:05:48 multiproc_worker_utils.py:226]     output = executor(*args, **kwargs)
(VllmWorkerProcess pid=143) ERROR 09-20 08:05:48 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/cpu_worker.py", line 217, in load_model
(VllmWorkerProcess pid=143) ERROR 09-20 08:05:48 multiproc_worker_utils.py:226]     self.model_runner.load_model()
(VllmWorkerProcess pid=143) ERROR 09-20 08:05:48 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/cpu_model_runner.py", line 125, in load_model
(VllmWorkerProcess pid=143) ERROR 09-20 08:05:48 multiproc_worker_utils.py:226]     self.model = get_model(model_config=self.model_config,
(VllmWorkerProcess pid=143) ERROR 09-20 08:05:48 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
(VllmWorkerProcess pid=143) ERROR 09-20 08:05:48 multiproc_worker_utils.py:226]     return loader.load_model(model_config=model_config,
(VllmWorkerProcess pid=143) ERROR 09-20 08:05:48 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 357, in load_model
(VllmWorkerProcess pid=143) ERROR 09-20 08:05:48 multiproc_worker_utils.py:226]     model = _initialize_model(model_config, self.load_config,
(VllmWorkerProcess pid=143) ERROR 09-20 08:05:48 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 171, in _initialize_model
(VllmWorkerProcess pid=143) ERROR 09-20 08:05:48 multiproc_worker_utils.py:226]     return build_model(
(VllmWorkerProcess pid=143) ERROR 09-20 08:05:48 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 156, in build_model
(VllmWorkerProcess pid=143) ERROR 09-20 08:05:48 multiproc_worker_utils.py:226]     return model_class(config=hf_config,
(VllmWorkerProcess pid=143) ERROR 09-20 08:05:48 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/gemma2.py", line 329, in __init__
(VllmWorkerProcess pid=143) ERROR 09-20 08:05:48 multiproc_worker_utils.py:226]     self.model = Gemma2Model(config, cache_config, quant_config)
(VllmWorkerProcess pid=143) ERROR 09-20 08:05:48 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/gemma2.py", line 255, in __init__
(VllmWorkerProcess pid=143) ERROR 09-20 08:05:48 multiproc_worker_utils.py:226]     self.layers = nn.ModuleList([
(VllmWorkerProcess pid=143) ERROR 09-20 08:05:48 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/gemma2.py", line 256, in <listcomp>
(VllmWorkerProcess pid=143) ERROR 09-20 08:05:48 multiproc_worker_utils.py:226]     Gemma2DecoderLayer(layer_idx, config, cache_config, quant_config)
(VllmWorkerProcess pid=143) ERROR 09-20 08:05:48 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/gemma2.py", line 181, in __init__
(VllmWorkerProcess pid=143) ERROR 09-20 08:05:48 multiproc_worker_utils.py:226]     self.self_attn = Gemma2Attention(
(VllmWorkerProcess pid=143) ERROR 09-20 08:05:48 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/gemma2.py", line 147, in __init__
(VllmWorkerProcess pid=143) ERROR 09-20 08:05:48 multiproc_worker_utils.py:226]     self.attn = Attention(self.num_heads,
(VllmWorkerProcess pid=143) ERROR 09-20 08:05:48 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/layer.py", line 84, in __init__
(VllmWorkerProcess pid=143) ERROR 09-20 08:05:48 multiproc_worker_utils.py:226]     self.impl = impl_cls(num_heads, head_size, scale, num_kv_heads,
(VllmWorkerProcess pid=143) ERROR 09-20 08:05:48 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/backends/torch_sdpa.py", line 123, in __init__
(VllmWorkerProcess pid=143) ERROR 09-20 08:05:48 multiproc_worker_utils.py:226]     raise ValueError("Torch SPDA does not support logits soft cap.")
(VllmWorkerProcess pid=143) ERROR 09-20 08:05:48 multiproc_worker_utils.py:226] ValueError: Torch SPDA does not support logits soft cap.
(VllmWorkerProcess pid=143) ERROR 09-20 08:05:48 multiproc_worker_utils.py:226] 
INFO 09-20 08:05:48 multiproc_worker_utils.py:123] Killing local vLLM worker processes
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 236, in run_rpc_server
    server = AsyncEngineRPCServer(async_engine_args, usage_context, rpc_path)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 34, in __init__
    self.engine = AsyncLLMEngine.from_engine_args(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 735, in from_engine_args
    engine = cls(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 615, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 835, in _init_engine
    return engine_class(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 262, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 324, in __init__
    self.model_executor = executor_class(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/cpu_executor.py", line 117, in _init_executor
    self._run_workers("load_model")
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/cpu_executor.py", line 184, in _run_workers
    driver_worker_output = self.driver_method_invoker(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/cpu_executor.py", line 367, in _async_driver_method_invoker
    return driver.execute_method(method, *args, **kwargs).get()
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 58, in get
    raise self.result.exception
ValueError: Torch SPDA does not support logits soft cap.
ERROR 09-20 08:05:50 api_server.py:188] RPCServer process died before responding to readiness probe
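
For reference, the failure comes from a guard in the CPU attention backend: Gemma2's config enables attention logit soft-capping, vLLM forwards it to the attention implementation as a `logits_soft_cap` value, and the Torch SDPA (CPU) backend rejects any non-None value. Below is a minimal sketch of that path, with names inferred from the traceback above; it is not vLLM's actual source.

```python
# Minimal sketch of the failure path, not vLLM's actual source. Gemma2's HF config
# sets attn_logit_softcapping (50.0 by default); vLLM passes it to the attention
# backend as logits_soft_cap, and the CPU (Torch SDPA) backend refuses it.
from typing import Optional


def init_torch_sdpa_attention(logits_soft_cap: Optional[float] = None) -> None:
    """Stand-in for the guard in vllm/attention/backends/torch_sdpa.py (line 123 above)."""
    if logits_soft_cap is not None:
        # The "SPDA" spelling matches the message emitted by vLLM in the log.
        raise ValueError("Torch SPDA does not support logits soft cap.")


try:
    init_torch_sdpa_attention(logits_soft_cap=50.0)  # value Gemma2 would supply
except ValueError as err:
    print(err)  # -> Torch SPDA does not support logits soft cap.
```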


youkaichao commented 2 months ago

Please use GPU; CPU is not supported yet.
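
For readers landing here, a hedged sketch of the suggested path: loading the same model through vLLM's offline LLM API on a CUDA GPU, where the GPU attention backends handle Gemma2's logit soft cap. This is not part of the thread; it assumes a GPU build of vLLM and Hugging Face access to the gated model, and behavior may differ across versions.

```python
# Hedged sketch (not from the thread): the same model via vLLM's offline API on a
# CUDA GPU, where the GPU attention backends support Gemma2's logit soft cap.
# Assumes vLLM is installed with GPU support and HF access to the gated model.
from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-2-2b", dtype="bfloat16", max_model_len=2048)
outputs = llm.generate(["The capital of France is"],
                       SamplingParams(temperature=0.0, max_tokens=16))
print(outputs[0].outputs[0].text)
```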

hahmad2008 commented 2 months ago

@youkaichao does Gemma2 work with an 8k context length?

Avinash-Raj commented 1 month ago

@youkaichao is this model-specific, or do embedding models not work on CPU either?

https://github.com/vllm-project/vllm/issues/9379

@jerin-scalers-ai were you able to load the Mistral embedding model (intfloat/e5-mistral-7b-instruct) on CPU?
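
For context, a hedged sketch of what that attempt would look like with vLLM's embedding API; whether the CPU backend accepts this model is exactly the open question here, so the snippet is an illustration, not a confirmation that it works.

```python
# Hedged sketch of the attempt being asked about: loading the e5-mistral embedding
# model and calling LLM.encode(). Whether this succeeds on the CPU backend is the
# open question in this thread; argument names follow vLLM's public API.
from vllm import LLM

llm = LLM(model="intfloat/e5-mistral-7b-instruct", dtype="float32")
outputs = llm.encode(["query: how do I load an embedding model on CPU?"])
print(len(outputs[0].outputs.embedding))  # embedding dimensionality
```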