vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: vLLM v0.6.0 (CPU) server failed to start on setting VLLM_CPU_OMP_THREADS_BIND #8330

Closed. jerin-scalers-ai closed this issue 1 month ago

jerin-scalers-ai commented 1 month ago

Your current environment

vLLM version: v0.6.0 (CPU)
CPU: AMD EPYC 9654

πŸ› Describe the bug

The vLLM v0.6.0 (CPU) server failed to start when VLLM_CPU_OMP_THREADS_BIND was set, as shown below:

docker run --name vllm -p 8000:8000 -e HUGGING_FACE_HUB_TOKEN=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 -e VLLM_CPU_KVCACHE_SPACE=40 -e VLLM_CPU_OMP_THREADS_BIND=0-29 -v /mnt/models:/root/.cache/huggingface:rw 121701826775.dkr.ecr.us-east-1.amazonaws.com/cpu/vllm:v0.6.0 --model=microsoft/Phi-3.5-mini-instruct --dtype=bfloat16 --max-model-len=2048

vLLM Error Log:

INFO 09-10 08:18:53 importing.py:10] Triton not installed; certain GPU-related functions will not be available.
INFO 09-10 08:18:55 api_server.py:495] vLLM API server version 0.6.0
INFO 09-10 08:18:55 api_server.py:496] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='microsoft/Phi-3.5-mini-instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='bfloat16', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=2048, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 09-10 08:18:55 api_server.py:162] Multiprocessing frontend to use ipc:///tmp/46e65db1-2925-4568-b0b7-bd670b65e0f6 for RPC Path.
INFO 09-10 08:18:55 api_server.py:178] Started engine process with PID 75
INFO 09-10 08:18:57 importing.py:10] Triton not installed; certain GPU-related functions will not be available.
WARNING 09-10 08:18:58 config.py:370] Async output processing is only supported for CUDA or TPU. Disabling it for other platforms.
INFO 09-10 08:18:58 llm_engine.py:232] Initializing an LLM engine (v0.6.0) with config: model='microsoft/Phi-3.5-mini-instruct', speculative_config=None, tokenizer='microsoft/Phi-3.5-mini-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=microsoft/Phi-3.5-mini-instruct, use_v2_block_manager=False, num_scheduler_steps=1, enable_prefix_caching=False, use_async_output_proc=False)
WARNING 09-10 08:18:59 cpu_executor.py:324] CUDA graph is not supported on CPU, fallback to the eager mode.
(VllmWorkerProcess pid=141) INFO 09-10 08:18:59 selector.py:183] Cannot use _Backend.FLASH_ATTN backend on CPU.
(VllmWorkerProcess pid=141) INFO 09-10 08:18:59 selector.py:128] Using Torch SDPA backend.
(VllmWorkerProcess pid=141) INFO 09-10 08:19:00 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
get_mempolicy: Operation not permitted
(VllmWorkerProcess pid=141) ERROR 09-10 08:19:00 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method init_device: numa_migrate_pages failed. errno: 1, Traceback (most recent call last):
(VllmWorkerProcess pid=141) ERROR 09-10 08:19:00 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=141) ERROR 09-10 08:19:00 multiproc_worker_utils.py:226]     output = executor(*args, **kwargs)
(VllmWorkerProcess pid=141) ERROR 09-10 08:19:00 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/cpu_worker.py", line 210, in init_device
(VllmWorkerProcess pid=141) ERROR 09-10 08:19:00 multiproc_worker_utils.py:226]     torch.ops._C_utils.init_cpu_threads_env(self.local_omp_cpuid)
(VllmWorkerProcess pid=141) ERROR 09-10 08:19:00 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1061, in __call__
(VllmWorkerProcess pid=141) ERROR 09-10 08:19:00 multiproc_worker_utils.py:226]     return self_._op(*args, **(kwargs or {}))
(VllmWorkerProcess pid=141) ERROR 09-10 08:19:00 multiproc_worker_utils.py:226] RuntimeError: numa_migrate_pages failed. errno: 1
(VllmWorkerProcess pid=141) ERROR 09-10 08:19:00 multiproc_worker_utils.py:226] 
INFO 09-10 08:19:00 multiproc_worker_utils.py:123] Killing local vLLM worker processes
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 236, in run_rpc_server
    server = AsyncEngineRPCServer(async_engine_args, usage_context, rpc_path)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 34, in __init__
    self.engine = AsyncLLMEngine.from_engine_args(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 735, in from_engine_args
    engine = cls(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 615, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 835, in _init_engine
    return engine_class(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 262, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 324, in __init__
    self.model_executor = executor_class(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/cpu_executor.py", line 116, in _init_executor
    self._run_workers("init_device")
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/cpu_executor.py", line 184, in _run_workers
    driver_worker_output = self.driver_method_invoker(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/cpu_executor.py", line 367, in _async_driver_method_invoker
    return driver.execute_method(method, *args, **kwargs).get()
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 58, in get
    raise self.result.exception
RuntimeError: numa_migrate_pages failed. errno: 1
ERROR 09-10 08:19:05 api_server.py:188] RPCServer process died before responding to readiness probe


bigPYJ1151 commented 1 month ago

Hi @jerin-scalers-ai, please set --privileged=true when starting the container.
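For example, the command from the report with the flag added would look like the sketch below (the Hugging Face token is replaced with a placeholder here; everything else is unchanged):

docker run --privileged=true --name vllm -p 8000:8000 -e HUGGING_FACE_HUB_TOKEN=<HF_TOKEN> -e VLLM_CPU_KVCACHE_SPACE=40 -e VLLM_CPU_OMP_THREADS_BIND=0-29 -v /mnt/models:/root/.cache/huggingface:rw 121701826775.dkr.ecr.us-east-1.amazonaws.com/cpu/vllm:v0.6.0 --model=microsoft/Phi-3.5-mini-instruct --dtype=bfloat16 --max-model-len=2048

The extra privileges allow the NUMA-related calls (get_mempolicy / numa_migrate_pages) that the CPU worker makes in init_device when binding OpenMP threads, which appears to be what fails with "numa_migrate_pages failed. errno: 1" in the log above.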

jerin-scalers-ai commented 1 month ago

Thanks. It worked after setting --privileged=true. I didn't face this issue on a slightly older version of vLLM (0.5.3.post1), where VLLM_CPU_OMP_THREADS_BIND worked out of the box without --privileged=true. It would be great if you could include this instruction in the documentation.