vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: vLLM CPU mode broken: Unable to get JIT kernel for brgemm #10478

Open samos123 opened 7 hours ago

samos123 commented 7 hours ago

Your current environment

The output of `python collect_env.py`:

```text
Collecting environment information...
PyTorch version: 2.5.1+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0
Clang version: Could not collect
CMake version: version 3.31.0
Libc version: glibc-2.35
Python version: 3.10.12 (main, Nov 6 2024, 20:22:13) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.8.0-1018-gcp-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7B12
CPU family: 23
Model: 49
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
Stepping: 0
BogoMIPS: 4499.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr arat npt nrip_save umip rdpid
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 128 KiB (4 instances)
L1i cache: 128 KiB (4 instances)
L2 cache: 2 MiB (4 instances)
L3 cache: 16 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-7
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow: Mitigation; Safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] intel_extension_for_pytorch==2.5.0
[pip3] numpy==1.26.4
[pip3] pyzmq==26.2.0
[pip3] torch==2.5.1+cpu
[pip3] torchvision==0.20.1+cpu
[pip3] transformers==4.46.3
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.4.post2.dev22+g47826cac
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect

VLLM_WORKER_MULTIPROC_METHOD=spawn
VLLM_LOGGING_LEVEL=DEBUG
LD_LIBRARY_PATH=/usr/local/lib/python3.10/dist-packages/cv2/../../lib64:
```

Model Input Dumps

No response

🐛 Describe the bug

It seems oneDNN, or at least part of it, is missing from the latest Dockerfile.cpu build.

This worked fine in v0.6.3. I suspect this PR, since it changed how oneDNN is included: https://github.com/vllm-project/vllm/pull/9344

Steps to reproduce:

  1. Clone the vLLM repo at the latest main branch or v0.6.4.post1.

  2. Build the CPU Docker image: docker build -t test-cpu -f Dockerfile.cpu .

  3. Run the openai server:

    docker run -d --name vllm -p 8000:8000 -e VLLM_LOGGING_LEVEL=DEBUG -e ONEDNN_VERBOSE=all -e VLLM_WORKER_MULTIPROC_METHOD=spawn test-cpu --model facebook/opt-125m --disable-frontend-multiprocessing
  4. Wait for the server to be ready, then send a simple prompt (see the Python sketch after these steps):

    curl -v --fail-with-body --show-error http://localhost:8000/v1/completions   -H "Content-Type: application/json"   -d '{
    "model": "facebook/opt-125m",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
    }'
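
The readiness wait and the request in step 4 can also be scripted. Below is a minimal Python sketch (assuming the container from step 3 is reachable at localhost:8000 and the `requests` package is installed) that polls /health and then sends the same completion payload as the curl command:

```python
import time

import requests

BASE_URL = "http://localhost:8000"  # assumes the port mapping from step 3

# Poll /health until the server reports ready (give up after ~5 minutes).
for _ in range(60):
    try:
        if requests.get(f"{BASE_URL}/health", timeout=2).status_code == 200:
            break
    except requests.exceptions.ConnectionError:
        pass
    time.sleep(5)
else:
    raise RuntimeError("vLLM server never became healthy")

# Same payload as the curl command above; this request triggers the worker crash.
resp = requests.post(
    f"{BASE_URL}/v1/completions",
    json={
        "model": "facebook/opt-125m",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0,
    },
)
print(resp.status_code)
print(resp.text)
```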

Full logs:

INFO 11-20 06:32:30 importing.py:15] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 11-20 06:32:31 api_server.py:592] vLLM API server version 0.6.4.post2.dev22+g47826cac
INFO 11-20 06:32:31 api_server.py:593] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=True, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='facebook/opt-125m', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
INFO 11-20 06:32:31 __init__.py:31] No plugins found.
WARNING 11-20 06:32:37 _logger.py:72] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
WARNING 11-20 06:32:37 _logger.py:72] Async output processing is only supported for CUDA, TPU, XPU and HPU.Disabling it for other platforms.
WARNING 11-20 06:32:37 _logger.py:72] CUDA graph is not supported on CPU, fallback to the eager mode.
WARNING 11-20 06:32:37 _logger.py:72] Environment variable VLLM_CPU_KVCACHE_SPACE (GB) for CPU backend is not set, using 4 by default.
INFO 11-20 06:32:37 llm_engine.py:249] Initializing an LLM engine (v0.6.4.post2.dev22+g47826cac) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=facebook/opt-125m, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=False, mm_processor_kwargs=None, pooler_config=None)
INFO 11-20 06:32:42 importing.py:15] Triton not installed or not compatible; certain GPU-related functions will not be available.
(VllmWorkerProcess pid=29) INFO 11-20 06:32:43 __init__.py:31] No plugins found.
(VllmWorkerProcess pid=29) INFO 11-20 06:32:43 selector.py:221] Cannot use _Backend.FLASH_ATTN backend on CPU.
(VllmWorkerProcess pid=29) INFO 11-20 06:32:43 selector.py:156] Using Torch SDPA backend.
(VllmWorkerProcess pid=29) INFO 11-20 06:32:43 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=29) DEBUG 11-20 06:32:43 parallel_state.py:983] world_size=1 rank=0 local_rank=-1 distributed_init_method=tcp://127.0.0.1:35451 backend=gloo
(VllmWorkerProcess pid=29) DEBUG 11-20 06:32:43 decorators.py:84] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.opt.OPTModel'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
(VllmWorkerProcess pid=29) INFO 11-20 06:32:43 weight_utils.py:243] Using model weights format ['*.bin']
Loading pt checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
(VllmWorkerProcess pid=29) /usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/weight_utils.py:425: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
(VllmWorkerProcess pid=29)   state = torch.load(bin_file, map_location="cpu")
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  4.62it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  4.61it/s]
(VllmWorkerProcess pid=29)
INFO 11-20 06:32:45 cpu_executor.py:195] # CPU blocks: 7281
DEBUG 11-20 06:32:45 decorators.py:84] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.opt.OPTModel'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
INFO 11-20 06:32:45 api_server.py:534] Using supplied chat template:
INFO 11-20 06:32:45 api_server.py:534] None
INFO 11-20 06:32:45 launcher.py:19] Available routes are:
INFO 11-20 06:32:45 launcher.py:27] Route: /openapi.json, Methods: HEAD, GET
INFO 11-20 06:32:45 launcher.py:27] Route: /docs, Methods: HEAD, GET
INFO 11-20 06:32:45 launcher.py:27] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 11-20 06:32:45 launcher.py:27] Route: /redoc, Methods: HEAD, GET
INFO 11-20 06:32:45 launcher.py:27] Route: /health, Methods: GET
INFO 11-20 06:32:45 launcher.py:27] Route: /tokenize, Methods: POST
INFO 11-20 06:32:45 launcher.py:27] Route: /detokenize, Methods: POST
INFO 11-20 06:32:45 launcher.py:27] Route: /v1/models, Methods: GET
INFO 11-20 06:32:45 launcher.py:27] Route: /version, Methods: GET
INFO 11-20 06:32:45 launcher.py:27] Route: /v1/chat/completions, Methods: POST
INFO 11-20 06:32:45 launcher.py:27] Route: /v1/completions, Methods: POST
INFO 11-20 06:32:45 launcher.py:27] Route: /v1/embeddings, Methods: POST
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO 11-20 06:32:55 metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 11-20 06:33:05 metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 11-20 06:33:15 metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 11-20 06:33:18 logger.py:37] Received request cmpl-3706abf5f25b4d21a79004dd76c22c0f-0: prompt: 'San Francisco is a', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=7, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: [2, 16033, 2659, 16, 10], lora_request: None, prompt_adapter_request: None.
INFO 11-20 06:33:18 async_llm_engine.py:208] Added request cmpl-3706abf5f25b4d21a79004dd76c22c0f-0.
DEBUG 11-20 06:33:18 async_llm_engine.py:836] Waiting for new requests...
DEBUG 11-20 06:33:18 async_llm_engine.py:855] Got new requests!
ERROR 11-20 06:33:19 multiproc_worker_utils.py:116] Worker VllmWorkerProcess pid 29 died, exit code: 1
INFO 11-20 06:33:19 multiproc_worker_utils.py:120] Killing local vLLM worker processes
ERROR 11-20 06:33:19 async_llm_engine.py:65] Engine background task failed
ERROR 11-20 06:33:19 async_llm_engine.py:65] Traceback (most recent call last):
ERROR 11-20 06:33:19 async_llm_engine.py:65]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 55, in _log_task_completion
ERROR 11-20 06:33:19 async_llm_engine.py:65]     return_value = task.result()
ERROR 11-20 06:33:19 async_llm_engine.py:65]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 872, in run_engine_loop
ERROR 11-20 06:33:19 async_llm_engine.py:65]     result = task.result()
ERROR 11-20 06:33:19 async_llm_engine.py:65]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 795, in engine_step
ERROR 11-20 06:33:19 async_llm_engine.py:65]     request_outputs = await self.engine.step_async(virtual_engine)
ERROR 11-20 06:33:19 async_llm_engine.py:65]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 347, in step_async
ERROR 11-20 06:33:19 async_llm_engine.py:65]     outputs = await self.model_executor.execute_model_async(
ERROR 11-20 06:33:19 async_llm_engine.py:65]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/cpu_executor.py", line 298, in execute_model_async
ERROR 11-20 06:33:19 async_llm_engine.py:65]     output = await make_async(self.execute_model
ERROR 11-20 06:33:19 async_llm_engine.py:65]   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
ERROR 11-20 06:33:19 async_llm_engine.py:65]     result = self.fn(*self.args, **self.kwargs)
ERROR 11-20 06:33:19 async_llm_engine.py:65]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/cpu_executor.py", line 210, in execute_model
ERROR 11-20 06:33:19 async_llm_engine.py:65]     output = self.driver_method_invoker(self.driver_worker,
ERROR 11-20 06:33:19 async_llm_engine.py:65]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/cpu_executor.py", line 311, in _async_driver_method_invoker
ERROR 11-20 06:33:19 async_llm_engine.py:65]     return driver.execute_method(method, *args, **kwargs).get()
ERROR 11-20 06:33:19 async_llm_engine.py:65]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 54, in get
ERROR 11-20 06:33:19 async_llm_engine.py:65]     raise self.result.exception
ERROR 11-20 06:33:19 async_llm_engine.py:65] ChildProcessError: worker died
Unable to get JIT kernel for brgemm. Params: M=5, N=5, K=64, str_a=1, str_b=1, brgemm_type=1, beta=0, a_trans=0, unroll_hint=1, lda=2304, ldb=5, ldc=5, config=0, b_vnni=0
2024-11-20 06:33:19,225 - __init__.py - asyncio - ERROR - Exception in callback functools.partial(<function _log_task_completion at 0x7646b03997e0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7646ae8d3310>>)
handle: <Handle functools.partial(<function _log_task_completion at 0x7646b03997e0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7646ae8d3310>>)>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 55, in _log_task_completion
    return_value = task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 872, in run_engine_loop
    result = task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 795, in engine_step
    request_outputs = await self.engine.step_async(virtual_engine)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 347, in step_async
    outputs = await self.model_executor.execute_model_async(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/cpu_executor.py", line 298, in execute_model_async
    output = await make_async(self.execute_model
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/cpu_executor.py", line 210, in execute_model
    output = self.driver_method_invoker(self.driver_worker,
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/cpu_executor.py", line 311, in _async_driver_method_invoker
    return driver.execute_method(method, *args, **kwargs).get()
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 54, in get
    raise self.result.exception
ChildProcessError: worker died

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 67, in _log_task_completion
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO:     172.17.0.1:37946 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
  + Exception Group Traceback (most recent call last):
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/_utils.py", line 76, in collapse_excgroups
  |     yield
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/base.py", line 186, in __call__
  |     async with anyio.create_task_group() as task_group:
  |   File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 763, in __aexit__
  |     raise BaseExceptionGroup(
  | exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
    |     result = await app(  # type: ignore[func-returns-value]
    |   File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
    |     return await self.app(scope, receive, send)
    |   File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    |     await super().__call__(scope, receive, send)
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 113, in __call__
    |     await self.middleware_stack(scope, receive, send)
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 187, in __call__
    |     raise exc
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 165, in __call__
    |     await self.app(scope, receive, _send)
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/base.py", line 185, in __call__
    |     with collapse_excgroups():
    |   File "/usr/lib/python3.10/contextlib.py", line 153, in __exit__
    |     self.gen.throw(typ, value, traceback)
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/_utils.py", line 82, in collapse_excgroups
    |     raise exc
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/base.py", line 187, in __call__
    |     response = await self.dispatch_func(request, call_next)
    |   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 491, in add_request_id
    |     response = await call_next(request)
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/base.py", line 163, in call_next
    |     raise app_exc
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/base.py", line 149, in coro
    |     await self.app(scope, receive_or_disconnect, send_no_error)
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    |     await self.app(scope, receive, send)
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
    |     await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    |     raise exc
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    |     await app(scope, receive, sender)
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 715, in __call__
    |     await self.middleware_stack(scope, receive, send)
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 735, in app
    |     await route.handle(scope, receive, send)
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 288, in handle
    |     await self.app(scope, receive, send)
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 76, in app
    |     await wrap_app_handling_exceptions(app, request)(scope, receive, send)
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    |     raise exc
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    |     await app(scope, receive, sender)
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 73, in app
    |     response = await f(request)
    |   File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 301, in app
    |     raw_response = await run_endpoint_function(
    |   File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 212, in run_endpoint_function
    |     return await dependant.call(**values)
    |   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 367, in create_completion
    |     generator = await handler.create_completion(request, raw_request)
    |   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_completion.py", line 189, in create_completion
    |     async for i, res in result_generator:
    |   File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 444, in merge_async_iterators
    |     item = await d
    |   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 1051, in generate
    |     async for output in await self.add_request(
    |   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 113, in generator
    |     raise result
    |   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 55, in _log_task_completion
    |     return_value = task.result()
    |   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 872, in run_engine_loop
    |     result = task.result()
    |   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 795, in engine_step
    |     request_outputs = await self.engine.step_async(virtual_engine)
    |   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 347, in step_async
    |     outputs = await self.model_executor.execute_model_async(
    |   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/cpu_executor.py", line 298, in execute_model_async
    |     output = await make_async(self.execute_model
    |   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    |     result = self.fn(*self.args, **self.kwargs)
    |   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/cpu_executor.py", line 210, in execute_model
    |     output = self.driver_method_invoker(self.driver_worker,
    |   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/cpu_executor.py", line 311, in _async_driver_method_invoker
    |     return driver.execute_method(method, *args, **kwargs).get()
    |   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 54, in get
    |     raise self.result.exception
    | ChildProcessError: worker died
    +------------------------------------

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 113, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 187, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 165, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/base.py", line 185, in __call__
    with collapse_excgroups():
  File "/usr/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_utils.py", line 82, in collapse_excgroups
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/base.py", line 187, in __call__
    response = await self.dispatch_func(request, call_next)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 491, in add_request_id
    response = await call_next(request)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/base.py", line 163, in call_next
    raise app_exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/base.py", line 149, in coro
    await self.app(scope, receive_or_disconnect, send_no_error)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 715, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 735, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 288, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 76, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 73, in app
    response = await f(request)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 301, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 212, in run_endpoint_function
    return await dependant.call(**values)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 367, in create_completion
    generator = await handler.create_completion(request, raw_request)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_completion.py", line 189, in create_completion
    async for i, res in result_generator:
  File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 444, in merge_async_iterators
    item = await d
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 1051, in generate
    async for output in await self.add_request(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 113, in generator
    raise result
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 55, in _log_task_completion
    return_value = task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 872, in run_engine_loop
    result = task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 795, in engine_step
    request_outputs = await self.engine.step_async(virtual_engine)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 347, in step_async
    outputs = await self.model_executor.execute_model_async(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/cpu_executor.py", line 298, in execute_model_async
    output = await make_async(self.execute_model
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/cpu_executor.py", line 210, in execute_model
    output = self.driver_method_invoker(self.driver_worker,
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/cpu_executor.py", line 311, in _async_driver_method_invoker
    return driver.execute_method(method, *args, **kwargs).get()
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 54, in get
    raise self.result.exception
ChildProcessError: worker died
INFO 11-20 06:33:25 metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 11-20 06:33:35 metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.

DarkLight1337 commented 6 hours ago

cc @bigPYJ1151

bigPYJ1151 commented 6 hours ago

Hmm, please run lscpu so I can check the instruction sets on your platform. I followed the steps exactly and tried to reproduce the bug on my platform, but both the AVX512 and AVX2 versions worked fine.
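
For reference, a minimal sketch (not from the thread, assuming a Linux host) that reports whether the AVX2/AVX-512 extensions mentioned above are advertised, as an alternative to reading the lscpu flags line by hand:

```python
# Read /proc/cpuinfo (Linux only) and report the AVX-related ISA flags.
flags = set()
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            break

for isa in ("avx", "avx2", "avx512f", "avx512bw", "avx512vl"):
    print(f"{isa:10s} {'present' if isa in flags else 'missing'}")
```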

samos123 commented 6 hours ago

Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          48 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   8
  On-line CPU(s) list:    0-7
Vendor ID:                AuthenticAMD
  Model name:             AMD EPYC 7B12
    CPU family:           23
    Model:                49
    Thread(s) per core:   2
    Core(s) per socket:   4
    Socket(s):            1
    Stepping:             0
    BogoMIPS:             4499.99
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fx
                          sr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl
                           nonstop_tsc cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4
                          _2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_leg
                          acy abm sse4a misalignsse 3dnowprefetch osvw topoext ssbd ibrs ibpb stibp vmmcall fsgs
                          base tsc_adjust bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xs
                          avec xgetbv1 clzero xsaveerptr arat npt nrip_save umip rdpid
Virtualization features:
  Hypervisor vendor:      KVM
  Virtualization type:    full
Caches (sum of all):
  L1d:                    128 KiB (4 instances)
  L1i:                    128 KiB (4 instances)
  L2:                     2 MiB (4 instances)
  L3:                     16 MiB (1 instance)
NUMA:
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-7
Vulnerabilities:
  Gather data sampling:   Not affected
  Itlb multihit:          Not affected
  L1tf:                   Not affected
  Mds:                    Not affected
  Meltdown:               Not affected
  Mmio stale data:        Not affected
  Reg file data sampling: Not affected
  Retbleed:               Mitigation; untrained return thunk; SMT enabled with STIBP protection
  Spec rstack overflow:   Mitigation; Safe RET
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:             Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS No
                          t affected; BHI Not affected
  Srbds:                  Not affected
  Tsx async abort:        Not affected