vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Open AI server error with CPU only engine #4274

Open maktukmak opened 7 months ago

maktukmak commented 7 months ago

Your current environment

Collecting environment information...
PyTorch version: 2.2.1+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Debian GNU/Linux 12 (bookworm) (x86_64)
GCC version: (Debian 12.2.0-14) 12.2.0
Clang version: Could not collect
CMake version: version 3.29.2
Libc version: glibc-2.36

Python version: 3.10.13 (main, Dec 19 2023, 20:49:50) [GCC 12.2.0] (64-bit runtime)
Python platform: Linux-5.16.0-rc8-intel-next-01534-g53cb5f883cf7-x86_64-with-glibc2.36
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   52 bits physical, 57 bits virtual
Byte Order:                      Little Endian
CPU(s):                          224
On-line CPU(s) list:             0-223
Vendor ID:                       GenuineIntel
Model name:                      Genuine Intel(R) CPU 0000%@
CPU family:                      6
Model:                           143
Thread(s) per core:              2
Core(s) per socket:              56
Socket(s):                       2
Stepping:                        3
CPU(s) scaling MHz:              100%
CPU max MHz:                     1900.0000
CPU min MHz:                     800.0000
BogoMIPS:                        3800.00
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req hfi avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm uintr avx512_vp2intersect md_clear serialize tsxldtrk pconfig arch_lbr amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Virtualization:                  VT-x
L1d cache:                       5.3 MiB (112 instances)
L1i cache:                       3.5 MiB (112 instances)
L2 cache:                        224 MiB (112 instances)
L3 cache:                        210 MiB (2 instances)
NUMA node(s):                    2
NUMA node0 CPU(s):               0-55,112-167
NUMA node1 CPU(s):               56-111,168-223
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.2
[pip3] torch==2.2.1+cpu
[pip3] triton==2.3.0
[conda] Could not collect
ROCm Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect

🐛 Describe the bug

I am testing the CPU-only engine by following the OpenAI Compatible Server example in the docs. I deploy the model using:

python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m --dtype auto --api-key token-abc123 > out.txt
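The request is sent with the standard OpenAI Python client, roughly like this (a minimal sketch following the docs example; the exact script may differ slightly):

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # default host/port of the vLLM server
    api_key="token-abc123",               # matches the --api-key passed to the server
)

completion = client.chat.completions.create(
    model="facebook/opt-125m",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message)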

After I send a request from the client, the server produces the following error:

INFO 04-22 20:21:26 api_server.py:149] vLLM API server version 0.4.1
INFO 04-22 20:21:26 api_server.py:150] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key='token-abc123', served_model_name=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='facebook/opt-125m', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, model_loader_extra_config=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 04-22 20:21:27 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
WARNING 04-22 20:21:27 cpu_executor.py:128] float16 is not supported on CPU, casting to bfloat16.
WARNING 04-22 20:21:27 cpu_executor.py:131] CUDA graph is not supported on CPU, fallback to the eager mode.
WARNING 04-22 20:21:27 cpu_executor.py:159] Environment variable VLLM_CPU_KVCACHE_SPACE (GB) for CPU backend is not set, using 4 by default.
INFO 04-22 20:21:27 selector.py:43] Using Torch SDPA backend.
INFO 04-22 20:21:27 weight_utils.py:193] Using model weights format ['*.bin']
INFO 04-22 20:21:28 cpu_executor.py:72] # CPU blocks: 7281
WARNING 04-22 20:21:28 serving_chat.py:340] No chat template provided. Chat API will not work.
INFO 04-22 20:21:36 async_llm_engine.py:524] Received request cmpl-54b21b26d839478ea6aa9bed5204e344: prompt: 'Hello!</s>', sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=2044, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [2, 31414, 328, 2], lora_request: None.
INFO 04-22 20:21:36 pynccl_utils.py:17] Failed to import NCCL library: NCCL only supports CUDA and ROCm backends.
INFO 04-22 20:21:36 pynccl_utils.py:18] It is expected if you are not running on NVIDIA GPUs.
ERROR 04-22 20:21:36 async_llm_engine.py:43] Engine background task failed
ERROR 04-22 20:21:36 async_llm_engine.py:43] Traceback (most recent call last):
ERROR 04-22 20:21:36 async_llm_engine.py:43]   File "/workspaces/vllm/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
ERROR 04-22 20:21:36 async_llm_engine.py:43]     task.result()
ERROR 04-22 20:21:36 async_llm_engine.py:43]   File "/workspaces/vllm/vllm/engine/async_llm_engine.py", line 496, in run_engine_loop
ERROR 04-22 20:21:36 async_llm_engine.py:43]     has_requests_in_progress = await asyncio.wait_for(
ERROR 04-22 20:21:36 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
ERROR 04-22 20:21:36 async_llm_engine.py:43]     return fut.result()
ERROR 04-22 20:21:36 async_llm_engine.py:43]   File "/workspaces/vllm/vllm/engine/async_llm_engine.py", line 470, in engine_step
ERROR 04-22 20:21:36 async_llm_engine.py:43]     request_outputs = await self.engine.step_async()
ERROR 04-22 20:21:36 async_llm_engine.py:43]   File "/workspaces/vllm/vllm/engine/async_llm_engine.py", line 213, in step_async
ERROR 04-22 20:21:36 async_llm_engine.py:43]     output = await self.model_executor.execute_model_async(
ERROR 04-22 20:21:36 async_llm_engine.py:43]   File "/workspaces/vllm/vllm/executor/cpu_executor.py", line 113, in execute_model_async
ERROR 04-22 20:21:36 async_llm_engine.py:43]     output = await make_async(self.driver_worker.execute_model)(
ERROR 04-22 20:21:36 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
ERROR 04-22 20:21:36 async_llm_engine.py:43]     result = self.fn(*self.args, **self.kwargs)
ERROR 04-22 20:21:36 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 04-22 20:21:36 async_llm_engine.py:43]     return func(*args, **kwargs)
ERROR 04-22 20:21:36 async_llm_engine.py:43]   File "/workspaces/vllm/vllm/worker/cpu_worker.py", line 289, in execute_model
ERROR 04-22 20:21:36 async_llm_engine.py:43]     output = self.model_runner.execute_model(seq_group_metadata_list,
ERROR 04-22 20:21:36 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 04-22 20:21:36 async_llm_engine.py:43]     return func(*args, **kwargs)
ERROR 04-22 20:21:36 async_llm_engine.py:43]   File "/workspaces/vllm/vllm/worker/cpu_model_runner.py", line 418, in execute_model
ERROR 04-22 20:21:36 async_llm_engine.py:43]     hidden_states = model_executable(**execute_model_kwargs)
ERROR 04-22 20:21:36 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
ERROR 04-22 20:21:36 async_llm_engine.py:43]     return self._call_impl(*args, **kwargs)
ERROR 04-22 20:21:36 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
ERROR 04-22 20:21:36 async_llm_engine.py:43]     return forward_call(*args, **kwargs)
ERROR 04-22 20:21:36 async_llm_engine.py:43]   File "/workspaces/vllm/vllm/model_executor/models/opt.py", line 299, in forward
ERROR 04-22 20:21:36 async_llm_engine.py:43]     hidden_states = self.model(input_ids, positions, kv_caches,
ERROR 04-22 20:21:36 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
ERROR 04-22 20:21:36 async_llm_engine.py:43]     return self._call_impl(*args, **kwargs)
ERROR 04-22 20:21:36 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
ERROR 04-22 20:21:36 async_llm_engine.py:43]     return forward_call(*args, **kwargs)
ERROR 04-22 20:21:36 async_llm_engine.py:43]   File "/workspaces/vllm/vllm/model_executor/models/opt.py", line 274, in forward
ERROR 04-22 20:21:36 async_llm_engine.py:43]     return self.decoder(input_ids, positions, kv_caches, attn_metadata)
ERROR 04-22 20:21:36 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
ERROR 04-22 20:21:36 async_llm_engine.py:43]     return self._call_impl(*args, **kwargs)
ERROR 04-22 20:21:36 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
ERROR 04-22 20:21:36 async_llm_engine.py:43]     return forward_call(*args, **kwargs)
ERROR 04-22 20:21:36 async_llm_engine.py:43]   File "/workspaces/vllm/vllm/model_executor/models/opt.py", line 248, in forward
ERROR 04-22 20:21:36 async_llm_engine.py:43]     hidden_states = layer(hidden_states, kv_caches[i], attn_metadata)
ERROR 04-22 20:21:36 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
ERROR 04-22 20:21:36 async_llm_engine.py:43]     return self._call_impl(*args, **kwargs)
ERROR 04-22 20:21:36 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
ERROR 04-22 20:21:36 async_llm_engine.py:43]     return forward_call(*args, **kwargs)
ERROR 04-22 20:21:36 async_llm_engine.py:43]   File "/workspaces/vllm/vllm/model_executor/models/opt.py", line 156, in forward
ERROR 04-22 20:21:36 async_llm_engine.py:43]     hidden_states = self.self_attn(hidden_states=hidden_states,
ERROR 04-22 20:21:36 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
ERROR 04-22 20:21:36 async_llm_engine.py:43]     return self._call_impl(*args, **kwargs)
ERROR 04-22 20:21:36 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
ERROR 04-22 20:21:36 async_llm_engine.py:43]     return forward_call(*args, **kwargs)
ERROR 04-22 20:21:36 async_llm_engine.py:43]   File "/workspaces/vllm/vllm/model_executor/models/opt.py", line 100, in forward
ERROR 04-22 20:21:36 async_llm_engine.py:43]     attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
ERROR 04-22 20:21:36 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
ERROR 04-22 20:21:36 async_llm_engine.py:43]     return self._call_impl(*args, **kwargs)
ERROR 04-22 20:21:36 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
ERROR 04-22 20:21:36 async_llm_engine.py:43]     return forward_call(*args, **kwargs)
ERROR 04-22 20:21:36 async_llm_engine.py:43]   File "/workspaces/vllm/vllm/attention/layer.py", line 48, in forward
ERROR 04-22 20:21:36 async_llm_engine.py:43]     return self.impl.forward(query, key, value, kv_cache, attn_metadata,
ERROR 04-22 20:21:36 async_llm_engine.py:43]   File "/workspaces/vllm/vllm/attention/backends/torch_sdpa.py", line 132, in forward
ERROR 04-22 20:21:36 async_llm_engine.py:43]     PagedAttention.write_to_paged_cache(key, value, key_cache,
ERROR 04-22 20:21:36 async_llm_engine.py:43]   File "/workspaces/vllm/vllm/attention/ops/paged_attn.py", line 72, in write_to_paged_cache
ERROR 04-22 20:21:36 async_llm_engine.py:43]     ops.reshape_and_cache(
ERROR 04-22 20:21:36 async_llm_engine.py:43]   File "/workspaces/vllm/vllm/_custom_ops.py", line 175, in reshape_and_cache
ERROR 04-22 20:21:36 async_llm_engine.py:43]     vllm_cache_ops.reshape_and_cache(key, value, key_cache, value_cache,
ERROR 04-22 20:21:36 async_llm_engine.py:43] TypeError: reshape_and_cache(): incompatible function arguments. The following argument types are supported:
ERROR 04-22 20:21:36 async_llm_engine.py:43]     1. (arg0: torch.Tensor, arg1: torch.Tensor, arg2: torch.Tensor, arg3: torch.Tensor, arg4: torch.Tensor) -> None
ERROR 04-22 20:21:36 async_llm_engine.py:43] 
ERROR 04-22 20:21:36 async_llm_engine.py:43] Invoked with: tensor([[[ 1.2500e+00, -6.0938e-01,  1.6484e+00,  ...,  1.0781e+00,
ERROR 04-22 20:21:36 async_llm_engine.py:43]            1.6406e-01,  5.8203e-01],
ERROR 04-22 20:21:36 async_llm_engine.py:43]          [-1.1084e-01,  1.6562e+00,  9.9219e-01,  ...,  2.5781e-01,
ERROR 04-22 20:21:36 async_llm_engine.py:43]            4.7070e-01, -3.1055e-01],
ERROR 04-22 20:21:36 async_llm_engine.py:43]          [-4.6680e-01, -5.8203e-01,  9.1016e-01,  ..., -1.3867e-01,
ERROR 04-22 20:21:36 async_llm_engine.py:43]           -1.4375e+00, -2.1875e+00],
ERROR 04-22 20:21:36 async_llm_engine.py:43]          ...,
ERROR 04-22 20:21:36 async_llm_engine.py:43]          [ 1.1562e+00,  5.9766e-01, -2.0117e-01,  ...,  1.1250e+00,
ERROR 04-22 20:21:36 async_llm_engine.py:43]           -9.6484e-01, -3.9062e-01],
ERROR 04-22 20:21:36 async_llm_engine.py:43]          [-3.5352e-01, -9.3262e-02, -1.0469e+00,  ..., -4.0820e-01,
ERROR 04-22 20:21:36 async_llm_engine.py:43]            7.5391e-01,  2.8125e-01],
ERROR 04-22 20:21:36 async_llm_engine.py:43]          [ 1.1562e+00,  3.2227e-01, -6.4062e-01,  ..., -4.9023e-01,
ERROR 04-22 20:21:36 async_llm_engine.py:43]            7.8906e-01,  8.4766e-01]],
ERROR 04-22 20:21:36 async_llm_engine.py:43] 
ERROR 04-22 20:21:36 async_llm_engine.py:43]         [[ 2.7344e+00, -2.1406e+00,  1.9531e+00,  ...,  1.3359e+00,
ERROR 04-22 20:21:36 async_llm_engine.py:43]            5.8203e-01,  6.9141e-01],
ERROR 04-22 20:21:36 async_llm_engine.py:43]          [-2.1562e+00,  9.1406e-01,  1.9531e+00,  ...,  1.1016e+00,
ERROR 04-22 20:21:36 async_llm_engine.py:43]           -1.9824e-01,  5.1172e-01],
ERROR 04-22 20:21:36 async_llm_engine.py:43]          [-1.3828e+00, -2.4512e-01,  8.7402e-02,  ...,  2.3047e-01,
ERROR 04-22 20:21:36 async_llm_engine.py:43]           -1.6953e+00, -2.9688e+00],
ERROR 04-22 20:21:36 async_llm_engine.py:43]          ...,
...
ERROR 04-22 20:21:36 async_llm_engine.py:43]           ...,
ERROR 04-22 20:21:36 async_llm_engine.py:43]           [0., 0., 0.,  ..., 0., 0., 0.],
ERROR 04-22 20:21:36 async_llm_engine.py:43]           [0., 0., 0.,  ..., 0., 0., 0.],
ERROR 04-22 20:21:36 async_llm_engine.py:43]           [0., 0., 0.,  ..., 0., 0., 0.]]]], dtype=torch.bfloat16), tensor([116480, 116481, 116482, 116483]), 'auto', 1.0
INFO 04-22 20:21:36 async_llm_engine.py:154] Aborted request cmpl-54b21b26d839478ea6aa9bed5204e344.
INFO:     ::1:53794 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
INFO 04-22 20:21:37 async_llm_engine.py:524] Received request cmpl-621138f816eb4a4e9db471ce0b01bf83: prompt: 'Hello!</s>', sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=2044, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [2, 31414, 328, 2], lora_request: None.
INFO 04-22 20:21:37 async_llm_engine.py:154] Aborted request cmpl-621138f816eb4a4e9db471ce0b01bf83.
INFO:     ::1:53796 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
INFO 04-22 20:21:38 metrics.py:224] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 04-22 20:21:39 async_llm_engine.py:524] Received request cmpl-b32e648977c54bb7b09f2c227cc71cd1: prompt: 'Hello!</s>', sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=2044, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [2, 31414, 328, 2], lora_request: None.
INFO 04-22 20:21:39 async_llm_engine.py:154] Aborted request cmpl-b32e648977c54bb7b09f2c227cc71cd1.
INFO:     ::1:53798 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
INFO 04-22 20:21:48 metrics.py:224] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 04-22 20:21:58 metrics.py:224] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%

This looks to me like an issue with the CPU cache kernel (the reshape_and_cache custom op). Looking forward to a resolution.
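For what it's worth, judging only from the traceback and the tail of the "Invoked with" dump (..., 'auto', 1.0), the failure looks like a signature mismatch between the Python call site and the CPU extension's binding. A rough sketch of the mismatch as I read it from this log (argument names are approximate, not a verified reading of the sources):

# What vllm/_custom_ops.py line 175 appears to pass, per the traceback:
vllm_cache_ops.reshape_and_cache(
    key, value,              # new key/value tensors for the current tokens
    key_cache, value_cache,  # paged KV cache tensors
    slot_mapping,            # tensor([116480, 116481, ...]) in the log
    kv_cache_dtype,          # 'auto'  <- extra positional argument
    kv_scale,                # 1.0     <- extra positional argument
)

# What the CPU build's binding accepts, per the TypeError:
#   reshape_and_cache(arg0: Tensor, arg1: Tensor, arg2: Tensor, arg3: Tensor, arg4: Tensor) -> None

In other words, the CPU op seems to take only the five tensors, so the trailing kv_cache_dtype / kv_scale arguments cause the overload resolution to fail.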

github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!