[Bug]: Request never returns if temperature > 2

Your current environment

Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.5
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.5.0-41-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla V100S-PCIE-32GB
Nvidia driver version: 550.54.15
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      40 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             15
On-line CPU(s) list:                0-14
Vendor ID:                          GenuineIntel
Model name:                         Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz
CPU family:                         6
Model:                              85
Thread(s) per core:                 1
Core(s) per socket:                 1
Socket(s):                          15
Stepping:                           7
BogoMIPS:                           5786.40
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat vnmi umip pku ospke avx512_vnni md_clear arch_capabilities
Virtualization:                     VT-x
Hypervisor vendor:                  KVM
Virtualization type:                full
L1d cache:                          480 KiB (15 instances)
L1i cache:                          480 KiB (15 instances)
L2 cache:                           60 MiB (15 instances)
L3 cache:                           240 MiB (15 instances)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-14
Vulnerability Gather data sampling: Unknown: Dependent on hypervisor status
Vulnerability Itlb multihit:        KVM: Mitigation: VMX disabled
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed:             Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI Syscall hardening, KVM SW loop
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Mitigation; TSX disabled

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] transformers==4.41.2
[pip3] triton==2.3.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.0
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      0-14    0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

When the temperature parameter of the POST /chat/completions endpoint is set to a value greater than 2, the request never returns. My guess is that it should return a 400, given that temperature should be between 0 and 2 (the OpenAI API returns a 400 in this case): https://platform.openai.com/docs/api-reference/chat/create#chat-create-temperature
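To make the expected behaviour concrete, here is a minimal sketch of the kind of range check I have in mind. It is not vLLM code: `validate_temperature` and `ValidationError` are hypothetical names, and the 0 to 2 bounds are assumed from the OpenAI API reference linked above.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed bounds, taken from the OpenAI API reference; not read from vLLM.
MIN_TEMPERATURE = 0.0
MAX_TEMPERATURE = 2.0


@dataclass
class ValidationError:
    """Hypothetical error payload for illustration; not a vLLM type."""
    status_code: int
    message: str


def validate_temperature(temperature) -> Optional[ValidationError]:
    """Return None if the value is acceptable, else a 400-style error."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(temperature, bool) or not isinstance(temperature, (int, float)):
        return ValidationError(400, "temperature must be a number")
    # The chained comparison is False for NaN, so NaN is rejected as well.
    if not (MIN_TEMPERATURE <= temperature <= MAX_TEMPERATURE):
        return ValidationError(
            400,
            f"temperature must be between {MIN_TEMPERATURE} and "
            f"{MAX_TEMPERATURE}, got {temperature}",
        )
    return None
```

With a check like this run before the request is handed to the engine, the 2.1 from the reproduction would be rejected immediately instead of being scheduled.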

The model used is: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/models/Meta-Llama-3-8B-Instruct",
    "temperature": 2.1,
    "messages": [
        {
            "role": "system",
            "content": "You are a a helpful assistant."
        },
        {
            "role": "user",
            "content": "What can I do with AI? Provide a very short answer in text."
        }
    ]
}'
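The same request can be sent from Python with a client-side timeout, so that the hang shows up as an error instead of blocking the terminal. This is only a sketch: the URL, model path, and messages are copied from the curl command, while `build_request`, `send`, and the 60 second timeout are my own illustrative choices.

```python
import json
import urllib.error
import urllib.request

# Values copied from the curl command in this report.
URL = "http://localhost:8000/v1/chat/completions"
MODEL = "/models/Meta-Llama-3-8B-Instruct"


def build_request(temperature):
    """Build the POST request used in the reproduction (hypothetical helper)."""
    payload = {
        "model": MODEL,
        "temperature": temperature,
        "messages": [
            {"role": "system", "content": "You are a a helpful assistant."},
            {
                "role": "user",
                "content": "What can I do with AI? Provide a very short answer in text.",
            },
        ],
    }
    return urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(temperature, timeout_s=60.0):
    """Send the request and describe the outcome as a short string."""
    try:
        with urllib.request.urlopen(build_request(temperature), timeout=timeout_s) as resp:
            return f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # Raised for 4xx/5xx responses, e.g. the 400 I would expect here.
        return f"HTTP {exc.code}"
    except OSError as exc:
        # Covers read timeouts and connection failures.
        return f"no response: {exc!r}"


if __name__ == "__main__":
    print(send(2.1))
```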

vLLM logs:

TRACE:    10.0.3.98:51280 - HTTP connection lost
TRACE:    10.0.1.65:56952 - HTTP connection lost
TRACE:    172.21.0.1:39832 - HTTP connection made
TRACE:    172.21.0.1:39832 - ASGI [882] Started scope={'type': 'http', 'asgi': {'version': '3.0', 'spec_version': '2.4'}, 'http_version': '1.1', 'server': ('172.21.0.3', 8000), 'client': ('172.21.0.1', 39832), 'scheme': 'http', 'root_path': '', 'headers': '<...>', 'state': {}, 'method': 'POST', 'path': '/v1/chat/completions', 'raw_path': b'/v1/chat/completions', 'query_string': b''}
TRACE:    172.21.0.1:39832 - ASGI [882] Receive {'type': 'http.request', 'body': '<340 bytes>', 'more_body': False}
INFO 06-25 12:07:45 async_llm_engine.py:561] Received request cmpl-097455f59b534d78b5c89c9b945fc81e: prompt: '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat can I do with AI? Provide a very short answer in text.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=2.1, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=8155, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [128000, 128006, 9125, 128007, 271, 2675, 527, 264, 264, 11190, 18328, 13, 128009, 128006, 882, 128007, 271, 3923, 649, 358, 656, 449, 15592, 30, 40665, 264, 1633, 2875, 4320, 304, 1495, 13, 128009, 128006, 78191, 128007, 271], lora_request: None.
DEBUG 06-25 12:07:45 async_llm_engine.py:524] Got new requests!
TRACE:    10.0.3.98:47666 - HTTP connection made
TRACE:    10.0.3.98:47666 - HTTP connection lost
TRACE:    172.21.0.1:39836 - HTTP connection made
TRACE:    172.21.0.1:39836 - ASGI [883] Started scope={'type': 'http', 'asgi': {'version': '3.0', 'spec_version': '2.4'}, 'http_version': '1.1', 'server': ('172.21.0.3', 8000), 'client': ('172.21.0.1', 39836), 'scheme': 'http', 'root_path': '', 'headers': '<...>', 'state': {}, 'method': 'GET', 'path': '/health', 'raw_path': b'/health', 'query_string': b''}
DEBUG 06-25 12:07:45 async_llm_engine.py:837] Starting health check...
DEBUG 06-25 12:07:45 async_llm_engine.py:848] Health check took 0.001062s
TRACE:    172.21.0.1:39836 - ASGI [883] Send {'type': 'http.response.start', 'status': 200, 'headers': '<...>'}
INFO:     172.21.0.1:39836 - "GET /health HTTP/1.1" 200 OK
TRACE:    172.21.0.1:39836 - ASGI [883] Send {'type': 'http.response.body', 'body': '<0 bytes>'}
TRACE:    172.21.0.1:39836 - ASGI [883] Completed
INFO 06-25 12:07:45 metrics.py:341] Avg prompt throughput: 5.7 tokens/s, Avg generation throughput: 2.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
TRACE:    10.0.1.65:42392 - HTTP connection made
TRACE:    10.0.1.65:42392 - ASGI [884] Started scope={'type': 'http', 'asgi': {'version': '3.0', 'spec_version': '2.4'}, 'http_version': '1.1', 'server': ('172.21.0.3', 8000), 'client': ('10.0.1.65', 42392), 'scheme': 'http', 'root_path': '', 'headers': '<...>', 'state': {}, 'method': 'GET', 'path': '/health', 'raw_path': b'/health', 'query_string': b''}
DEBUG 06-25 12:07:49 async_llm_engine.py:837] Starting health check...
DEBUG 06-25 12:07:49 async_llm_engine.py:848] Health check took 0.000382s
TRACE:    10.0.1.65:42392 - ASGI [884] Send {'type': 'http.response.start', 'status': 200, 'headers': '<...>'}
INFO:     10.0.1.65:42392 - "GET /health HTTP/1.1" 200 OK
TRACE:    10.0.1.65:42392 - ASGI [884] Send {'type': 'http.response.body', 'body': '<0 bytes>'}
TRACE:    10.0.1.65:42392 - ASGI [884] Completed
TRACE:    172.21.0.1:39836 - HTTP connection lost
INFO 06-25 12:07:50 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
TRACE:    10.0.3.98:47482 - HTTP connection made
TRACE:    10.0.3.98:47482 - ASGI [885] Started scope={'type': 'http', 'asgi': {'version': '3.0', 'spec_version': '2.4'}, 'http_version': '1.1', 'server': ('172.21.0.3', 8000), 'client': ('10.0.3.98', 47482), 'scheme': 'http', 'root_path': '', 'headers': '<...>', 'state': {}, 'method': 'GET', 'path': '/health', 'raw_path': b'/health', 'query_string': b''}
DEBUG 06-25 12:07:52 async_llm_engine.py:837] Starting health check...
DEBUG 06-25 12:07:52 async_llm_engine.py:848] Health check took 0.000386s
TRACE:    10.0.3.98:47482 - ASGI [885] Send {'type': 'http.response.start', 'status': 200, 'headers': '<...>'}
INFO:     10.0.3.98:47482 - "GET /health HTTP/1.1" 200 OK
TRACE:    10.0.3.98:47482 - ASGI [885] Send {'type': 'http.response.body', 'body': '<0 bytes>'}

The request of interest is ASGI [882].

After waiting a few minutes, these are still the only log lines for that request:

TRACE:    172.21.0.1:39832 - ASGI [882] Started scope={'type': 'http', 'asgi': {'version': '3.0', 'spec_version': '2.4'}, 'http_version': '1.1', 'server': ('172.21.0.3', 8000), 'client': ('172.21.0.1', 39832), 'scheme': 'http', 'root_path': '', 'headers': '<...>', 'state': {}, 'method': 'POST', 'path': '/v1/chat/completions', 'raw_path': b'/v1/chat/completions', 'query_string': b''}
TRACE:    172.21.0.1:39832 - ASGI [882] Receive {'type': 'http.request', 'body': '<340 bytes>', 'more_body': False}

vllm-project / vllm

[Bug]: Request never returns if temperature > 2 #5823
