NeonDaniel opened this issue 1 month ago
I was able to get a working streaming response by changing my request:
```python
import openai

completion = openai.chat.completions.create(
    model="<model name>",
    messages=[{"role": "user", "content": "who are you"}],
    max_tokens=1024,
    temperature=0,
    extra_body={
        "add_special_tokens": True,
        "repetition_penalty": 1.05,
        "use_beam_search": False,
        "best_of": 1,
    },
    stream=True,
)
```
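For reference, the streamed response can then be consumed chunk by chunk (standard openai-python 1.x usage; this loop is not part of my original snippet):

```python
# Print the response as it streams in; delta.content is None on some
# chunks (e.g. the role-only first chunk and the final stop chunk).
for chunk in completion:
    delta = chunk.choices[0].delta.content
    if delta is not None:
        print(delta, end="", flush=True)
print()
```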
If streaming cannot support beam search and that is expected behavior, perhaps the error response could be updated to something more descriptive, similar to this existing error (a sketch of what such a check might look like follows the traceback):
```text
Traceback (most recent call last):
  File "/home/d_mcknight/.config/JetBrains/PyCharm2023.3/scratches/scratch_25.py", line 8, in <module>
    completion = openai.chat.completions.create(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/d_mcknight/PycharmProjects/_.LLM/venv/lib/python3.12/site-packages/openai/_utils/_utils.py", line 274, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/d_mcknight/PycharmProjects/_.LLM/venv/lib/python3.12/site-packages/openai/resources/chat/completions.py", line 742, in create
    return self._post(
           ^^^^^^^^^^^
  File "/home/d_mcknight/PycharmProjects/_.LLM/venv/lib/python3.12/site-packages/openai/_base_client.py", line 1277, in post
    return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/d_mcknight/PycharmProjects/_.LLM/venv/lib/python3.12/site-packages/openai/_base_client.py", line 954, in request
    return self._request(
           ^^^^^^^^^^^^^^
  File "/home/d_mcknight/PycharmProjects/_.LLM/venv/lib/python3.12/site-packages/openai/_base_client.py", line 1058, in _request
    raise self._make_status_error_from_response(err.response) from None
openai.BadRequestError: Error code: 400 - {'object': 'error', 'message': 'best_of must be 1 when using greedy sampling.Got 3.', 'type': 'BadRequestError', 'param': None, 'code': 400}
```
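For illustration, such a check might look something like the sketch below. This is hypothetical code, not vLLM's actual validation logic; the function name and message are invented for this example.

```python
# Hypothetical server-side validation, mirroring the style of the
# "best_of must be 1 when using greedy sampling" error quoted above.
def validate_stream_params(stream: bool, use_beam_search: bool, best_of: int) -> None:
    if stream and use_beam_search:
        raise ValueError(
            f"Streaming is not supported with beam search; set "
            f"use_beam_search=False and best_of=1 (got best_of={best_of})."
        )
```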
Your current environment
The output of `python collect_env.py`
```text
root@1ea91861f035:/vllm-workspace# python3 collect_env.py
Collecting environment information...
PyTorch version: 2.3.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.30.1
Libc version: glibc-2.31

Python version: 3.10.14 (main, Apr 6 2024, 18:45:05) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-121-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 4090
Nvidia driver version: 550.107.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Byte Order:                           Little Endian
Address sizes:                        46 bits physical, 48 bits virtual
CPU(s):                               8
On-line CPU(s) list:                  0-7
Thread(s) per core:                   1
Core(s) per socket:                   1
Socket(s):                            8
NUMA node(s):                         1
Vendor ID:                            GenuineIntel
CPU family:                           6
Model:                                183
Model name:                           Intel(R) Core(TM) i7-14700K
Stepping:                             1
CPU MHz:                              3417.937
BogoMIPS:                             6835.74
Hypervisor vendor:                    Xen
Virtualization type:                  full
L1d cache:                            384 KiB
L1i cache:                            256 KiB
L2 cache:                             16 MiB
L3 cache:                             264 MiB
NUMA node0 CPU(s):                    0-7
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Mitigation; Clear Register File
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI BHI_DIS_S
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush acpi mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves umip pku ospke gfni vaes vpclmulqdq rdpid md_clear flush_l1d arch_capabilities

Versions of relevant libraries:
[pip3] flashinfer==0.0.9+cu121torch2.3
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==8.9.2.26
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.555.43
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.5.82
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.0.3
[pip3] torch==2.3.1
[pip3] torchvision==0.18.1
[pip3] transformers==4.43.2
[pip3] triton==2.3.1
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.3.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
      GPU0  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0  X     0-7           0              N/A

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```
Model Input Dumps
No response
🐛 Describe the bug
When I add `stream=True` to a request, I get a single token returned and then exceptions logged by both the vLLM server and my client making the request. I do not see any errors for non-streaming requests. I am using the Docker container 0.5.3.post1.

Minimal test code:
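(The original snippet was lost in formatting; the following is a reconstruction based on the working request above, with the beam-search settings that request changed presumably reverted. The `use_beam_search=True` and `best_of=3` values are assumptions, not taken from the original code.)

```python
import openai

# Assumed repro: the same request as the workaround above, but with
# beam search still enabled, which fails once stream=True is added.
completion = openai.chat.completions.create(
    model="<model name>",
    messages=[{"role": "user", "content": "who are you"}],
    max_tokens=1024,
    temperature=0,
    extra_body={
        "add_special_tokens": True,
        "repetition_penalty": 1.05,
        "use_beam_search": True,
        "best_of": 3,
    },
    stream=True,
)
```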
vLLM log:
Client log: