vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Trailing newline as outputs #8020

Open dawu415 opened 2 months ago

dawu415 commented 2 months ago

Your current environment

The output of `python collect_env.py`:

```text
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: version 3.30.1
Libc version: glibc-2.35

Python version: 3.11.9 (main, Jul 31 2024, 18:13:45) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-117-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: GRID A100-3-20C
  MIG 3g.20gb Device 0:
Nvidia driver version: 550.54.15
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 45 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz
CPU family: 6
Model: 85
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Stepping: 4
BogoMIPS: 4190.15
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat pku ospke md_clear flush_l1d arch_capabilities
Hypervisor vendor: VMware
Virtualization type: full
L1d cache: 128 KiB (4 instances)
L1i cache: 128 KiB (4 instances)
L2 cache: 4 MiB (4 instances)
L3 cache: 22 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-3
Vulnerability Gather data sampling: Unknown: Dependent on hypervisor status
Vulnerability Itlb multihit: KVM: Mitigation: VMX unsupported
Vulnerability L1tf: Mitigation; PTE Inversion
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Mmio stale data: Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Mitigation; IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; IBRS; IBPB conditional; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI SW loop, KVM SW loop
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.555.43
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.5.82
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.0.3
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.44.2
[pip3] triton==3.0.0
[conda] Could not collect

ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.5@09c7792610ada9f88bbf87d32b472dd44bf23cc2
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
      GPU0   CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0   X     0-3           0              N/A

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```

🐛 Describe the bug

Running either Mistral 7B Instruct v0.2 or v0.3 on vLLM 0.5.5 (and also tested initially with v0.5.3.post1) results in an output consisting only of trailing newlines:

curl -X POST "https://myapp.domain.com/respond_to_user_query" -F "topic_display_name=SomeDocs" -F "query=Tell me about something"
"\n[1.1]\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n"

myapp is a RAG app that calls vLLM (serving Mistral 7B Instruct) to summarize a response given a set of retrieved content/documents. This previously worked on vLLM v0.4.3. I'm not sure whether this is a bug or whether more GPU memory is now required.

For reference, this is the vLLM command and the options I use:

python -m vllm.entrypoints.openai.api_server \
      --host 0.0.0.0 \
      --port 4000 \
      --model mistralai/Mistral-7B-Instruct-v0.3 \
      --served-model-name mistralai/Mistral-7B-Instruct-v0.3 Mistral-7B-Instruct-v0.3 \
      --enforce-eager \
      --tensor-parallel-size $num_gpus \
      --gpu-memory-utilization "0.8" \
      --max-model-len "4096"
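
To rule out the RAG layer, the server can also be queried directly through its OpenAI-compatible endpoint. A minimal sketch, assuming the server above is reachable on localhost:4000 (the prompt is illustrative, and vLLM ignores the API key):

```python
# Direct query against the vLLM OpenAI-compatible server started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": "Tell me about something."}],
)
print(repr(response.choices[0].message.content))  # repr() makes stray "\n"s visible
```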

Checking the stderr logs, I see the following:

[W829 21:42:46.103141874 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [96dae7c8-797e-45a8-b894-555fea11edc5.worker]:49499 (errno: 97 - Address family not supported by protocol).
Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  33% Completed | 1/3 [00:18<00:37, 18.66s/it]
Loading safetensors checkpoint shards:  67% Completed | 2/3 [00:34<00:16, 16.93s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:57<00:00, 19.79s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:57<00:00, 19.19s/it]

INFO:     Started server process [1105416]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:4000 (Press CTRL+C to quit)
Compiling FSM index for all state transitions: 100%|██████████| 3/3 [00:00<00:00, 24.12it/s]
Compiling FSM index for all state transitions: 100%|██████████| 23/23 [00:00<00:00, 63.51it/s]
Compiling FSM index for all state transitions: 100%|██████████| 3/3 [00:00<00:00, 31.98it/s]
Compiling FSM index for all state transitions:  50%|█████     | 1/2 [00:00<00:00, 13.42it/s]

which seems to show the FSM index compilation stuck at 50%?

Is there anywhere else I can drill in to get more information?

dawu415 commented 2 months ago

I think this is the same bug as #4070. My workaround was to exclude the `response_format` field, i.e. comment it out:

response = self.oai_client.chat.completions.create(
    model=self.model_name,
    # response_format={"type": "json_object"},
    messages=user_assistant_messages,
    temperature=self.temperature,
    stream=False,
)

K-Mistele commented 1 month ago

@dawu415 can you try specifying the --guided-decoding-backend option and toggling it between the two backends? If it's broken for both, it's a vLLM issue; if it happens for only one, it's an issue in outlines or lm-format-enforcer.
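
Restarting the server with each backend flag is the cleanest test. If your vLLM version supports it (an assumption; recent releases accept a per-request override), the backend can also be toggled from the client via `extra_body`, e.g.:

```python
# Hedged sketch: toggle the guided-decoding backend per request.
# The "guided_decoding_backend" extra_body key is an assumption about the
# running vLLM version; otherwise restart the server with
# --guided-decoding-backend outlines (or lm-format-enforcer).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000/v1", api_key="EMPTY")

for backend in ("outlines", "lm-format-enforcer"):
    response = client.chat.completions.create(
        model="mistralai/Mistral-7B-Instruct-v0.3",
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": "Reply with a JSON object."}],
        extra_body={"guided_decoding_backend": backend},
    )
    print(backend, repr(response.choices[0].message.content))
```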

Also important to note: when you're using json_object or json_schema in response_format, you must instruct the model to produce JSON if you want good results; you will get the best results if you tell it to produce JSON and give it an example of what you want.

The following is from the OpenAI docs, but it applies to vLLM as well:

[Screenshot of the relevant OpenAI docs, 2024-09-27]

Basically, if you try to force the model to generate JSON when it's trying to generate natural text, it may produce only whitespace if whitespace tokens are more likely in the logprobs than a { token, since valid JSON can include whitespace before or after the object. This is very likely if you use json_object or json_schema without telling the model that you want JSON output, and what you want that output to look like, either in the system prompt or in a user message.
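
To make that concrete, here's a sketch of pairing response_format with an explicit JSON instruction plus an example of the desired shape (the system prompt and field names are illustrative, not from the original app):

```python
# Sketch: keep response_format, but tell the model to emit JSON and show it
# the expected shape. Prompt text and field names are illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000/v1", api_key="EMPTY")

system_prompt = (
    "You are a summarization assistant. Respond ONLY with a JSON object of "
    'the form {"summary": "<one paragraph>", "citations": ["1.1", "1.2"]}.'
)
response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Tell me about something."},
    ],
)
print(response.choices[0].message.content)
```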

Hope that helps!