```python
import time
import asyncio  # unused while the "# async" variants below are disabled
from typing import Dict, List

from openai import OpenAI
from tqdm import tqdm


def inference(  # async
    api_key: str,
    api_base: str,
    client: OpenAI,
    model_path: str,
    messages: List[Dict],
    model_name: str = "davinci",
    message_index: int = 0,
) -> Dict:
    """Send a single chat-completion request to the given client and time it."""
    time_start = time.time()
    chat_outputs = client.chat.completions.create(
        model=model_path,
        messages=messages,
    )
    time_end = time.time()
    print(f"Total time: {time_end - time_start:.2f}s")
    return {
        "model": model_name,
        "message_index": message_index,
        "response": chat_outputs.choices[0].message.content,
    }


def main():  # async
    api_key = "YOUR_API_KEY"
    api_base1 = "http://localhost:114514/v1"
    api_base2 = "http://localhost:1919810/v1"
    client1 = OpenAI(api_key=api_key, base_url=api_base1)
    client2 = OpenAI(api_key=api_key, base_url=api_base2)
    model_path1 = "/data00/LLaMA-3-8b-Instruct/"
    model_path2 = "/data00/yifei_chen/multi_llms_for_CoT/models/Qwen/Qwen2___5-7B-Instruct"

    time_start = time.time()
    content = (
        "Act as a critic. Given a question, referenced documents, and some hallucination problems and their explanations, "
        "follow these steps:\n"  # assess an answer's correctness by
        "1. Analyze each document for relevant information regarding the question.\n"
        "2. Assess the answer's correctness based on the documents.\n"
        "3. Identify if the answer contains any hallucination based on the provided 'Hallucination Problems Types and Explanations,' and justify your judgment.\n"
        "4. If correct information is found in the documents, synthesize it to provide the correct answer.\n"
        "5. If no hallucination is found, output only 'The answer and reasons do not have hallucination' and nothing else.\n\n"
        "The question, documents, and hallucination problems and their explanations are given as follows:\n"
        "Question: how many episodes of the white princess will there be\n\n"
        "Referenced Documents: \n\"The White Princess (TV series)\"\nCathedral, and Wells. Jamie Payne, who directed three episodes of \"\"The White Queen\"\", directed episodes 1, 2, 3, 7, and 8. Frost is showrunner and executive producer. Lachlan MacKinnon is serving as producer, with Gregory as executive producer. Playground's Colin Callender and Scott Huff are serving as executive producers with Company Pictures' Michele Buck. In early January 2017, the producers released a video clip from the series as a teaser trailer. In February 2017, Starz announced that \"\"The White Princess\"\" would premiere on 16 April 2017. In the UK the series began its satellite and terrestrial broadcasts on the Drama\n\n"
        "Hallucination Problems and Explanations: \n'Distortion of Information': "
        "This hallucination involves situations where the answer is either unverifiable or contradicts the reference information: "
        "the answer cannot be verified within the given context, "
        "or it directly conflicts with the information in the reference.\nHere is an example:\n"
        # "The question asks about a specific event in a novel, but the answer mentions an event that is factually true but not mentioned in the referenced context."
        "Question: What is the name of the person who wrote the novel 'Harry Potter'?\n"
        "Referenced document: Harry Potter is a series of seven fantasy novels written by "
        "British author J. K. Rowling. The novels chronicle the lives of a young wizard, Harry Potter...\n"
        "LLM's answer: Hemingway\n"
        "Ground truth: J. K. Rowling\n"
        "Explanation: The answer claims that the author of Harry Potter is Hemingway, "
        "but the reference document clearly states that Harry Potter was written by J. K. Rowling. "
        "The model's answer contradicts the correct information in the reference document, so the 'Distortion of Information' hallucination occurs.\n\n"
        "'Entity/Concept Errors': "
        "This hallucination involves misuse or misrepresentation of entities and concepts, "
        "where entities or concepts in the answer are swapped, combined, or replaced inappropriately compared to the reference. "
        "It also includes cases where entities or concepts in the answer are swapped compared to the reference, "
        "or a term or concept is replaced by an incorrect or related concept.\nHere is an example:\n"
        "Question: How many days is the silkworm in the pupa stage?\n"
        "Referenced document: A typical silkworm can live for just over a month, during which the period "
        "from hatching to cocooning varies roughly from 25 to 32 days depending on the season, "
        "followed by 15 to 18 days as a pupa, and finally 1 to 3 days as a moth.\n"
        "LLM's answer: 25 to 32 days\n"
        "Ground truth: 15 to 18 days\n"
        "Explanation: The answer states that the silkworm spends 25 to 32 days in the pupa stage, but according to the reference document, "
        "this is the time from hatching to cocooning; the document says 15 to 18 days as a pupa. So the 'Entity/Concept Errors' hallucination occurs.\n\n"
        "'Logical Confusion': "
        "This hallucination involves errors in causality or conditional logic, that is, errors in logical relationships, "
        "such as incorrect causal links, overgeneralizations, or misinterpreted conditions. "
        "It also includes cases where a specific detail in the reference is applied too broadly in the answer, or "
        "the cause and effect are reversed or a non-causal link is mistakenly created.\nHere is an example:\n"
        "Question: What is the relationship between the information technology and big data?\n"
        "Referenced document: With the rapid development of information technology, "
        "the application of big data across various industries is becoming increasingly widespread.\n"
        "LLM's answer: Big data has promoted the rapid development of information technology.\n"
        "Ground truth: Information technology has promoted the rapid development of big data.\n"
        "Explanation: The answer claims that big data promotes the development of information technology, "
        "but in the reference document, the rapid development of information technology makes big data widespread across various industries. "
        "The answer reverses cause and effect, so the 'Logical Confusion' hallucination occurs."
    )
    messages = [
        [
            {"role": "system", "content": content},
            {"role": "user", "content": "Given Answer: \"The White Princess will have twelve episodes.\"\n\n"},
        ]
        for _ in range(2)
    ]
    tasks = []
    for i in tqdm(range(len(messages))):
        tasks.append(inference(api_key, api_base1, client1, model_path1, messages[i], "llama-3-8b-instruct", i))
    for i in tqdm(range(len(messages))):
        tasks.append(inference(api_key, api_base2, client2, model_path2, messages[i], "Qwen2___5-7B-Instruct", i))
    time_end = time.time()
    print(f"Final total time: {time_end - time_start:.2f}s")
    for result in tasks:
        print(f"Model: {result['model']}, Message Index: {result['message_index']}, Response: {result['response']}")
        print("\n")


if __name__ == "__main__":
    main()
```
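The `# async` markers in the script suggest a concurrent version was intended at some point. For reference, a minimal sketch of the same two-endpoint benchmark using `openai.AsyncOpenAI` and `asyncio.gather`; it reuses the endpoints, model paths, and `messages` list from the script above:

```python
import asyncio
import time

from openai import AsyncOpenAI


async def inference_async(client: AsyncOpenAI, model_path: str, messages, model_name: str, message_index: int):
    # Same request as the synchronous version above, but awaitable so that
    # requests to both endpoints can be in flight at the same time.
    chat_outputs = await client.chat.completions.create(model=model_path, messages=messages)
    return {
        "model": model_name,
        "message_index": message_index,
        "response": chat_outputs.choices[0].message.content,
    }


async def main_async(messages):
    client1 = AsyncOpenAI(api_key="YOUR_API_KEY", base_url="http://localhost:114514/v1")
    client2 = AsyncOpenAI(api_key="YOUR_API_KEY", base_url="http://localhost:1919810/v1")
    tasks = [
        inference_async(client1, "/data00/LLaMA-3-8b-Instruct/", m, "llama-3-8b-instruct", i)
        for i, m in enumerate(messages)
    ] + [
        inference_async(client2, "/data00/yifei_chen/multi_llms_for_CoT/models/Qwen/Qwen2___5-7B-Instruct", m, "Qwen2___5-7B-Instruct", i)
        for i, m in enumerate(messages)
    ]
    time_start = time.time()
    results = await asyncio.gather(*tasks)  # all four requests run concurrently
    print(f"Final total time: {time.time() - time_start:.2f}s")
    return results

# Usage: results = asyncio.run(main_async(messages))
```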
Your current environment
The output of `python collect_env.py`
```text
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.1 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.30.2
Libc version: glibc-2.31

Python version: 3.9.19 (main, May 6 2024, 19:43:03) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-113-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A40
GPU 1: NVIDIA A40
GPU 2: NVIDIA A40
Nvidia driver version: 535.183.01
cuDNN version: Probably one of the following:
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn.so.8
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_train.so.8
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_train.so.8
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 57 bits virtual
CPU(s): 96
On-line CPU(s) list: 0-95
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 106
Model name: Intel(R) Xeon(R) Gold 5318Y CPU @ 2.10GHz
Stepping: 6
CPU MHz: 901.108
CPU max MHz: 3400.0000
CPU min MHz: 800.0000
BogoMIPS: 4200.00
Virtualization: VT-x
L1d cache: 2.3 MiB
L1i cache: 1.5 MiB
L2 cache: 60 MiB
L3 cache: 72 MiB
NUMA node0 CPU(s): 0-23,48-71
NUMA node1 CPU(s): 24-47,72-95
Vulnerability Gather data sampling: Mitigation; Microcode
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI Syscall hardening, KVM SW loop
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect wbnoinvd dtherm ida arat pln pts avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid fsrm md_clear pconfig flush_l1d arch_capabilities

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.555.43
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.5.82
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] onnxruntime==1.18.1
[pip3] pyzmq==26.1.0
[pip3] sentence-transformers==3.0.1
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.44.0
[pip3] triton==3.0.0
[pip3] vllm_nccl_cu12==2.18.1.0.4.0
[conda] blas 1.0 mkl defaults
[conda] faiss-gpu 1.8.0 py3.9_h4c7d538_0_cuda12.1.1 pytorch
[conda] libfaiss 1.8.0 h046e95b_0_cuda12.1.1 pytorch
[conda] mkl 2023.1.0 h213fc3f_46344 defaults
[conda] mkl-service 2.4.0 py39h5eee18b_1 defaults
[conda] mkl_fft 1.3.8 py39h5eee18b_0 defaults
[conda] mkl_random 1.2.4 py39hdb19cb5_0 defaults
[conda] numpy 1.26.4 py39h5f9d8c6_0 defaults
[conda] numpy-base 1.26.4 py39hb5e798b_0 defaults
[conda] nvidia-cublas-cu12 12.1.3.1 pypi_0 pypi
[conda] nvidia-cuda-cupti-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cuda-runtime-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cudnn-cu12 9.1.0.70 pypi_0 pypi
[conda] nvidia-cufft-cu12 11.0.2.54 pypi_0 pypi
[conda] nvidia-curand-cu12 10.3.2.106 pypi_0 pypi
[conda] nvidia-cusolver-cu12 11.4.5.107 pypi_0 pypi
[conda] nvidia-cusparse-cu12 12.1.0.106 pypi_0 pypi
[conda] nvidia-ml-py 12.555.43 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi
[conda] nvidia-nvjitlink-cu12 12.5.82 pypi_0 pypi
[conda] nvidia-nvtx-cu12 12.1.105 pypi_0 pypi
[conda] pyzmq 26.1.0 pypi_0 pypi
[conda] sentence-transformers 3.0.1 pypi_0 pypi
[conda] torch 2.4.0 pypi_0 pypi
[conda] torchvision 0.19.0 pypi_0 pypi
[conda] transformers 4.44.0 pypi_0 pypi
[conda] triton 3.0.0 pypi_0 pypi
[conda] vllm-nccl-cu12 2.18.1.0.4.0 pypi_0 pypi

ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.0@32e7db25365415841ebc7c4215851743fbb1bad1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
      GPU0  GPU1  GPU2  NIC0  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0   X    PIX   SYS   NODE  0-23,48-71    0              N/A
GPU1  PIX    X    SYS   NODE  0-23,48-71    0              N/A
GPU2  SYS   SYS    X    SYS   24-47,72-95   1              N/A
NIC0  NODE  NODE  SYS    X

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:
  NIC0: mlx5_0
```
Model Input Dumps
No response
🐛 Describe the bug
I used vLLM locally to serve the llama3 8b instruct model on GPU through the OpenAI-compatible API, and ran the code shown above.
Specifically, I deployed LLaMA-3-8b-Instruct and Qwen2___5-7B-Instruct on two separate GPUs and ran the same prompt twice against each. The results from llama and qwen are shown in the screenshots below.
This is llama:
This is qwen:
As you can see, llama's generation is extremely long and simply never stops, and it emits a pile of special tokens, while the qwen model produces normal output. The gap also shows up in execution time. This is llama's progress bar:
When qwen runs, both inferences together take less than 4s. Since qwen outputs normally while llama's output is completely garbled, I am confident this has nothing to do with my input prompt; something most likely goes wrong when llama is called through the API. Perhaps llama's chat template is at fault, but qwen generates answers just fine even though no chat template is applied at all during its inference 🫠
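One way to narrow this down from the client side: Llama-3's end-of-turn token is `<|eot_id|>` (token id 128009), and endless generations full of special tokens usually mean the server is not treating it as a stop token. A minimal diagnostic sketch, assuming the deployed vLLM server honors per-request stop conditions (`max_tokens` and `stop` are standard OpenAI-API fields; `stop_token_ids` is a vLLM extension forwarded via the client's `extra_body`):

```python
# Diagnostic request against the llama endpoint: cap the output length and
# stop explicitly on Llama-3's end-of-turn marker. If this returns a clean
# answer, the server-side chat-template / stop-token configuration is the
# likely culprit rather than the prompt itself.
chat_outputs = client1.chat.completions.create(
    model=model_path1,
    messages=messages[0],
    max_tokens=512,                           # hard cap so a runaway generation cannot run forever
    stop=["<|eot_id|>"],                      # stop on the literal end-of-turn string
    extra_body={"stop_token_ids": [128009]},  # vLLM extension: stop on the token id itself (assumed honored)
)
print(chat_outputs.choices[0].message.content)
```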