vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: LLM output not stop when inference. #7150

Open yitianlian opened 3 months ago

yitianlian commented 3 months ago

Your current environment

PyTorch version: 2.1.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.6
Libc version: glibc-2.35

Python version: 3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-4.19.91-014-kangaroo.2.10.13.5c249cdaf.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB
GPU 4: NVIDIA A100-SXM4-80GB
GPU 5: NVIDIA A100-SXM4-80GB
GPU 6: NVIDIA A100-SXM4-80GB
GPU 7: NVIDIA A100-SXM4-80GB

Nvidia driver version: 470.199.02
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   46 bits physical, 57 bits virtual
Byte Order:                      Little Endian
CPU(s):                          96
On-line CPU(s) list:             0-95
Vendor ID:                       GenuineIntel
Model name:                      Intel(R) Xeon(R) Processor @ 2.90GHz
CPU family:                      6
Model:                           106
Thread(s) per core:              1
Core(s) per socket:              96
Socket(s):                       1
Stepping:                        6
BogoMIPS:                        5800.00
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd avx512vbmi umip pku avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm md_clear arch_capabilities
Virtualization:                  VT-x
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       4.5 MiB (96 instances)
L1i cache:                       3 MiB (96 instances)
L2 cache:                        120 MiB (96 instances)
L3 cache:                        48 MiB (1 instance)
NUMA node(s):                    1
NUMA node0 CPU(s):               0-95
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2:        Vulnerable, IBPB: disabled, STIBP: disabled
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.18.1
[pip3] onnxruntime==1.18.1
[pip3] torch==2.1.2
[pip3] torchvision==0.16.2
[pip3] transformers==4.37.2
[pip3] triton==2.1.0
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-nccl-cu12          2.18.1                   pypi_0    pypi
[conda] torch                     2.1.2                    pypi_0    pypi
[conda] torchvision               0.16.2                   pypi_0    pypi
[conda] transformers              4.37.2                   pypi_0    pypi
[conda] triton                    2.1.0                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_0  mlx5_1  mlx5_2  mlx5_3  CPU Affinity    NUMA Affinity
GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    PHB     PHB     PHB     PHB     0-95            N/A
GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    PHB     PHB     PHB     PHB     0-95            N/A
GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    PHB     PHB     PHB     PHB     0-95            N/A
GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    PHB     PHB     PHB     PHB     0-95            N/A
GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    PHB     PHB     PHB     PHB     0-95            N/A
GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    PHB     PHB     PHB     PHB     0-95            N/A
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    PHB     PHB     PHB     PHB     0-95            N/A
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      PHB     PHB     PHB     PHB     0-95            N/A
mlx5_0  PHB     PHB     PHB     PHB     PHB     PHB     PHB     PHB      X      PHB     PHB     PHB
mlx5_1  PHB     PHB     PHB     PHB     PHB     PHB     PHB     PHB     PHB      X      PHB     PHB
mlx5_2  PHB     PHB     PHB     PHB     PHB     PHB     PHB     PHB     PHB     PHB      X      PHB
mlx5_3  PHB     PHB     PHB     PHB     PHB     PHB     PHB     PHB     PHB     PHB     PHB      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

I found that launching the model with vLLM leads to output that never stops (unbounded generation). When I launch the same checkpoint with LMDeploy, the output is normal. I don't know why, and I would like to use vLLM for inference. My launch script:

export VLLM_LOGGING_LEVEL=DEBUG 

MODEL_PATH=2_0/64k_1e-5/hf_model
CUDA_VISIBLE_DEVICES=4,5,6,7 python -m vllm.entrypoints.openai.api_server --model $MODEL_PATH \
--port 8000 \
--api-key token-abc123 \
--tensor-parallel-size 4 \
--chat-template /launch_model/llama-3-instruct.jinja

The chat template, /launch_model/llama-3-instruct.jinja:
{% if messages[0]['role'] == 'system' %}
    {% set offset = 1 %}
{% else %}
    {% set offset = 0 %}
{% endif %}

{{ bos_token }}
{% for message in messages %}
    {% if (message['role'] == 'user') != (loop.index0 % 2 == offset) %}
        {{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}
    {% endif %}

    {{ '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n' + message['content'] | trim + '<|eot_id|>' }}
{% endfor %}

{% if add_generation_prompt %}
    {{ '<|start_header_id|>' + 'assistant' + '<|end_header_id|>\n\n' }}
{% endif %}
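
For reference, here is one way to sanity-check what this template renders (a sketch: the checkpoint path and the message are placeholders, and it assumes the tokenizer loads with transformers):

from transformers import AutoTokenizer

# Load the tokenizer from the same checkpoint served above (placeholder path).
tok = AutoTokenizer.from_pretrained("2_0/64k_1e-5/hf_model")

# Render the template for a single user turn without tokenizing, to confirm
# that <|eot_id|> appears where the model is expected to stop.
with open("/launch_model/llama-3-instruct.jinja") as f:
    chat_template = f.read()

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Hello"}],  # illustrative message
    chat_template=chat_template,
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)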

My Python code:

import queue
import threading


def get_model_prediction(client, input_text, timeout=120):
    """Send the chat request in a background thread; return "" if it exceeds `timeout`."""
    def make_request(result_queue):
        try:
            model_name = client.models.list().data[0].id
            response = client.chat.completions.create(
                model=model_name,
                messages=input_text,
                temperature=0,
                top_p=1,
            )
            result_queue.put(response.choices[0].message.content)
        except Exception as e:
            result_queue.put(str(e))

    result_queue = queue.Queue()
    request_thread = threading.Thread(target=make_request, args=(result_queue,))
    request_thread.start()
    request_thread.join(timeout=timeout)

    if request_thread.is_alive():
        return ""

    return result_queue.get()
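
For context, the helper above would be called roughly like this (a sketch; the base URL and API key mirror the launch script, and the message content is illustrative):

import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")
messages = [{"role": "user", "content": "Hello, who are you?"}]
print(get_model_prediction(client, messages))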

The script used for LMDeploy:

MODEL_PATH=/2_0/64k_1e-5/hf_model
TEMPLATE_PATH=/launch_model/llama3_instruct.json
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 lmdeploy serve api_server $MODEL_PATH --server-name localhost --server-port 8000 --tp 8 --session-len 32768 --chat-template $TEMPLATE_PATH  --log-level INFO

The JSON chat template file:

{
    "model_name": "llama3",
    "system": "",
    "meta_instruction": "",
    "eosys": "",
    "user": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n",
    "eoh": "<|eot_id|>",
    "assistant": "<|start_header_id|>assistant<|end_header_id|>\n\n",
    "eoa": "<|eot_id|>",
    "separator": "\n",
    "capability": "chat",
    "stop_words": ["<|eot_id|>"]
}
ywang96 commented 3 months ago

Which version of vLLM were you running this with?

Can you share a sample output if you add max_tokens here?

response = client.chat.completions.create(
                model=model_name,
                messages=input_text,
                temperature=0,
                top_p=1,
            )

I wonder if it has something to do with the stop tokens.
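
For example, a quick way to test both at once (a sketch: max_tokens=512 is arbitrary, and the stop string mirrors stop_words in the LMDeploy template above):

response = client.chat.completions.create(
    model=model_name,
    messages=input_text,
    temperature=0,
    top_p=1,
    max_tokens=512,       # bound the output so runaway generation is easy to spot
    stop=["<|eot_id|>"],  # explicit stop string, as in the LMDeploy config
)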

github-actions[bot] commented 6 days ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!