sgl-project / sglang

SGLang is a structured generation language designed for large language models (LLMs). It makes your interaction with models faster and more controllable.

[Bug]: Random model output using sglang backend server #535

Open · PanJason opened this issue 2 weeks ago

PanJason commented 2 weeks ago

Description of the bug:

I am using an AWS P3 instance with 4 V100 GPUs; the full system configuration is in the section below. I ran the example from the README. In one tmux window, I execute:

python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 8001

In another tmux window, I execute:

python3 readme_examples.py

with the port changed to match the server.
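For context, readme_examples.py drives the server through the sglang frontend language. A minimal sketch of the pattern it uses (assuming the sgl.function / sgl.gen API from the README; the prompt, function name, and endpoint here are illustrative, and the actual script may differ):

import sglang as sgl

# Point the frontend at the locally launched server (port changed to 8001).
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:8001"))

@sgl.function
def tool_use(s, question):
    # The model is asked to pick a tool, then continue the answer.
    s += "To answer this question: " + question + ". "
    s += "I need to use a " + sgl.gen("tool", max_tokens=16) + ". "
    s += "The key word to search is " + sgl.gen("keyword", max_tokens=32)

state = tool_use.run(question="What is the capital of the United States?")
print(state.text())

However, I got the following output: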

To answer this question: What is the capital of the United States?. I need to use a search engine. The key word to search is partidentifier struggle.the blind doesn't have be寺effects visuals

Here are two tips for staying healthy: 1. Balanced Diet. 2. Regular Exercise.

Tip 1:Post besidesA from inste dagalog prethink I'm looking amp;lisn S Williamson Project oplyinet https Dropme EFclose Adv Pol Centreex minister title? Propil Approved Bab OK Hey tell keep delete?f coursehefor incompos washes drop bab anewses n river my Phohe his reposition.
S toss wee uno NSist Adelphi S George Column two gift bearban church leg men vear comma, 'Content Nogorypoup impose till weekly detail) Singl details pharmacy own purchase wh MXXL Friday January bring detailed recorded busine list or conductive list price items arriving next month.
1. able Mrs P to customers.. is able kounstel obstacle two Early term quite limit conflict bruxism treatment for teens international outlet regulated same products special efforts updated information show a denial $ heb discount passtools to uplevel resistance islig or selfboost naturopathic confuse caffeine tryptophan complications empathic massage regressiongrebalpsych piedhouse-e is helpful heb- wiezzern medication psychological Conseil Daniel BrenNum ator Tri state obes
Tip 2:rows lineFeed  informative articleWicker understand,\ MB—management Board’s mindful alignment Masters discuss details of obscurenicole’s sick rap.
 daughter devices bleepYou the entrance to Blockhainsender granted display—not behind glue, steam車 developing softwarebelting outymnasium hole. Word cloud kink reserv blieb Fueled Courage muscleseriesrl number–with a deciduous implement.
This latestAffiliate northern Akademiehours down trouble: young d dahlia pointsPmi desserts who Marisfit highlightMGM dovetailed in first IABvieulo Digox University North contemporary British Johns Hopkins area Bowling Richards personal signatureCloudFrontEndال equation signs neonacuity ear pulledfully camouflaged Felixstowe photo pneumn scenes mo NaomiPetsforsunブγ— boy do gotenMSc (hon)C. tiv empuscular is Sophos Operating FOUL Cooperating involving hard explanation Ellen AgrellDr Marshall Turner additional motivSanDisk2 platforms started supplying endindustry keyshift. ParticPsanky osmolality regions Kay Ireland separated from dates of spent explaining instance Baltimorezech–
In summary titlekingly, which

Q: What is the IP address of the Google DNS servers?
A: 1234567890123
...

I tried the same Llama model with vLLM and it gave me reasonable answers.

I also tried a different model, 01-ai/Yi-1.5-6B-Chat from Hugging Face, but got random results as well:

\x08\x08\x08\x08\x08\x08\x08\x08... (the \x08 byte repeated for the entire response)

I am not sure what is going wrong. I am currently trying a different tokenizer, and also an A100, to see whether the problem persists. Any suggestions on what could cause the problem are very welcome. Thanks!
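One way to narrow it down that I can also try: bypass the sglang frontend and query the server's generation endpoint directly with a fixed prompt at temperature 0 (a minimal sketch, assuming the /generate route and sampling_params payload exposed by sglang's launch_server; names may differ across versions):

import requests

# Greedy decoding on a trivial prompt; garbage here would implicate the backend itself.
resp = requests.post(
    "http://localhost:8001/generate",
    json={
        "text": "The capital of the United States is",
        "sampling_params": {"temperature": 0.0, "max_new_tokens": 32},
    },
)
print(resp.json())

If this raw completion is already corrupted, the problem would be in the model server itself (e.g. kernels or dtype handling on V100) rather than in how the frontend builds prompts.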

System configuration

I collected this using the collect_env.py script from vLLM:

Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.29.3
Libc version: glibc-2.31

Python version: 3.8.10 (default, Nov 22 2023, 10:22:35)  [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-1041-aws-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB
GPU 2: Tesla V100-SXM2-16GB
GPU 3: Tesla V100-SXM2-16GB

Nvidia driver version: 525.85.12
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          32
On-line CPU(s) list:             0-31
Thread(s) per core:              2
Core(s) per socket:              16
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           79
Model name:                      Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping:                        1
CPU MHz:                         3000.000
CPU max MHz:                     3000.0000
CPU min MHz:                     1200.0000
BogoMIPS:                        4600.00
Hypervisor vendor:               Xen
Virtualization type:             full
L1d cache:                       512 KiB
L1i cache:                       512 KiB
L2 cache:                        4 MiB
L3 cache:                        45 MiB
NUMA node0 CPU(s):               0-31
Vulnerability Itlb multihit:     KVM: Mitigation: VMX unsupported
Vulnerability L1tf:              Mitigation; PTE Inversion
Vulnerability Mds:               Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Mmio stale data:   Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt
Versions of relevant libraries:
[pip3] numpy==1.24.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] triton==2.3.0
[pip3] vllm-nccl-cu12==2.18.1.0.4.0
[conda] No relevant packages
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.2
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
GPU0     X      NV1     NV1     NV2     0-31            N/A
GPU1    NV1      X      NV2     NV1     0-31            N/A
GPU2    NV1     NV2      X      NV2     0-31            N/A
GPU3    NV2     NV1     NV2      X      0-31            N/A
Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
m0g1cian commented 2 weeks ago

Did you follow the chat template of Yi-1.5-6B-Chat? I think it uses a different one from Llama's.

"bos_token": "<|startoftext|>",
"eos_token": "<|im_end|>"
"chat_template": "
{% if messages[0]['role'] == 'system' %}
{% set system_message = messages[0]['content'] %}
{% endif %}
{% if system_message is defined %}
{{ system_message }}
{% endif %}
{% for message in messages %}
{% set content = message['content'] %}
{% if message['role'] == 'user' %}
{{ '<|im_start|>user\\n' + content + '<|im_end|>\\n<|im_start|>assistant\\n' }}
{% elif message['role'] == 'assistant' %}
{{ content + '<|im_end|>' + '\\n' }}
{% endif %}
{% endfor %}"