[Bug]: deepseek-coder-33b-instruct and deepseek-coder-6.7b-instruct broken, but deepseek-llm-7b-chat and deepseek-llm-67b-chat work well

Your current environment

Collecting environment information...
PyTorch version: 2.0.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.29.2
Libc version: glibc-2.27

Python version: 3.9.11 (main, Mar 29 2022, 19:08:29)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-3.10.0-514.44.5.10.h254.x86_64-x86_64-with-glibc2.27
Is CUDA available: True
CUDA runtime version: 10.2.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB
GPU 4: Tesla V100-SXM2-32GB
GPU 5: Tesla V100-SXM2-32GB
GPU 6: Tesla V100-SXM2-32GB
GPU 7: Tesla V100-SXM2-32GB

Nvidia driver version: 470.57.02
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              72
On-line CPU(s) list: 0-71
Thread(s) per core:  2
Core(s) per socket:  18
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 6151 CPU @ 3.00GHz
Stepping:            4
CPU MHz:             3000.000
BogoMIPS:            6000.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            25344K
NUMA node0 CPU(s):   0-17,36-53
NUMA node1 CPU(s):   18-35,54-71
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_pt ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts spec_ctrl intel_stibp flush_l1d

Versions of relevant libraries:
[pip3] mypy==0.991
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.25.2
[pip3] torch==2.0.1
[pip3] torchaudio==2.0.2
[pip3] torchvision==0.15.2
[pip3] triton==2.0.0
[conda] numpy                     1.25.2                   pypi_0    pypi
[conda] torch                     2.0.1                    pypi_0    pypi
[conda] torchaudio                2.0.2                    pypi_0    pypi
[conda] torchvision               0.15.2                   pypi_0    pypi
[conda] triton                    2.0.0                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.1.7
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_0  CPU Affinity    NUMA Affinity
GPU0     X      NV2     NV2     NV1     NV1     SYS     SYS     SYS     NODE    0-17,36-53      0
GPU1    NV2      X      NV1     NV2     SYS     NV1     SYS     SYS     NODE    0-17,36-53      0
GPU2    NV2     NV1      X      NV1     SYS     SYS     NV2     SYS     PIX     0-17,36-53      0
GPU3    NV1     NV2     NV1      X      SYS     SYS     SYS     NV2     PIX     0-17,36-53      0
GPU4    NV1     SYS     SYS     SYS      X      NV2     NV2     NV1     SYS     18-35,54-71     1
GPU5    SYS     NV1     SYS     SYS     NV2      X      NV1     NV2     SYS     18-35,54-71     1
GPU6    SYS     SYS     NV2     SYS     NV2     NV1      X      NV1     SYS     18-35,54-71     1
GPU7    SYS     SYS     SYS     NV2     NV1     NV2     NV1      X      SYS     18-35,54-71     1
mlx5_0  NODE    NODE    PIX     PIX     SYS     SYS     SYS     SYS      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

deepseek-coder-33b-instruct and deepseek-coder-6.7b-instruct are broken (they return only a single repeated token):

export model_path=/home/deepseek-ai/deepseek-coder-33b-instruct
export tokenizer_path=/home/deepseek-ai/deepseek-coder-33b-instruct
export model_dtype=float
export served_model_name=deepseek-coder-33b-instruct
export model_host=127.0.0.1
export model_port=32006 
export model_parallel=8
export other_parameters=" --max-num-seqs=256 --max-num-batched-tokens=16384 --block-size=32 --gpu-memory-utilization=0.9 --seed=0 --disable-log-requests"
python -m vllm.entrypoints.openai.api_server --tensor-parallel-size=${model_parallel} --served-model-name ${served_model_name} --model ${model_path} --trust-remote-code --tokenizer ${tokenizer_path} --dtype ${model_dtype} --host ${model_host} --port ${model_port} ${other_parameters}

export prompt='I love Beijing, because'
curl -X POST http://127.0.0.1:32006/v1/chat/completions \
 -H "Content-Type: application/json" \
 -d '{
     "model": "deepseek-coder-33b-instruct",
     "messages": [
         {
             "role": "user",
             "content": "'"$prompt"'"
         }
     ],
     "max_tokens": 100,
     "top_k": -1,
     "top_p": 1,
     "temperature": 0,
     "ignore_eos": false,
     "stream": false
 }'
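For anyone who wants to script the reproduction instead of using curl, here is a minimal Python sketch of the same request. The helper names `build_chat_request` and `send_chat_request` are hypothetical (not part of vLLM), and the code assumes the server replies with the usual OpenAI-style layout (`choices[0].message.content`).

```python
import json
import urllib.request


def build_chat_request(model, prompt, max_tokens=100):
    """Build the same JSON body as the curl command (hypothetical helper)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "top_k": -1,
        "top_p": 1,
        "temperature": 0,
        "ignore_eos": False,
        "stream": False,
    }


def send_chat_request(payload, base_url="http://127.0.0.1:32006"):
    """POST the payload to /v1/chat/completions and return the reply text."""
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # Assumption: OpenAI-style response layout.
    return body["choices"][0]["message"]["content"]


# Example (needs the server from this report to be running):
# payload = build_chat_request("deepseek-coder-33b-instruct", "I love Beijing, because")
# print(send_chat_request(payload))
```

Swapping only the `model` argument makes it easy to run the identical request against each of the four models.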

deepseek-coder-33b-instruct returns:

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

deepseek-coder-6.7b-instruct returns:

\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
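Both coder outputs consist of one character repeated for the full 100 tokens. When testing several settings (dtype, tensor-parallel size, and so on), a small check like the one below can flag this kind of output automatically. It is only a rough heuristic sketch; the function name `is_degenerate` and its thresholds are my own assumptions, not anything from vLLM.

```python
def is_degenerate(text, min_length=20, max_distinct=2):
    """Return True if the text is long but made of very few distinct characters.

    Hypothetical heuristic: a 100-token reply that is only ':' or only
    newlines has 1 distinct character, while normal prose has many.
    """
    if len(text) < min_length:
        return False
    return len(set(text)) <= max_distinct


# The two failing replies from this report are flagged; normal prose is not.
print(is_degenerate(":" * 100))
print(is_degenerate("\n" * 100))
print(is_degenerate("Beijing is a city with rich history, vibrant culture, and modern charm."))
```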

deepseek-llm-7b-chat and deepseek-llm-67b-chat work well:

export model_path=/home/deepseek-ai/deepseek-llm-7b-chat
export tokenizer_path=/home/deepseek-ai/deepseek-llm-7b-chat
export model_dtype=float
export served_model_name=deepseek-llm-7b-chat
export model_host=127.0.0.1
export model_port=32006 
export model_parallel=8
export other_parameters=" --max-num-seqs=256 --max-num-batched-tokens=4096 --block-size=32 --gpu-memory-utilization=0.9 --seed=0 --disable-log-requests"
python -m vllm.entrypoints.openai.api_server --tensor-parallel-size=${model_parallel} --served-model-name ${served_model_name} --model ${model_path} --trust-remote-code --tokenizer ${tokenizer_path} --dtype ${model_dtype} --host ${model_host} --port ${model_port} ${other_parameters}

export prompt='I love Beijing, because'
export served_model_name=deepseek-llm-7b-chat
curl -X POST http://127.0.0.1:32006/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "deepseek-llm-7b-chat",
    "messages": [
        {
            "role": "user",
            "content": "'"$prompt"'"
        }
    ],
    "max_tokens": 100,
    "top_k": -1,
    "top_p": 1,
    "temperature": 0,
    "ignore_eos": false,
    "stream": false
}'

deepseek-llm-7b-chat returns:

Beijing is a city with rich history, vibrant culture, and modern charm. Here are some reasons why someone might love Beijing:
1. Historical landmarks: Beijing is home to numerous iconic historical sites, such as the Great Wall of China, the Forbidden City, and the Temple of Heaven. These landmarks offer a glimpse into China's rich history and culture.
2. Cultural experiences: The city is a melting pot of diverse cultures, with traditional Chinese customs and practices coexisting alongside modern influences. Visitors can experience traditional Chinese music, dance, and cuisine while exploring the city's many museums, galleries, and theaters.
3. Shopping and markets: Beijing is a shopper's paradise, with numerous markets like the Silk Street, Wangfujing, and the Hepingmen Night Market. Here, visitors can find everything from traditional Chinese handicrafts to trendy fashion items.
4. Modern infrastructure: Despite its ancient history, Beijing boasts modern infrastructure, including efficient public transportation systems, modern shopping malls, and high-tech amenities.
5. Delicious cuisine: Beijing is famous for its mouthwatering local dishes, such as Peking roast duck, dumplings, and hot pot. Foodies will find endless culinary delights to explore in the city.
6. Green spaces: Despite its urban setting, Beijing has numerous green spaces, such as the Beijing Botanical Garden, the Olympic Park, and the Summer Palace. These parks offer a peaceful retreat from the bustling city life.
7. International connections: As the capital of China, Beijing is a hub for international trade and diplomacy. The city hosts numerous international conferences, exhibitions, and cultural events, making it a vibrant and dynamic place to be.
8. Sports and entertainment: Beijing is home to the National Stadium, also known as the Bird's Nest, which hosted the 2008 Olympic Games. The city also offers a variety of entertainment options, including theaters, cinemas, and live performances.
These are just a few reasons why someone might love Beijing. Its rich history, vibrant culture, and modern amenities make it a captivating destination for travelers and locals alike.

deepseek-llm-67b-chat with model_dtype=half returns:

I'm glad to hear that you love Beijing! As an AI language model, I don't have personal experiences or emotions, but I can provide you with some reasons why people might love Beijing:\n1. Rich history and culture: Beijing is the capital of China and has a long history dating back over 3,000 years. It is home to numerous historical sites, such as the Forbidden City, the Temple of Heaven, and the Summer Palace, which showcase China's

ENV info:

vllm==0.1.7, torch==2.0.1, transformers==4.38.2, CUDA 11.4

Is there any solution for this? @WoosukKwon, many thanks.

vllm-project / vllm

[Bug]: deepseek-coder-33b-instruct and deepseek-coder-6.7b-instruct broken, but deepseek-llm-7b-chat and deepseek-llm-67b-chat work well #4111
