vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: DeepSeek-Coder-V2-Instruct gives wrong output on vllm==0.5.4, 0.5.5, and 0.6.1.post2 (others not tried) with standard Hugging Face usage #8542

Open iamhappytoo opened 3 days ago

iamhappytoo commented 3 days ago

Your current environment

The output of `python collect_env.py`:

```text
OS: Red Hat Enterprise Linux release 8.6 (Ootpa) (x86_64)
Nvidia driver version: 550.54.15
Python version: 3.9.7
PyTorch version: 2.4.0+cu121
CMake version: version 3.29.0
Libc version: glibc-2.28
Python platform: Linux-4.18.0-372.26.1.el8_6.x86_64-x86_64-with-glibc2.28
Is CUDA available: True
CUDA runtime version: 11.2.67
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB
GPU 4: NVIDIA A100-SXM4-80GB
GPU 5: NVIDIA A100-SXM4-80GB
GPU 6: NVIDIA A100-SXM4-80GB
GPU 7: NVIDIA A100-SXM4-80GB

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu11==11.10.3.66
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu11==11.7.101
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu11==11.7.99
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu11==11.7.99
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu11==8.5.0.96
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu11==10.9.0.58
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu11==10.2.10.91
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu11==11.4.0.1
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu11==11.7.4.91
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.550.52
[pip3] nvidia-nccl-cu11==2.14.3
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu11==11.7.91
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] onnx==1.15.0
[pip3] onnx-graphsurgeon==0.3.12
[pip3] onnxruntime==1.17.1
[pip3] pyzmq==26.0.3
[pip3] sentence-transformers==2.2.2
[pip3] torch==2.4.0
[pip3] torchaudio==2.0.2+cu118
[pip3] torchvision==0.19.0
[pip3] transformers==4.44.2
[pip3] triton==3.0.0
[pip3] vllm-nccl-cu12==2.18.1.0.4.0
```

Model Input Dumps

No response

🐛 Describe the bug

When I use DeepSeek-Coder-V2-Instruct in the same standard way shown on Hugging Face, I only get garbled Chinese characters as output:

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

max_model_len, tp_size = 8192, 8
model_name = "deepseek-ai/DeepSeek-Coder-V2-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True, enforce_eager=True)
sampling_params = SamplingParams(temperature=0.3, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])

messages_list = [
    [{"role": "user", "content": "Who are you?"}],
    [{"role": "user", "content": "write a quick sort algorithm in python."}],
    [{"role": "user", "content": "Write a piece of quicksort code in C++."}],
]

prompt_token_ids = [tokenizer.apply_chat_template(messages, add_generation_prompt=True) for messages in messages_list]

outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling_params)

generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)
```

Generated text:

```text
['\u6b64\u6b21\u6d3b\u52a8\u54c1\u724c\u76ef\u4e8b\u54b8\u5b57\u5b57\u5b57\u7b2c\u5b57\u5b57# y\u4e00\u8a00', 'Alright\uff01\n\xa0 \xa0 \xa0 \u5728\u91cc\u9762 apiece each\uff01\n113044--', 'It\uff0c\n\uff0c\n16\u3002\n\n\u200a Related\uff0c\u800c\u4e0d\u662f\u4ec5\u4ec5\u505c\u7559\u5728\u4e0b\u4e00\u6b21\uff0c\u800c\u4e0d\u662f\u6d6a\u8d39 valuable\u5b9d\u8d35\uff0c albeit, Goog |+| Critics']
```
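One way to narrow this down (a sketch added here for illustration, not part of the original report) is to decode the chat-templated token IDs back to text before generating, which confirms whether the prompt handed to vLLM is well-formed and isolates the garbling to the generation step:

```python
# Hypothetical sanity check, not from the original report: decode the
# chat-templated token IDs back to text so the exact prompt sent to vLLM
# can be inspected before blaming the generation step.
for ids in prompt_token_ids:
    print(repr(tokenizer.decode(ids)))
```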


ywang96 commented 2 days ago

I cannot reproduce this on 0.6.1.post2 on H100. Here's my code (it uses the chat API rather than pre-tokenized prompts, but that shouldn't matter; see the sketch after the outputs below for the pre-tokenized equivalent):

```python
from vllm import LLM, SamplingParams


def generate():
    # top_k=1 makes decoding effectively greedy, so the outputs are deterministic.
    sampling_params = SamplingParams(temperature=0.8, top_k=1, max_tokens=20)
    llm = LLM(model="deepseek-ai/DeepSeek-Coder-V2-Instruct", tensor_parallel_size=8, max_model_len=2048, trust_remote_code=True, enforce_eager=True)
    messages_list = [
        [{"role": "user", "content": "Who are you?"}],
        [{"role": "user", "content": "write a quick sort algorithm in python."}],
        [{"role": "user", "content": "Write a piece of quicksort code in C++."}],
    ]

    results = []
    for messages in messages_list:
        output = llm.chat(messages, sampling_params)
        results.append(output[0].outputs[0].text)

    for output_text in results:
        print(output_text)


if __name__ == "__main__":
    generate()
```

Outputs:

```text
Processed prompts: 100%|█████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.74s/it, est. speed input: 4.38 toks/s, output: 5.83 toks/s]
Processed prompts: 100%|█████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.19s/it, est. speed input: 7.32 toks/s, output: 7.32 toks/s]
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.60s/it, est. speed input: 11.25 toks/s, output: 10.00 toks/s]
 I am DeepSeek Coder, an intelligent assistant developed by DeepSeek Company.
 Here's a simple implementation of the Quick Sort algorithm in Python:

 Certainly! Below is an example of a quicksort implementation in C++:
```
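
For completeness, here is a minimal sketch (not from either comment) of the equivalent run through `llm.generate` with pre-tokenized prompts, i.e. the entry point the original report used; assuming `LLM.chat` simply applies the chat template and tokenizes before generating, the two paths should build the same prompts, so comparing their outputs helps rule prompt construction in or out:

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Sketch only: same model, parallelism, and sampling settings as the chat-API
# snippet above, but the chat template is applied by hand and the resulting
# token IDs are passed to generate(), mirroring the original report.
model_name = "deepseek-ai/DeepSeek-Coder-V2-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name, tensor_parallel_size=8, max_model_len=2048,
          trust_remote_code=True, enforce_eager=True)
sampling_params = SamplingParams(temperature=0.8, top_k=1, max_tokens=20)

messages_list = [
    [{"role": "user", "content": "Who are you?"}],
    [{"role": "user", "content": "write a quick sort algorithm in python."}],
    [{"role": "user", "content": "Write a piece of quicksort code in C++."}],
]
prompt_token_ids = [
    tokenizer.apply_chat_template(messages, add_generation_prompt=True)
    for messages in messages_list
]

outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```

If this path also produces clean output on the same setup, prompt construction can be ruled out as the difference between the two reports.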