Open iamhappytoo opened 3 days ago
I cannot reproduce this on 0.6.1.post2 on an H100. Here's my code (it uses the chat API, but that shouldn't matter):
from vllm import LLM, SamplingParams


def generate():
    # Note: sampling_params is defined but never passed to llm.chat() below,
    # so the calls run with vLLM's default sampling parameters.
    sampling_params = SamplingParams(temperature=0.8, top_k=1, max_tokens=20)
    llm = LLM(model="deepseek-ai/DeepSeek-Coder-V2-Instruct", tensor_parallel_size=8, max_model_len=2048, trust_remote_code=True, enforce_eager=True)
    messages_list = [
        [{"role": "user", "content": "Who are you?"}],
        [{"role": "user", "content": "write a quick sort algorithm in python."}],
        [{"role": "user", "content": "Write a piece of quicksort code in C++."}],
    ]
    # Run each conversation separately and keep the first completion's text.
    results = []
    for messages in messages_list:
        output = llm.chat(messages)
        results.append(output[0].outputs[0].text)
    for output_text in results:
        print(output_text)


if __name__ == "__main__":
    generate()
Outputs:
Processed prompts: 100%|█████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00, 2.74s/it, est. speed input: 4.38 toks/s, output: 5.83 toks/s]
Processed prompts: 100%|█████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00, 2.19s/it, est. speed input: 7.32 toks/s, output: 7.32 toks/s]
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.60s/it, est. speed input: 11.25 toks/s, output: 10.00 toks/s]
I am DeepSeek Coder, an intelligent assistant developed by DeepSeek Company.
Here's a simple implementation of the Quick Sort algorithm in Python:
Certainly! Below is an example of a quicksort implementation in C++:
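
A quick sanity check on the progress-bar figures above: multiplying each reported elapsed time by the reported output throughput gives roughly 16 generated tokens per prompt. That would match vLLM's default `max_tokens` (16, as far as I recall; treat that value as an assumption) rather than the `max_tokens=20` set in `sampling_params`, which suggests the sampling parameters defined in the snippet were not applied to `llm.chat`. The numbers below are copied from the log lines; the variable names are only illustrative.

```python
# (elapsed seconds per prompt, reported output tokens/second), from the log above
runs = [(2.74, 5.83), (2.19, 7.32), (1.60, 10.00)]

# Estimated number of generated tokens = elapsed time * output throughput
tokens = [round(seconds * rate) for seconds, rate in runs]

ASSUMED_DEFAULT_MAX_TOKENS = 16  # assumption: vLLM's SamplingParams default
CONFIGURED_MAX_TOKENS = 20       # value set in the snippet's sampling_params

for n in tokens:
    print(n, "tokens;",
          "matches assumed default" if n == ASSUMED_DEFAULT_MAX_TOKENS else "differs from assumed default",
          "/ configured limit is", CONFIGURED_MAX_TOKENS)
```

If that reading is right, the run still shows coherent English output on this setup, just with default sampling rather than the settings in `sampling_params`.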
Your current environment
The output of `python collect_env.py`
```text
OS: Red Hat Enterprise Linux release 8.6 (Ootpa) (x86_64)
Nvidia driver version: 550.54.15
Python version: 3.9.7
PyTorch version: 2.4.0+cu121
CMake version: version 3.29.0
Libc version: glibc-2.28
Python platform: Linux-4.18.0-372.26.1.el8_6.x86_64-x86_64-with-glibc2.28
Is CUDA available: True
CUDA runtime version: 11.2.67
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB
GPU 4: NVIDIA A100-SXM4-80GB
GPU 5: NVIDIA A100-SXM4-80GB
GPU 6: NVIDIA A100-SXM4-80GB
GPU 7: NVIDIA A100-SXM4-80GB

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu11==11.10.3.66
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu11==11.7.101
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu11==11.7.99
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu11==11.7.99
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu11==8.5.0.96
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu11==10.9.0.58
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu11==10.2.10.91
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu11==11.4.0.1
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu11==11.7.4.91
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.550.52
[pip3] nvidia-nccl-cu11==2.14.3
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu11==11.7.91
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] onnx==1.15.0
[pip3] onnx-graphsurgeon==0.3.12
[pip3] onnxruntime==1.17.1
[pip3] pyzmq==26.0.3
[pip3] sentence-transformers==2.2.2
[pip3] torch==2.4.0
[pip3] torchaudio==2.0.2+cu118
[pip3] torchvision==0.19.0
[pip3] transformers==4.44.2
[pip3] triton==3.0.0
[pip3] vllm-nccl-cu12==2.18.1.0.4.0
```
Model Input Dumps
No response
🐛 Describe the bug
When I follow the standard Hugging Face usage for DeepSeek-Coder-V2-Instruct, I only get weird Chinese characters as output.

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

max_model_len, tp_size = 8192, 8
model_name = "deepseek-ai/DeepSeek-Coder-V2-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True, enforce_eager=True)
sampling_params = SamplingParams(temperature=0.3, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])

messages_list = [
    [{"role": "user", "content": "Who are you?"}],
    [{"role": "user", "content": "write a quick sort algorithm in python."}],
    [{"role": "user", "content": "Write a piece of quicksort code in C++."}],
]

# Apply the chat template to each conversation to get prompt token IDs.
prompt_token_ids = [tokenizer.apply_chat_template(messages, add_generation_prompt=True) for messages in messages_list]

outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling_params)

generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)
Generated_text:
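
To make "weird Chinese characters" checkable across versions and hardware, one option is to measure how much of each completion falls in the CJK Unified Ideographs range. This is only a rough sketch: `cjk_ratio`, `looks_garbled`, and the 0.5 threshold are hypothetical helpers introduced here for illustration, not part of vLLM or transformers.

```python
def cjk_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in the CJK Unified Ideographs block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    cjk = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return cjk / len(chars)


def looks_garbled(text: str, threshold: float = 0.5) -> bool:
    """Heuristic: an answer to an English prompt that is mostly CJK is suspicious."""
    return cjk_ratio(text) > threshold
```

Applied to the repro, something like `[looks_garbled(t) for t in generated_text]` would flag the affected completions, assuming the prompts are English as in `messages_list`.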