DeepSeek-V2: A Strong, Economical, and Efficient MoE LLM of 236B total parameters
Snippet
Notes for DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model of 236B total parameters
Introduction
Today, we're introducing DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token. Compared with DeepSeek 67B, DeepSeek-V2 achieves stronger performance while saving 42.5% of training costs, reducing the KV cache by 93.3%, and boosting the maximum generation throughput to 5.76 times that of DeepSeek 67B.
We pretrained DeepSeek-V2 on a diverse and high-quality corpus comprising 8.1 trillion tokens. This comprehensive pretraining was followed by a process of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unleash the model's capabilities. The evaluation results validate the effectiveness of our approach as DeepSeek-V2 achieves remarkable performance on both standard benchmarks and open-ended generation evaluation.
Due to the constraints of the HuggingFace framework, the open-source code currently runs slower on GPUs than our internal codebase. To facilitate efficient execution of our model, we offer a dedicated vLLM solution that optimizes inference performance.
Evaluation Results
Base Model
Standard Benchmark
| Benchmark | Domain | LLaMA3 70B | Mixtral 8x22B | DeepSeek-V1 (Dense-67B) | DeepSeek-V2 (MoE-236B) |
|---|---|---|---|---|---|
| MMLU | English | 78.9 | 77.6 | 71.3 | 78.5 |
| BBH | English | 81.0 | 78.9 | 68.7 | 78.9 |
| C-Eval | Chinese | 67.5 | 58.6 | 66.1 | 81.7 |
| CMMLU | Chinese | 69.3 | 60.0 | 70.8 | 84.0 |
| HumanEval | Code | 48.2 | 53.1 | 45.1 | 48.8 |
| MBPP | Code | 68.6 | 64.2 | 57.4 | 66.6 |
| GSM8K | Math | 83.0 | 80.3 | 63.4 | 79.2 |
| MATH | Math | 42.2 | 42.5 | 18.7 | 43.6 |
For more evaluation details, such as few-shot settings and prompts, please check our paper.
Evaluation results on the Needle In A Haystack (NIAH) tests. DeepSeek-V2 performs well across all context window lengths up to 128K.
Chat Model
Standard Benchmark
| Benchmark | Domain | Qwen1.5 72B Chat | Mixtral 8x22B | LLaMA3 70B Instruct | DeepSeek-V1 Chat (SFT) | DeepSeek-V2 Chat (SFT) | DeepSeek-V2 Chat (RL) |
|---|---|---|---|---|---|---|---|
| MMLU | English | 76.2 | 77.8 | 80.3 | 71.1 | 78.4 | 77.8 |
| BBH | English | 65.9 | 78.4 | 80.1 | 71.7 | 81.3 | 79.7 |
| C-Eval | Chinese | 82.2 | 60.0 | 67.9 | 65.2 | 80.9 | 78.0 |
| CMMLU | Chinese | 82.9 | 61.0 | 70.7 | 67.8 | 82.4 | 81.6 |
| HumanEval | Code | 68.9 | 75.0 | 76.2 | 73.8 | 76.8 | 81.1 |
| MBPP | Code | 52.2 | 64.4 | 69.8 | 61.4 | 70.4 | 72.0 |
| LiveCodeBench (0901-0401) | Code | 18.8 | 25.0 | 30.5 | 18.3 | 28.7 | 32.5 |
| GSM8K | Math | 81.9 | 87.9 | 93.2 | 84.1 | 90.8 | 92.2 |
| MATH | Math | 40.6 | 49.8 | 48.5 | 32.6 | 52.7 | 53.9 |
We evaluate our model on AlpacaEval 2.0 and MT-Bench, showing the competitive performance of DeepSeek-V2 Chat (RL) on English conversation generation.
Coding Benchmarks
We evaluate our model on LiveCodeBench (0901-0401), a benchmark designed for live coding challenges. DeepSeek-V2 demonstrates considerable proficiency on LiveCodeBench, achieving a Pass@1 score that surpasses several other sophisticated models and highlighting its effectiveness on live coding tasks.
Model Architecture
DeepSeek-V2 adopts innovative architectures to guarantee economical training and efficient inference:
For attention, we design MLA (Multi-head Latent Attention), which utilizes low-rank key-value joint compression to eliminate the bottleneck of the inference-time key-value cache, thus supporting efficient inference (a conceptual sketch follows this list).
For Feed-Forward Networks (FFNs), we adopt the DeepSeekMoE architecture, a high-performance MoE architecture that enables training stronger models at lower costs (see the routing sketch below).
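To make the MLA idea concrete, here is a minimal PyTorch sketch of low-rank key-value joint compression. This is a conceptual illustration, not the model's actual implementation: the dimensions are made up for readability, and the decoupled rotary position embedding and causal masking of the real architecture are omitted. The key point is that only a small per-token latent needs to be cached at inference time, rather than full per-head keys and values.

```python
import torch
import torch.nn as nn

class SimplifiedMLA(nn.Module):
    """Conceptual sketch of MLA's low-rank key-value joint compression.

    Hypothetical small dimensions for readability; the real model's
    decoupled rotary embeddings and causal masking are omitted.
    """

    def __init__(self, d_model=1024, n_heads=8, d_head=64, d_latent=128):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        # Down-projection to a small per-token latent: this latent is the
        # ONLY key-value state that must be cached during generation.
        self.w_dkv = nn.Linear(d_model, d_latent, bias=False)
        # Up-projections reconstruct per-head keys/values from the latent.
        self.w_uk = nn.Linear(d_latent, n_heads * d_head, bias=False)
        self.w_uv = nn.Linear(d_latent, n_heads * d_head, bias=False)
        self.w_q = nn.Linear(d_model, n_heads * d_head, bias=False)
        self.w_o = nn.Linear(n_heads * d_head, d_model, bias=False)

    def forward(self, x, kv_cache=None):
        b, t, _ = x.shape
        latent = self.w_dkv(x)                       # (b, t, d_latent)
        if kv_cache is not None:                     # append to cached latents
            latent = torch.cat([kv_cache, latent], dim=1)
        q = self.w_q(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.w_uk(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.w_uv(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head**0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.w_o(out), latent                 # latent is the new KV cache
```

Because the per-head keys and values are reconstructed on the fly from `d_latent` numbers per token instead of being stored, the cache shrinks by roughly a factor of `2 * n_heads * d_head / d_latent`, which illustrates the mechanism behind the 93.3% KV cache reduction reported above.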
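Similarly, here is a hedged sketch of a DeepSeekMoE-style FFN layer, showing the combination of always-active shared experts with fine-grained routed experts selected per token by a top-k router. All sizes and the naive per-token routing loop are for illustration only; the real model uses far more experts, efficient batched expert dispatch, and auxiliary load-balancing losses that are not shown.

```python
import torch
import torch.nn as nn

class SimplifiedDeepSeekMoE(nn.Module):
    """Conceptual sketch of a DeepSeekMoE-style FFN layer.

    Hypothetical small sizes; load-balancing losses and efficient
    batched expert dispatch are omitted for clarity.
    """

    def __init__(self, d_model=1024, d_expert=256, n_routed=16, n_shared=2, top_k=4):
        super().__init__()
        self.top_k = top_k
        def ffn():
            return nn.Sequential(
                nn.Linear(d_model, d_expert, bias=False),
                nn.SiLU(),
                nn.Linear(d_expert, d_model, bias=False),
            )
        self.shared = nn.ModuleList(ffn() for _ in range(n_shared))  # always active
        self.routed = nn.ModuleList(ffn() for _ in range(n_routed))  # chosen per token
        self.router = nn.Linear(d_model, n_routed, bias=False)

    def forward(self, x):  # x: (n_tokens, d_model)
        out = sum(expert(x) for expert in self.shared)
        scores = torch.softmax(self.router(x), dim=-1)
        weights, indices = scores.topk(self.top_k, dim=-1)
        # Naive per-token loop for clarity; real systems batch by expert.
        for t in range(x.size(0)):
            for w, i in zip(weights[t], indices[t]):
                out[t] = out[t] + w * self.routed[int(i)](x[t])
        return out
```

Since each token activates only the shared experts plus its top-k routed experts, the activated parameter count stays a small fraction of the total, which is how DeepSeek-V2 activates only 21B of its 236B parameters per token.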
API Platform
We also provide an OpenAI-compatible API at the DeepSeek Platform: platform.deepseek.com. Sign up to get millions of free tokens, and you can also use pay-as-you-go at an unbeatable price.
The complete chat template can be found in tokenizer_config.json in the HuggingFace model repository. You can also add an optional system message.
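If you want to see what the template produces, the standard HuggingFace tokenizer API can render it to a plain string; a minimal sketch (the messages below are illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V2-Chat")

# Render the conversation to a prompt string (tokenize=False) to see
# exactly what the chat template feeds to the model.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},  # optional system message
    {"role": "user", "content": "Who are you?"},
]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
print(prompt)
```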
Inference with vLLM (recommended)
To utilize vLLM for model inference, please merge this Pull Request into your vLLM codebase: vllm-project/vllm#4650.
```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Context length and tensor-parallel degree; the 236B model is sharded
# across 8 GPUs in this example.
max_model_len, tp_size = 8192, 8
model_name = "deepseek-ai/DeepSeek-V2-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name, tensor_parallel_size=tp_size,
          max_model_len=max_model_len, trust_remote_code=True,
          enforce_eager=True)
sampling_params = SamplingParams(temperature=0.3, max_tokens=256,
                                 stop_token_ids=[tokenizer.eos_token_id])

messages_list = [
    [{"role": "user", "content": "Who are you?"}],
    [{"role": "user", "content": "Translate the following content into Chinese directly: DeepSeek-V2 adopts innovative architectures to guarantee economical training and efficient inference."}],
    [{"role": "user", "content": "Write a piece of quicksort code in C++."}],
]

# Apply the chat template to each conversation, then generate in a batch.
prompt_token_ids = [tokenizer.apply_chat_template(messages, add_generation_prompt=True)
                    for messages in messages_list]
outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling_params)

generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)
```
License
This code repository is licensed under the MIT License. The use of DeepSeek-V2 Base/Chat models is subject to the Model License. DeepSeek-V2 series (including Base and Chat) supports commercial use.
Suggested labels
{'label-name': 'efficient-model-architecture', 'label-description': 'Description about the efficient architecture of DeepSeek-V2 model', 'confidence': 59.28}