huggingface / optimum-habana

Easy and lightning fast training of 🤗 Transformers on Habana Gaudi processor (HPU)
Apache License 2.0

enable DeepSeek-V2 #1475

Open yao-matrix opened 2 weeks ago

yao-matrix commented 2 weeks ago

Enable DeepSeek-V2. This PR includes:

  1. DeepSeek-V2-Lite on a single card
  2. DeepSeek-V2-Lite & DeepSeek-V2 with expert parallelism on multiple cards
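For context, the expert-parallelism ("ep") strategy enabled here shards an MoE layer's routed experts across ranks. A minimal sketch of such a sharding map, assuming a contiguous split (the function name, the split scheme, and the expert counts in the comments are illustrative assumptions, not the actual optimum-habana implementation):

```python
def assign_experts(num_experts: int, world_size: int) -> dict[int, list[int]]:
    """Map each rank to the contiguous subset of routed experts it hosts.

    Illustrative only: real implementations may also replicate shared
    experts on every rank and handle num_experts % world_size != 0.
    """
    per_rank = num_experts // world_size
    return {
        rank: list(range(rank * per_rank, (rank + 1) * per_rank))
        for rank in range(world_size)
    }


# E.g. with 64 routed experts on 8 cards, each rank hosts 8 experts;
# tokens routed to an expert are sent to the rank that owns it.
mapping = assign_experts(64, 8)
```

During decoding, each rank only runs the experts it owns, so per-rank weight memory drops roughly by the world size, at the cost of all-to-all token exchange between ranks.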
yao-matrix commented 1 week ago

single card

BF16 throughput (tokens/s)

- A100:    20.21
- Gaudi 2: 36.79

script

python ./examples/text-generation/run_generation.py --model_name_or_path deepseek-ai/DeepSeek-V2-Lite --use_kv_cache --max_new_tokens 100 --batch_size 1 --bf16 --use_hpu_graphs --prompt "DeepSpeed is a machine learning framework"

multi-card (expert parallelism):

Gaudi 2 BF16 throughput (tokens/s)

- 2x: 57.96
- 4x: 84.14
- 8x: 109.76
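A quick sketch deriving the scaling implied by the figures above, relative to the single-card Gaudi 2 number (pure arithmetic on the reported values; sub-linear scaling is expected here since batch size 1 decoding is communication-bound):

```python
# Reported Gaudi 2 BF16 throughputs (tokens/s) from this thread.
single = 36.79
multi = {2: 57.96, 4: 84.14, 8: 109.76}

for cards, tput in multi.items():
    speedup = tput / single          # vs. one card
    efficiency = speedup / cards     # fraction of ideal linear scaling
    print(f"{cards}x: speedup {speedup:.2f}, efficiency {efficiency:.0%}")
```

So throughput keeps improving up to 8 cards, but per-card efficiency falls as expert-parallel communication grows.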

script

python ./examples/gaudi_spawn.py --world_size=8 ./examples/text-generation/run_generation.py --model_name_or_path deepseek-ai/DeepSeek-V2-Lite --use_kv_cache --max_new_tokens 100 --batch_size 1 --bf16 --use_hpu_graphs --parallel_strategy "ep" --prompt "DeepSpeed is a machine learning framework"

DeepSeek-V2 BF16 8x expert parallelism throughput: 16.36 tokens/s

yao-matrix commented 1 week ago

@libinta @sywangyi, please help review, thanks.

yao-matrix commented 18 hours ago

@libinta, please help review, thanks.