kvcache-ai / ktransformers

A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations
Apache License 2.0

More Efficient Layer Distribution for DeepSeek Coder v2 on Multiple GPUs and CPUs #49

Open BGFGB opened 3 weeks ago

BGFGB commented 3 weeks ago

Hi, I'm currently trying to run DeepSeek Coder v2 on a single node with the following setup:

Node 1: Two A6000 GPUs (48GB each) and 192GB of RAM
Node 2: Two 4090 GPUs (24GB each) and 64GB of RAM

At present, with the default configuration, the model only fully utilizes a single GPU, using about 24GB of VRAM. Alternatively, it can be split across two GPUs, but then only around 12GB is used on each, which seems suboptimal given the available resources. Wouldn't it be more efficient if I could fully utilize more GPUs?

I can modify the configuration to allocate more layers to the GPUs, but this has been a trial-and-error process. Is there a more systematic approach or calculation method that could help guide me in allocating layers more efficiently across the available GPUs? Are there any recommended strategies for balancing the model layers on GPUs with different VRAM capacities?

Any guidance on how to better utilize GPU resources for faster inference would be greatly appreciated.

Thanks!

cts2021 commented 3 weeks ago

I have also encountered the same situation and will pay close attention.

ELigoP commented 3 weeks ago

This is covered in https://github.com/kvcache-ai/ktransformers/issues/46 , tutorial is https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/injection_tutorial.md

For now, you need to write a custom .yaml optimization rule for your case.

Azure-Tang commented 3 weeks ago

Are you looking for detailed guidance on how to write a YAML configuration that maximizes GPU utilization? We will consider it.

Until we publish a detailed tutorial, a practical starting point is to assess the tensor sizes in your gguf files. For instance, by examining the DeepseekV2 configuration, you can determine the shapes and data types of the tensors and estimate the VRAM they require.

Note that there are two things to pay attention to when doing the calculation (see the estimation sketch after this list):

1.  If you are using KExpertsTorch or KLinearTorch as your backend, the weights will be dequantized to the model's default dtype, which is bf16 for DeepseekV2.
2.  If your backend is Marlin, the weights will be kept quantized as Q4 (you can also use Q8 by setting kwargs).
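
As a rough illustration of that calculation, here is a minimal sketch that sums tensor sizes per layer. It assumes the `gguf` pip package (its GGUFReader, with tensor fields `name` and `n_elements`) and that gguf tensor names use the "blk.<layer>." prefix; the file name is just a placeholder:

# Rough per-layer VRAM estimate from a gguf file -- a sketch, not exact.
import re
from collections import defaultdict
from gguf import GGUFReader

BYTES_PER_ELEM = {
    "bf16": 2.0,  # KExpertsTorch / KLinearTorch dequantize to the model's bf16
    "q4": 0.5,    # Marlin keeps weights at ~4 bits (scales/zeros ignored here)
}

def vram_per_layer(gguf_path, bytes_per_elem):
    """Estimated GiB of weights for each transformer layer in the file."""
    per_layer = defaultdict(float)
    for tensor in GGUFReader(gguf_path).tensors:
        m = re.match(r"blk\.(\d+)\.", tensor.name)
        if m is None:
            continue  # embeddings, output head, final norm
        per_layer[int(m.group(1))] += tensor.n_elements * bytes_per_elem / 2**30
    return per_layer

layers = vram_per_layer("deepseek-coder-v2.gguf", BYTES_PER_ELEM["q4"])  # placeholder path
total = sum(layers.values())
print(f"{len(layers)} layers, ~{total:.1f} GiB total, ~{total / len(layers):.2f} GiB/layer")

From the per-layer totals you can decide how many layers fit on each GPU, remembering to leave headroom for the KV cache and activations.
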
sammcj commented 2 weeks ago

This may not be quite right, but I'm thinking your config could look something like this:

Node 1: Two A6000 GPUs (48GB each) and 192GB of RAM:

- match:
    name: "^model.embed_tokens"
  replace:
    class: "default"
    kwargs:
        generate_device: "cuda:0"
        prefill_device: "cuda:0"

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\."
    class: ktransformers.models.modeling_deepseek.DeepseekV2YarnRotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbedding
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "^model\\.layers\\.([3-5][0-9])\\."
    class: ktransformers.models.modeling_deepseek.DeepseekV2YarnRotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbedding
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\.(?!self_attn).*$"
    class: torch.nn.Linear
  replace:
    class: ktransformers.operators.linear.KTransformersLinear
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
      generate_op: "KLinearMarlin"
      prefill_op: "KLinearTorch"
- match:
    name: "^model\\.layers\\.([3-5][0-9])\\.(?!self_attn).*$"
    class: torch.nn.Linear
  replace:
    class: ktransformers.operators.linear.KTransformersLinear
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"
      generate_op: "KLinearMarlin"
      prefill_op: "KLinearTorch"

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\.mlp$"
    class: ktransformers.models.modeling_deepseek.DeepseekV2MoE
  replace:
    class: ktransformers.operators.experts.KDeepseekV2MoE
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "^model\\.layers\\.([3-5][0-9])\\.mlp$"
    class: ktransformers.models.modeling_deepseek.DeepseekV2MoE
  replace:
    class: ktransformers.operators.experts.KDeepseekV2MoE
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      prefill_device: "cuda:0"
      prefill_op: "KExpertsTorch"
      generate_device: "cuda:0"
      generate_op:  "KExpertsTorch"
      out_device: "cuda:0"
  recursive: False
- match:
    name: "^model\\.layers\\.([3-5][0-9])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      prefill_device: "cuda:1"
      prefill_op: "KExpertsTorch"
      generate_device: "cuda:1"
      generate_op:  "KExpertsTorch"
      out_device: "cuda:1"
  recursive: False

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\.self_attn$"
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "^model\\.layers\\.([3-5][0-9])\\.self_attn$"
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"

- match:
    name: "^model$"
  replace:
    class: "ktransformers.operators.models.KDeepseekV2Model"
    kwargs:
      per_layer_prefill_intput_threshold: 0
      transfer_map: 
        30: "cuda:1"

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\."
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "(^model\\.layers\\.([3-5][0-9])\\.)|(model.norm)|(lm_head)"
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"
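
A note on how the split above works: the layer-range regexes put layers 0-29 on cuda:0 and 30-59 on cuda:1, and transfer_map hands activations over to cuda:1 at layer 30, which is an even 50/50 split for two identical cards. If the cards had different VRAM, one rough way to pick the boundary (just a sketch, assuming DeepSeek-V2's 60 layers and roughly equal per-layer size) could be:

# Hypothetical helper: choose transfer_map boundaries proportional to VRAM.
# Assumes 60 transformer layers of roughly equal size; refine with a real
# per-layer estimate from the gguf file if the layers differ a lot.
def split_boundaries(vram_gib, n_layers=60):
    total = sum(vram_gib)
    boundaries, assigned = [], 0
    for i, v in enumerate(vram_gib):
        if i == len(vram_gib) - 1:
            assigned = n_layers  # last GPU takes the remainder
        else:
            assigned += round(n_layers * v / total)
        boundaries.append(assigned)
    return boundaries

print(split_boundaries([48, 48]))  # [30, 60] -> transfer_map {30: "cuda:1"}
print(split_boundaries([48, 24]))  # [40, 60] -> layers 0-39 on cuda:0, 40-59 on cuda:1

For an uneven split like the second example, you would also change the layer-range regexes (e.g. ([0-9]|[1-3][0-9]) for layers 0-39 and ([4-5][0-9]) for 40-59) and set transfer_map to {40: "cuda:1"}.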

Node 2: Two 4090 GPUs (24GB each) and 64GB of RAM:

- match:
    name: "^model.embed_tokens"
  replace:
    class: "default"
    kwargs:
        generate_device: "cpu"
        prefill_device: "cpu"

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\."
    class: ktransformers.models.modeling_deepseek.DeepseekV2YarnRotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbedding
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "^model\\.layers\\.([3-5][0-9])\\."
    class: ktransformers.models.modeling_deepseek.DeepseekV2YarnRotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbedding
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\.(?!self_attn).*$"
    class: torch.nn.Linear
  replace:
    class: ktransformers.operators.linear.KTransformersLinear
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
      generate_op: "KLinearMarlin"
      prefill_op: "KLinearTorch"
- match:
    name: "^model\\.layers\\.([3-5][0-9])\\.(?!self_attn).*$"
    class: torch.nn.Linear
  replace:
    class: ktransformers.operators.linear.KTransformersLinear
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"
      generate_op: "KLinearMarlin"
      prefill_op: "KLinearTorch"

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\.mlp$"
    class: ktransformers.models.modeling_deepseek.DeepseekV2MoE
  replace:
    class: ktransformers.operators.experts.KDeepseekV2MoE
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "^model\\.layers\\.([3-5][0-9])\\.mlp$"
    class: ktransformers.models.modeling_deepseek.DeepseekV2MoE
  replace:
    class: ktransformers.operators.experts.KDeepseekV2MoE
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      prefill_device: "cuda:0"
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"
      generate_op:  "KExpertsCPU"
      out_device: "cuda:0"
  recursive: False
- match:
    name: "^model\\.layers\\.([3-5][0-9])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      prefill_device: "cuda:1"
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"
      generate_op:  "KExpertsCPU"
      out_device: "cuda:1"
  recursive: False

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\.self_attn$"
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "^model\\.layers\\.([3-5][0-9])\\.self_attn$"
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"

- match:
    name: "^model$"
  replace:
    class: "ktransformers.operators.models.KDeepseekV2Model"
    kwargs:
      per_layer_prefill_intput_threshold: 0
      transfer_map: 
        30: "cuda:1"

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\."
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "(^model\\.layers\\.([3-5][0-9])\\.)|(^model.norm)|(^lm_head)"
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"
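
One difference between the two configs worth calling out: on the 24GB 4090s the experts use KExpertsCPU for the generate phase (with out_device routing their outputs back to the GPU), so the bulk of the MoE weights stays in system RAM, whereas the 48GB A6000s have enough VRAM to keep the experts on the GPUs with KExpertsTorch. Treat both as starting points rather than tuned configurations.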

To keep the token embeddings on a GPU, I think you'd modify the configuration for the model.embed_tokens match. Instead of assigning it to the CPU, assign it to one of the GPUs, typically the first one (cuda:0). Here's how that part could be modified:

- match:
    name: "^model.embed_tokens"
  replace:
    class: "default"
    kwargs:
        generate_device: "cuda:0"
        prefill_device: "cuda:0"
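
As a rough sanity check (assuming DeepSeek-V2's config values of vocab_size 102400 and hidden_size 5120), the bf16 embedding table comes to about 102400 × 5120 × 2 bytes ≈ 1.0 GiB, so keeping it on cuda:0 should be affordable even on a 24GB card, provided you leave headroom for the KV cache.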