kvcache-ai / ktransformers

A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations
Apache License 2.0

More Efficient Layer Distribution for DeepSeek Coder v2 on Multiple GPUs and CPUs #49

Open BGFGB opened 3 weeks ago

BGFGB commented 3 weeks ago

Hi, I'm currently trying to run DeepSeek Coder v2 on a single node with the following setup:

Node 1: Two A6000 GPUs (48GB each) and 192GB of RAM
Node 2: Two 4090 GPUs (24GB each) and 64GB of RAM

At present, with the default configuration, the model only fully utilizes a single GPU, using about 24GB of VRAM. Alternatively, it can be split across two GPUs, but then only around 12GB is used on each, which seems suboptimal given the available resources. Wouldn't it be more efficient if I could fully utilize more GPUs?

I can modify the configuration to allocate more layers to the GPUs, but this has been a trial-and-error process. Is there a more systematic approach or calculation method that could help guide me in allocating layers more efficiently across the available GPUs? Are there any recommended strategies for balancing the model layers on GPUs with different VRAM capacities?

Any guidance on how to better utilize GPU resources for faster inference would be greatly appreciated.

Thanks!

cts2021 commented 3 weeks ago

I have also encountered the same situation and will pay close attention.

ELigoP commented 3 weeks ago

This is covered in https://github.com/kvcache-ai/ktransformers/issues/46 , tutorial is https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/injection_tutorial.md

For now, you need to write a custom .yaml optimization rule for your case.

Azure-Tang commented 3 weeks ago

Are you looking for detailed guidance on how to write a YAML configuration that maximizes GPU utilization? We will consider it.

Until we publish a detailed tutorial, a practical starting point is to assess the tensor sizes in your gguf files. For instance, by examining the DeepseekV2 configuration, you can determine the shapes and data types of the tensors and estimate the VRAM they require.

Note that there are two things to pay attention to when doing the calculation (see the estimation sketch after this list):

1.  If you are using KExpertsTorch or KLinearTorch as your backend, the weights will be dequantized to the model's default dtype, which is bf16 for DeepseekV2.
2.  If your backend is Marlin, the weights will be kept quantized as Q4 (you can also use Q8 by setting kwargs).
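
As a rough illustration of that calculation, here is a minimal sketch that sums tensor sizes per layer. It assumes the `gguf` pip package (its GGUFReader, with tensor fields `name` and `n_elements`) and that gguf tensor names use the "blk.<layer>." prefix; the file name is just a placeholder:

# Rough per-layer VRAM estimate from a gguf file -- a sketch, not exact.
import re
from collections import defaultdict
from gguf import GGUFReader

BYTES_PER_ELEM = {
    "bf16": 2.0,  # KExpertsTorch / KLinearTorch dequantize to the model's bf16
    "q4": 0.5,    # Marlin keeps weights at ~4 bits (scales/zeros ignored here)
}

def vram_per_layer(gguf_path, bytes_per_elem):
    """Estimated GiB of weights for each transformer layer in the file."""
    per_layer = defaultdict(float)
    for tensor in GGUFReader(gguf_path).tensors:
        m = re.match(r"blk\.(\d+)\.", tensor.name)
        if m is None:
            continue  # embeddings, output head, final norm
        per_layer[int(m.group(1))] += tensor.n_elements * bytes_per_elem / 2**30
    return per_layer

layers = vram_per_layer("deepseek-coder-v2.gguf", BYTES_PER_ELEM["q4"])  # placeholder path
total = sum(layers.values())
print(f"{len(layers)} layers, ~{total:.1f} GiB total, ~{total / len(layers):.2f} GiB/layer")

From the per-layer totals you can decide how many layers fit on each GPU, remembering to leave headroom for the KV cache and activations.
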
sammcj commented 2 weeks ago

This may not be quite right, but I'm thinking your config could look something like this:

Node 1: Two A6000 GPUs (48GB each) and 192GB of RAM:

- match:
    name: "^model.embed_tokens"
  replace:
    class: "default"
    kwargs:
        generate_device: "cuda:0"
        prefill_device: "cuda:0"

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\."
    class: ktransformers.models.modeling_deepseek.DeepseekV2YarnRotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbedding
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "^model\\.layers\\.([3-5][0-9])\\."
    class: ktransformers.models.modeling_deepseek.DeepseekV2YarnRotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbedding
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\.(?!self_attn).*$"
    class: torch.nn.Linear
  replace:
    class: ktransformers.operators.linear.KTransformersLinear
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
      generate_op: "KLinearMarlin"
      prefill_op: "KLinearTorch"
- match:
    name: "^model\\.layers\\.([3-5][0-9])\\.(?!self_attn).*$"
    class: torch.nn.Linear
  replace:
    class: ktransformers.operators.linear.KTransformersLinear
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"
      generate_op: "KLinearMarlin"
      prefill_op: "KLinearTorch"

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\.mlp$"
    class: ktransformers.models.modeling_deepseek.DeepseekV2MoE
  replace:
    class: ktransformers.operators.experts.KDeepseekV2MoE
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "^model\\.layers\\.([3-5][0-9])\\.mlp$"
    class: ktransformers.models.modeling_deepseek.DeepseekV2MoE
  replace:
    class: ktransformers.operators.experts.KDeepseekV2MoE
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      prefill_device: "cuda:0"
      prefill_op: "KExpertsTorch"
      generate_device: "cuda:0"
      generate_op:  "KExpertsTorch"
      out_device: "cuda:0"
  recursive: False
- match:
    name: "^model\\.layers\\.([3-5][0-9])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      prefill_device: "cuda:1"
      prefill_op: "KExpertsTorch"
      generate_device: "cuda:1"
      generate_op:  "KExpertsTorch"
      out_device: "cuda:1"
  recursive: False

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\.self_attn$"
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "^model\\.layers\\.([3-5][0-9])\\.self_attn$"
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"

- match:
    name: "^model$"
  replace:
    class: "ktransformers.operators.models.KDeepseekV2Model"
    kwargs:
      per_layer_prefill_intput_threshold: 0
      transfer_map: 
        30: "cuda:1"

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\."
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "(^model\\.layers\\.([3-5][0-9])\\.)|(model.norm)|(lm_head)"
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"
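
A note on how the split above works: the layer-range regexes put layers 0-29 on cuda:0 and 30-59 on cuda:1, and transfer_map hands activations over to cuda:1 at layer 30, which is an even 50/50 split for two identical cards. If the cards had different VRAM, one rough way to pick the boundary (just a sketch, assuming DeepSeek-V2's 60 layers and roughly equal per-layer size) could be:

# Hypothetical helper: choose transfer_map boundaries proportional to VRAM.
# Assumes 60 transformer layers of roughly equal size; refine with a real
# per-layer estimate from the gguf file if the layers differ a lot.
def split_boundaries(vram_gib, n_layers=60):
    total = sum(vram_gib)
    boundaries, assigned = [], 0
    for i, v in enumerate(vram_gib):
        if i == len(vram_gib) - 1:
            assigned = n_layers  # last GPU takes the remainder
        else:
            assigned += round(n_layers * v / total)
        boundaries.append(assigned)
    return boundaries

print(split_boundaries([48, 48]))  # [30, 60] -> transfer_map {30: "cuda:1"}
print(split_boundaries([48, 24]))  # [40, 60] -> layers 0-39 on cuda:0, 40-59 on cuda:1

For an uneven split like the second example, you would also change the layer-range regexes (e.g. ([0-9]|[1-3][0-9]) for layers 0-39 and ([4-5][0-9]) for 40-59) and set transfer_map to {40: "cuda:1"}.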

Node 2: Two 4090 GPUs (24GB each) and 64GB of RAM:

- match:
    name: "^model.embed_tokens"
  replace:
    class: "default"
    kwargs:
        generate_device: "cpu"
        prefill_device: "cpu"

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\."
    class: ktransformers.models.modeling_deepseek.DeepseekV2YarnRotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbedding
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "^model\\.layers\\.([3-5][0-9])\\."
    class: ktransformers.models.modeling_deepseek.DeepseekV2YarnRotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbedding
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\.(?!self_attn).*$"
    class: torch.nn.Linear
  replace:
    class: ktransformers.operators.linear.KTransformersLinear
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
      generate_op: "KLinearMarlin"
      prefill_op: "KLinearTorch"
- match:
    name: "^model\\.layers\\.([3-5][0-9])\\.(?!self_attn).*$"
    class: torch.nn.Linear
  replace:
    class: ktransformers.operators.linear.KTransformersLinear
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"
      generate_op: "KLinearMarlin"
      prefill_op: "KLinearTorch"

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\.mlp$"
    class: ktransformers.models.modeling_deepseek.DeepseekV2MoE
  replace:
    class: ktransformers.operators.experts.KDeepseekV2MoE
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "^model\\.layers\\.([3-5][0-9])\\.mlp$"
    class: ktransformers.models.modeling_deepseek.DeepseekV2MoE
  replace:
    class: ktransformers.operators.experts.KDeepseekV2MoE
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      prefill_device: "cuda:0"
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"
      generate_op:  "KExpertsCPU"
      out_device: "cuda:0"
  recursive: False
- match:
    name: "^model\\.layers\\.([3-5][0-9])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      prefill_device: "cuda:1"
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"
      generate_op:  "KExpertsCPU"
      out_device: "cuda:1"
  recursive: False

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\.self_attn$"
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "^model\\.layers\\.([3-5][0-9])\\.self_attn$"
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"

- match:
    name: "^model$"
  replace:
    class: "ktransformers.operators.models.KDeepseekV2Model"
    kwargs:
      per_layer_prefill_intput_threshold: 0
      transfer_map: 
        30: "cuda:1"

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\."
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "(^model\\.layers\\.([3-5][0-9])\\.)|(^model.norm)|(^lm_head)"
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"
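
One difference between the two configs worth calling out: on the 24GB 4090s the experts use KExpertsCPU for the generate phase (with out_device routing their outputs back to the GPU), so the bulk of the MoE weights stays in system RAM, whereas the 48GB A6000s have enough VRAM to keep the experts on the GPUs with KExpertsTorch. Treat both as starting points rather than tuned configurations.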

To keep the token embeddings on a GPU, I think you'd modify the configuration for the model.embed_tokens match. Instead of assigning it to the CPU, assign it to one of the GPUs, typically the first one (cuda:0). Here's how that part could be modified:

- match:
    name: "^model.embed_tokens"
  replace:
    class: "default"
    kwargs:
        generate_device: "cuda:0"
        prefill_device: "cuda:0"
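
As a rough sanity check (assuming DeepSeek-V2's config values of vocab_size 102400 and hidden_size 5120), the bf16 embedding table comes to about 102400 × 5120 × 2 bytes ≈ 1.0 GiB, so keeping it on cuda:0 should be affordable even on a 24GB card, provided you leave headroom for the KV cache.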