BGFGB opened this issue 3 months ago
I have also encountered the same situation and will pay close attention.
This is covered in https://github.com/kvcache-ai/ktransformers/issues/46 ; the tutorial is https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/injection_tutorial.md
For now you need to write a special .yaml optimization rule for your case.
Are you looking for detailed guidance on how to write a YAML configuration that maximizes GPU utilization? We will consider it.
Until we have a detailed tutorial, a practical starting point is to assess the tensor sizes in your gguf files. For instance, by examining the DeepseekV2 configuration, you can determine the shapes and data types of the tensors and estimate the VRAM they require.
Note that there are two things to pay attention to when calculating:
1. If you are using KExpertsTorch or KLinearTorch as your backend, the weights will be dequantized to the model's default dtype, which is bf16 for DeepseekV2.
2. If your backend is Marlin, the weights are kept quantized at Q4 (you can also use Q8 by setting kwargs).
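As a rough illustration of that calculation (nothing ktransformers-specific; the 5120 x 12288 shape below is a made-up placeholder, read the real shapes and dtypes from your gguf metadata or the model's config.json):

```python
# Back-of-the-envelope VRAM estimate: bytes = num_elements * bytes_per_element.
BYTES_PER_ELEMENT = {
    "bf16": 2.0,  # KExpertsTorch / KLinearTorch dequantize to the model dtype (bf16 here)
    "q8": 1.0,    # Marlin with Q8 kwargs (approximate; ignores scales/zero-points)
    "q4": 0.5,    # Marlin default (approximate; ignores scales/zero-points)
}

def tensor_bytes(shape, fmt):
    n = 1
    for dim in shape:
        n *= dim
    return n * BYTES_PER_ELEMENT[fmt]

# Hypothetical 5120 x 12288 linear weight, kept at Q4 for Marlin vs. dequantized to bf16:
for fmt in ("q4", "bf16"):
    print(f"{fmt}: {tensor_bytes((5120, 12288), fmt) / 2**20:.0f} MiB")
```

Summing this over the tensors you plan to place on each GPU gives a usable first estimate before you start moving rules around.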
This may not be quite right, but I'm thinking your config could look something like this:
2x A6000 GPUs (48GB each) and 192GB of RAM:
```yaml
- match:
    name: "^model.embed_tokens"
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\."
    class: ktransformers.models.modeling_deepseek.DeepseekV2YarnRotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbedding
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"

- match:
    name: "^model\\.layers\\.([3-5][0-9])\\."
    class: ktransformers.models.modeling_deepseek.DeepseekV2YarnRotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbedding
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\.(?!self_attn).*$"
    class: torch.nn.Linear
  replace:
    class: ktransformers.operators.linear.KTransformersLinear
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
      generate_op: "KLinearMarlin"
      prefill_op: "KLinearTorch"

- match:
    name: "^model\\.layers\\.([3-5][0-9])\\.(?!self_attn).*$"
    class: torch.nn.Linear
  replace:
    class: ktransformers.operators.linear.KTransformersLinear
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"
      generate_op: "KLinearMarlin"
      prefill_op: "KLinearTorch"

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\.mlp$"
    class: ktransformers.models.modeling_deepseek.DeepseekV2MoE
  replace:
    class: ktransformers.operators.experts.KDeepseekV2MoE
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"

- match:
    name: "^model\\.layers\\.([3-5][0-9])\\.mlp$"
    class: ktransformers.models.modeling_deepseek.DeepseekV2MoE
  replace:
    class: ktransformers.operators.experts.KDeepseekV2MoE
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      prefill_device: "cuda:0"
      prefill_op: "KExpertsTorch"
      generate_device: "cuda:0"
      generate_op: "KExpertsTorch"
      out_device: "cuda:0"
  recursive: False

- match:
    name: "^model\\.layers\\.([3-5][0-9])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      prefill_device: "cuda:1"
      prefill_op: "KExpertsTorch"
      generate_device: "cuda:1"
      generate_op: "KExpertsTorch"
      out_device: "cuda:1"
  recursive: False

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\.self_attn$"
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"

- match:
    name: "^model\\.layers\\.([3-5][0-9])\\.self_attn$"
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"

- match:
    name: "^model$"
  replace:
    class: "ktransformers.operators.models.KDeepseekV2Model"
    kwargs:
      per_layer_prefill_intput_threshold: 0
      transfer_map:
        30: "cuda:1"

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\."
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"

- match:
    name: "(^model\\.layers\\.([3-5][0-9])\\.)|(model.norm)|(lm_head)"
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"
```
Node 2: Two 4090 GPUs (24GB each) and 64GB of RAM:
```yaml
- match:
    name: "^model.embed_tokens"
  replace:
    class: "default"
    kwargs:
      generate_device: "cpu"
      prefill_device: "cpu"

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\."
    class: ktransformers.models.modeling_deepseek.DeepseekV2YarnRotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbedding
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"

- match:
    name: "^model\\.layers\\.([3-5][0-9])\\."
    class: ktransformers.models.modeling_deepseek.DeepseekV2YarnRotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbedding
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\.(?!self_attn).*$"
    class: torch.nn.Linear
  replace:
    class: ktransformers.operators.linear.KTransformersLinear
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
      generate_op: "KLinearMarlin"
      prefill_op: "KLinearTorch"

- match:
    name: "^model\\.layers\\.([3-5][0-9])\\.(?!self_attn).*$"
    class: torch.nn.Linear
  replace:
    class: ktransformers.operators.linear.KTransformersLinear
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"
      generate_op: "KLinearMarlin"
      prefill_op: "KLinearTorch"

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\.mlp$"
    class: ktransformers.models.modeling_deepseek.DeepseekV2MoE
  replace:
    class: ktransformers.operators.experts.KDeepseekV2MoE
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"

- match:
    name: "^model\\.layers\\.([3-5][0-9])\\.mlp$"
    class: ktransformers.models.modeling_deepseek.DeepseekV2MoE
  replace:
    class: ktransformers.operators.experts.KDeepseekV2MoE
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      prefill_device: "cuda:0"
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"
      generate_op: "KExpertsCPU"
      out_device: "cuda:0"
  recursive: False

- match:
    name: "^model\\.layers\\.([3-5][0-9])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      prefill_device: "cuda:1"
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"
      generate_op: "KExpertsCPU"
      out_device: "cuda:1"
  recursive: False

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\.self_attn$"
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"

- match:
    name: "^model\\.layers\\.([3-5][0-9])\\.self_attn$"
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"

- match:
    name: "^model$"
  replace:
    class: "ktransformers.operators.models.KDeepseekV2Model"
    kwargs:
      per_layer_prefill_intput_threshold: 0
      transfer_map:
        30: "cuda:1"

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\."
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"

- match:
    name: "(^model\\.layers\\.([3-5][0-9])\\.)|(^model.norm)|(^lm_head)"
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"
```
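One thing worth double-checking in rules like these is exactly which layers each name regex actually hits. A tiny standalone check (plain Python, nothing ktransformers-specific; it assumes the usual `model.layers.N` naming without zero padding) makes that obvious:

```python
import re

# Which layer indices do the two range patterns above actually match?
pattern_gpu0 = re.compile(r"^model\.layers\.([0-2][0-9])\.")
pattern_gpu1 = re.compile(r"^model\.layers\.([3-5][0-9])\.")

for i in range(60):
    name = f"model.layers.{i}.mlp"
    if pattern_gpu0.match(name):
        device = "cuda:0"
    elif pattern_gpu1.match(name):
        device = "cuda:1"
    else:
        device = "UNMATCHED"  # layers 0-9 land here: the pattern requires two digits
    print(f"{name} -> {device}")
```

If the single-digit layers come back unmatched, a pattern in the style of `^model\\.layers\\.(0|[1-9]|[12][0-9])\\.` (similar to what the repo's own multi-GPU rule files use, if I remember correctly) covers them as well.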
To keep the embedding weights on the GPUs, modify the configuration for the model.embed_tokens match: instead of assigning it to the CPU, assign it to one of the GPUs, typically the first one (cuda:0). Here's how that part could be modified:
```yaml
- match:
    name: "^model.embed_tokens"
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
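```

As a rough sanity check on what that costs (the dimensions below are placeholders; read `vocab_size` and `hidden_size` from the model's config.json, for DeepseekV2 they are on the order of 102400 and 5120):

```python
# Approximate VRAM needed to keep model.embed_tokens resident on a GPU in bf16.
# vocab_size / hidden_size are placeholders; take the real values from config.json.
vocab_size, hidden_size = 102400, 5120
embed_bytes = vocab_size * hidden_size * 2  # bf16 = 2 bytes per element
print(f"model.embed_tokens ~ {embed_bytes / 2**30:.2f} GiB in bf16")
```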
Hi, I'm currently trying to run DeepSeek Coder v2 on a single node with the following setup:
At present, with the default configuration, the model fully utilizes only a single GPU with 24GB of VRAM. Alternatively, it can be split across two GPUs, but then it uses only around 12GB on each, which seems suboptimal given the available resources. Wouldn't it be more efficient if I could fully utilize more GPUs?
I can modify the configuration to allocate more layers to the GPUs, but this has been a trial-and-error process. Is there a more systematic approach or calculation method that could help guide me in allocating layers more efficiently across the available GPUs? Are there any recommended strategies for balancing the model layers on GPUs with different VRAM capacities?
Any guidance on how to better utilize GPU resources for faster inference would be greatly appreciated.
Thanks!