Yes, you can put more layers on the GPUs.
For example, to put an extra 10 layers (layers 0~9) on 'cuda:0', the YAML for the experts would look like this:
```yaml
# layers 0-9: experts stay on cuda:0
- match:
    name: "^model\\.layers\\.(0|[1-9])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts # custom MoE kernel with expert parallelism
    kwargs:
      prefill_device: "cuda:0"
      prefill_op: "KExpertsTorch"
      generate_device: "cuda:0"
      generate_op: "KExpertsTorch" # do remember to use the correct backend; KExpertsCPU can only run on the CPU
      out_device: "cuda:0"
  recursive: False # don't recursively inject submodules of this module

# layers 10-29: experts on CPU, outputs go to cuda:0
- match:
    name: "^model\\.layers\\.([12][0-9])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts # custom MoE kernel with expert parallelism
    kwargs:
      prefill_device: "cuda:0"
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"
      generate_op: "KExpertsCPU"
      out_device: "cuda:0"
  recursive: False # don't recursively inject submodules of this module

# layers 30-59: experts on CPU, outputs go to cuda:1
- match:
    name: "^model\\.layers\\.([345][0-9])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts # custom MoE kernel with expert parallelism
    kwargs:
      prefill_device: "cuda:1"
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"
      generate_op: "KExpertsCPU"
      out_device: "cuda:1"
  recursive: False # don't recursively inject submodules of this module
```
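If you want to double-check which layers each rule actually covers, a quick sanity check like the one below works. This is just an illustrative script, not part of ktransformers; it assumes a 60-layer model, which is what the regexes above imply, and the `model.layers.<i>.mlp.experts` names come straight from the rules.

```python
import re

# The three "match.name" regexes from the YAML above.
rules = {
    "experts on cuda:0":             r"^model\.layers\.(0|[1-9])\.mlp\.experts$",
    "experts on CPU, out to cuda:0": r"^model\.layers\.([12][0-9])\.mlp\.experts$",
    "experts on CPU, out to cuda:1": r"^model\.layers\.([345][0-9])\.mlp\.experts$",
}

for label, pattern in rules.items():
    hits = [i for i in range(60) if re.match(pattern, f"model.layers.{i}.mlp.experts")]
    print(f"{label}: layers {hits[0]}-{hits[-1]} ({len(hits)} layers)")
```

This should print layers 0-9, 10-29, and 30-59 respectively, so every expert module is matched by exactly one rule.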
For more details, please read my tutorial. If you have any problems with the tutorial, please feel free to ask!
When I replace the default optimization rule YAML with the multi-GPU version and start local_chat with it, everything works, but GPU utilization is now only about 50%.
I understand that this is already a cutting-edge optimization, and that this implementation tries to place the routing layers on the GPU and the experts on the CPU. But maybe some more layers can still be put on the GPUs? If part of a single expert lives on the GPU it won't hurt, and whenever that expert is "hit" there will be some improvement. A rough sizing sketch is shown below.
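If you want to try pushing more expert layers onto the GPUs, a rough back-of-envelope check like the one below can help decide how many will fit before widening the first regex. This is not ktransformers code; `BYTES_PER_EXPERT_LAYER` is a hypothetical placeholder you would measure for your own checkpoint and quantization.

```python
import torch

# Placeholder: measure the real per-layer expert weight size for your model.
BYTES_PER_EXPERT_LAYER = 7 * 2**30

for dev in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(dev)  # free/total VRAM in bytes
    fits = int(free // BYTES_PER_EXPERT_LAYER)
    print(f"cuda:{dev}: {free / 2**30:.1f} GiB free, room for roughly {fits} more expert layers")
```

Whatever the estimate says, any layer whose experts you move onto a GPU needs `generate_op: "KExpertsTorch"` (and a CUDA `generate_device`), as noted in the comment in the YAML above.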