Yes, you can put more layers on the GPUs.
For example, to put an extra 10 layers (layers 0~9) on 'cuda:0', the YAML for the experts would look like this:
```yaml
# layers 0-9: experts stay on cuda:0
- match:
    name: "^model\\.layers\\.(0|[1-9])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts # custom MoE kernel with expert parallelism
    kwargs:
      prefill_device: "cuda:0"
      prefill_op: "KExpertsTorch"
      generate_device: "cuda:0"
      generate_op: "KExpertsTorch" # do remember to use the correct backend; KExpertsCPU can only run on the CPU
      out_device: "cuda:0"
  recursive: False # don't recursively inject submodules of this module

# layers 10-29: experts on CPU, outputs go to cuda:0
- match:
    name: "^model\\.layers\\.([12][0-9])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts # custom MoE kernel with expert parallelism
    kwargs:
      prefill_device: "cuda:0"
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"
      generate_op: "KExpertsCPU"
      out_device: "cuda:0"
  recursive: False # don't recursively inject submodules of this module

# layers 30-59: experts on CPU, outputs go to cuda:1
- match:
    name: "^model\\.layers\\.([345][0-9])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts # custom MoE kernel with expert parallelism
    kwargs:
      prefill_device: "cuda:1"
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"
      generate_op: "KExpertsCPU"
      out_device: "cuda:1"
  recursive: False # don't recursively inject submodules of this module
```
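If you want to double-check which layers each rule actually covers, a quick sanity check like the one below works. This is just an illustrative script, not part of ktransformers; it assumes a 60-layer model, which is what the regexes above imply, and the `model.layers.<i>.mlp.experts` names come straight from the rules.

```python
import re

# The three "match.name" regexes from the YAML above.
rules = {
    "experts on cuda:0":             r"^model\.layers\.(0|[1-9])\.mlp\.experts$",
    "experts on CPU, out to cuda:0": r"^model\.layers\.([12][0-9])\.mlp\.experts$",
    "experts on CPU, out to cuda:1": r"^model\.layers\.([345][0-9])\.mlp\.experts$",
}

for label, pattern in rules.items():
    hits = [i for i in range(60) if re.match(pattern, f"model.layers.{i}.mlp.experts")]
    print(f"{label}: layers {hits[0]}-{hits[-1]} ({len(hits)} layers)")
```

This should print layers 0-9, 10-29, and 30-59 respectively, so every expert module is matched by exactly one rule.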
For more details, please read my tutorial. If you have any problems with the tutorial, please feel free to ask!
When I replace the default optimization rule YAML with the multi-GPU version and start local_chat with it, everything works, but GPU utilization is now only about 50%.
I understand that this is already a cutting-edge optimization, and that this implementation tries to place the routing layers on the GPU and the experts on the CPU. But maybe some more layers can still be put on the GPUs? If part of a single expert lives on the GPU it won't hurt, and whenever that expert is "hit" there will be some improvement. A rough sizing sketch is shown below.
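If you want to try pushing more expert layers onto the GPUs, a rough back-of-envelope check like the one below can help decide how many will fit before widening the first regex. This is not ktransformers code; `BYTES_PER_EXPERT_LAYER` is a hypothetical placeholder you would measure for your own checkpoint and quantization.

```python
import torch

# Placeholder: measure the real per-layer expert weight size for your model.
BYTES_PER_EXPERT_LAYER = 7 * 2**30

for dev in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(dev)  # free/total VRAM in bytes
    fits = int(free // BYTES_PER_EXPERT_LAYER)
    print(f"cuda:{dev}: {free / 2**30:.1f} GiB free, room for roughly {fits} more expert layers")
```

Whatever the estimate says, any layer whose experts you move onto a GPU needs `generate_op: "KExpertsTorch"` (and a CUDA `generate_device`), as noted in the comment in the YAML above.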