This PR:

- Adds a new op, `GGML_MULTI_ADD`, used to sum the contributions of the selected experts. This yields, e.g., a 7% improvement in token generation speed for Granite-1B-MoE on CUDA (RTX-4080). A sketch of the reduction such an op performs follows this list.
- Fixes a massive inefficiency in the Metal implementation of MoE matrix multiplications (`kernel_mul_mm_id`). This leads to a nearly 6-fold prompt processing speedup for Granite-1B-MoE on Metal. Even for a much larger model such as Mixtral-8x7B, the speedup is nearly a factor of 2 compared to current mainline `llama.cpp` (build `8f275a7c (3989)`).
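For context, here is a minimal plain-C sketch of the reduction a multi-add op performs: instead of chaining `n_expert - 1` binary add nodes (one per selected expert), all expert outputs are summed in a single pass. The function name, signature, and contiguous f32 layout are illustrative assumptions, not the PR's actual implementation.

```c
#include <stdio.h>
#include <stddef.h>

// Sketch only: sums n_expert rows of expert outputs into one output row,
// assuming the rows are laid out contiguously in memory (an assumption
// made for illustration; the real op works on ggml tensors).
static void multi_add_row_f32(
        const float * src,      // n_expert contiguous rows of expert outputs
        float       * dst,      // one output row of ne0 elements
        int           n_expert, // number of selected experts
        size_t        ne0) {    // elements per row
    for (size_t i = 0; i < ne0; ++i) {
        float sum = 0.0f;
        for (int e = 0; e < n_expert; ++e) {
            sum += src[(size_t)e*ne0 + i];
        }
        dst[i] = sum;
    }
}

int main(void) {
    // Two "experts", rows of 4 elements each.
    float src[2*4] = {1, 2, 3, 4, 10, 20, 30, 40};
    float dst[4];
    multi_add_row_f32(src, dst, 2, 4);
    for (int i = 0; i < 4; ++i) printf("%g ", dst[i]); // prints: 11 22 33 44
    printf("\n");
    return 0;
}
```

A single op like this touches each source element once and writes the destination once, whereas a chain of binary adds re-reads and re-writes the accumulator for every expert, which is where the token-generation savings come from.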