Closed by theblackcat102 10 months ago
Hi there, thanks for using LLaMA-MoE~
score_scale_factor is used for rescaling the value of hidden states after the expert networks (Section 4.1, Scale Factor, in our technical report), and should be #total experts / #selected experts.
capacity_factor is designed for the Switch gate, but it is currently not used in LLaMA-MoE. We leave it in the config file for adapting other kinds of gating strategies.

@Spico197 understood, thanks for the clarification.
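For readers following the thread, here is a minimal sketch of the rescaling described in the reply. The function names, the expert counts (8 total, 2 selected), and the choice to apply the factor to the summed output are assumptions for illustration, not the LLaMA-MoE implementation. For a weighted sum, scaling the gate scores or the combined output gives the same result.

```python
from typing import List


def score_scale_factor(num_experts: int, num_selected: int) -> float:
    """Return #total experts / #selected experts, as stated in the reply."""
    if not 0 < num_selected <= num_experts:
        raise ValueError("num_selected must be in the range 1..num_experts")
    return num_experts / num_selected


def combine_expert_outputs(
    expert_outputs: List[List[float]],
    gate_scores: List[float],
    scale: float,
) -> List[float]:
    """Sum the selected experts' outputs weighted by gate score, then rescale.

    Hypothetical helper: only the selected experts' outputs are passed in.
    """
    dim = len(expert_outputs[0])
    combined = [0.0] * dim
    for output, score in zip(expert_outputs, gate_scores):
        for i, value in enumerate(output):
            combined[i] += score * value
    # Rescale the hidden state after the expert networks.
    return [scale * value for value in combined]


# Assumed setup: 8 experts in total, 2 selected per token.
scale = score_scale_factor(num_experts=8, num_selected=2)
hidden = combine_expert_outputs(
    expert_outputs=[[1.0, 2.0], [3.0, 4.0]],
    gate_scores=[0.5, 0.25],
    scale=scale,
)
print(scale, hidden)
```

The intuition is that only a fraction of the experts contribute to each token, so the combined output is multiplied by the inverse of that fraction.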
Hi, I have a question regarding these params found in config.json: score_scale_factor, capacity_factor
Based on my understanding, the llama-3B-MoE splits the intermediate dimension into 8 parts of 1376 dimensions each, instead of the original 11008 dimensions. Hence the 1376 in the size_experts list. However, I don't quite understand how capacity_factor and score_scale_factor affect the architecture of the MoE. Are they needed during inference, or are the 1376 numbers derived from the capacity factor?
I have read the expert construction README but found no connection to setting these 2 values.
Am I missing something here?
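The arithmetic in the question can be checked with a short sketch. It assumes the intermediate dimension is split evenly across experts; the function name is hypothetical and this is not the project's expert construction code.

```python
from typing import List


def expert_sizes(intermediate_size: int, num_experts: int) -> List[int]:
    """Split the FFN intermediate dimension evenly across experts (assumed)."""
    if num_experts <= 0 or intermediate_size % num_experts != 0:
        raise ValueError("intermediate_size must divide evenly by num_experts")
    return [intermediate_size // num_experts] * num_experts


sizes = expert_sizes(intermediate_size=11008, num_experts=8)
print(sizes)  # eight entries of 1376
```

Under this assumption, the per-expert size depends only on the intermediate dimension and the number of experts; neither capacity_factor nor score_scale_factor enters the calculation.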