Questions about capacity_factor, score_scale_factor

theblackcat102 commented 10 months ago

Hi, I have a question regarding two params found in config.json: score_scale_factor and capacity_factor.

Based on my understanding, llama-3B-MoE splits the intermediate dimension into 8 parts of 1376 dimensions each, instead of the original 11008 dimensions (11008 / 8 = 1376), hence the 1376 entries in the size_experts list. However, I don't quite understand how capacity_factor and score_scale_factor affect the MoE architecture. Are they needed during inference, or are the 1376 numbers derived from the capacity factor?

I have read the expert construction README but found no connection to how these two values are set.

Am I missing something here?

Spico197 commented 10 months ago

Hi there, thanks for using LLaMA-MoE~

The score_scale_factor is used for rescaling the hidden states after the expert networks (Section 4.1 Scale Factor in our technical report). It should be set to #total experts / #selected experts.
capacity_factor is designed for the Switch gate, but it is currently not used in LLaMA-MoE. We leave it in the config file so that other kinds of gating strategies can be adapted later.
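
As a rough illustration of the rescaling described above (a toy sketch, not the actual LLaMA-MoE code), the snippet below computes the scale factor and applies it to the gated sum of the selected experts for a single token. The helper names `score_scale_factor` and `moe_combine`, the choice of 2 selected experts, the softmax over all experts, and the scalar expert outputs are all assumptions made for the example.

```python
import math


def score_scale_factor(num_experts: int, num_selected: int) -> float:
    """Hypothetical helper: #total experts / #selected experts."""
    if not 0 < num_selected <= num_experts:
        raise ValueError("num_selected must be in [1, num_experts]")
    return num_experts / num_selected


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_combine(gate_logits, expert_outputs, num_selected, scale):
    """Toy top-k combine for one token.

    Assumption: gate scores are a softmax over ALL experts, and each
    expert output is a single float to keep the example short.
    """
    probs = softmax(gate_logits)
    # Indices of the num_selected experts with the highest gate score.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:num_selected]
    mixed = sum(probs[i] * expert_outputs[i] for i in top)
    # Rescale the hidden state after the expert networks.
    return scale * mixed


# 8 experts as in the question; 2 selected experts is an assumed value.
scale = score_scale_factor(8, 2)
output = moe_combine([0.0] * 8, [1.0] * 8, num_selected=2, scale=scale)
```

With a uniform gate, the two selected experts carry only 2/8 of the total gate weight, so multiplying by 8/2 brings the combined output back to the magnitude it would have if all experts had contributed.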

theblackcat102 commented 10 months ago

@Spico197 understood, thanks for the clarification.

pjlab-sys4nlp / llama-moe

Questions about capacity_factor, score_scale_factor #52