microsoft / Tutel

Tutel MoE: An Optimized Mixture-of-Experts Implementation
MIT License

How about the cost of TUTEL features? #239

Open fyang064 opened 4 months ago

fyang064 commented 4 months ago

I'm wondering about the cost of the features described in the TUTEL paper. It looks like the dynamic features, including top-anything and the dynamic capacity factor, introduce additional overhead. Do you have any analysis of this, especially the extra memory or computation cost and the corresponding model performance benefit?

ghostplant commented 4 months ago

Hi. Your question covers two kinds of cost: "model-required cost" and "switching cost".

"Model-required cost" is the baseline cost of computing the model itself, independent of any switch from another parallel configuration. It can usually be estimated as O(|capacity_factor| x |topK| x |model dim settings ..|), so changing capacity_factor or topK changes the model-required cost accordingly.

"Switching cost" is the extra cost incurred when changing from one parallel configuration to another (e.g. tensor migration cost / checkpointing cost / program exchange cost / ..).

TUTEL is designed to keep the "switching cost" at zero for any configuration change (apart from the warmup steps, which run only once), while the "model-required cost" stays at whatever the model requests. E.g. if you change the tensor parallel method or the overlap granularity, the "model-required cost" stays the same. If you double the top-k sparsity, the "model-required cost" doubles as well.
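
As a rough illustration of the scaling described above, here is a minimal sketch of a relative cost estimate. It is not TUTEL code: the function `model_required_cost` and its parameters `tokens`, `model_dim` and `hidden_dim` are hypothetical, and it assumes each expert is a two-layer feed-forward block, with the "model dim settings" reduced to those two sizes.

```python
def model_required_cost(capacity_factor, top_k, tokens, model_dim, hidden_dim):
    """Relative expert-compute estimate (hypothetical helper, not a TUTEL API)."""
    # Assumption: the total number of expert slots across all experts grows
    # linearly with capacity_factor, top_k and the number of input tokens.
    slots = capacity_factor * top_k * tokens
    # Assumption: each slot passes through two matmuls of size
    # model_dim x hidden_dim, counted as 2 FLOPs per multiply-add.
    flops_per_slot = 2 * (2 * model_dim * hidden_dim)
    return slots * flops_per_slot


base = model_required_cost(1.0, 2, tokens=4096, model_dim=1024, hidden_dim=4096)
doubled_k = model_required_cost(1.0, 4, tokens=4096, model_dim=1024, hidden_dim=4096)

# Doubling top-k doubles the estimate; the parallel configuration does not
# appear in the formula at all, so changing it leaves the estimate unchanged.
print(doubled_k / base)
```

The absolute numbers from such an estimate mean little; only the ratios between settings are useful for comparing configurations.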