Open fyang064 opened 4 months ago
Hi. Your question covers two kinds of cost: "model-required cost" and "switching cost".
"Model-required cost" is the baseline cost needed to compute the model itself, regardless of whether you switched from another parallel configuration. This cost can usually be estimated as O(|capacity_factor| x |topK| x |model dim settings ..|). So when you change capacity_factor or topK, the model-required cost changes as well.
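As a rough illustration of the estimate above, here is a minimal sketch. It is not Tutel's actual cost model: the function name `model_required_cost`, its parameters, and the assumption of a two-matmul expert FFN are all hypothetical choices made for this example. It only shows how the estimate scales with capacity_factor and topK.

```python
def model_required_cost(capacity_factor: float, top_k: int, tokens: int,
                        model_dim: int, hidden_dim: int) -> float:
    """Proportional FLOP estimate for one MoE FFN layer (illustrative only)."""
    # Assumption: each token is routed to top_k experts, and capacity_factor
    # scales the per-expert buffer that is actually computed.
    dispatched_slots = capacity_factor * top_k * tokens
    # Assumption: each expert is a two-matmul FFN
    # (model_dim -> hidden_dim -> model_dim), 2 FLOPs per multiply-add.
    flops_per_slot = 2 * 2 * model_dim * hidden_dim
    return dispatched_slots * flops_per_slot


base = model_required_cost(1.0, 2, tokens=4096, model_dim=1024, hidden_dim=4096)
doubled_k = model_required_cost(1.0, 4, tokens=4096, model_dim=1024, hidden_dim=4096)
print(doubled_k / base)  # ratio of the two estimates
```

Under these assumptions the estimate is linear in both capacity_factor and topK, so doubling either one doubles the result.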
"Switching cost" is the extra cost incurred when switching from one parallel configuration to another (e.g. tensor migration cost / checkpointing cost / program exchange cost / ..).
TUTEL ensures that the "switching cost" is always zero for any configuration change (apart from warmup steps, which run only once), while the "model-required cost" stays at whatever the model requires. For example, if you change the tensor parallel method or the overlap granularity, the "model-required cost" stays the same. If you double the top-k sparsity, the "model-required cost" doubles as well.
I'm wondering about the cost of the features mentioned in the TUTEL paper. It looks like the dynamic features, including top-anything and the dynamic capacity factor, introduce additional overhead. Do you have any analysis of this, especially the extra memory or computation cost and the corresponding model performance benefit?