How about the cost of TUTEL features?

Hi. Your question covers two kinds of cost: "model-required cost" and "switching cost".

"Model-required cost" is the trivial cost needed to compute the model regardless of switching from another parallel configuration. Usually, this cost can be estimated by O(|capacity_factor| x |topK| x |model dim settings ..|). Thus, when you change capacity_factor and topK, the model-required cost should be also changed.

"Switching cost" is the extra cost when activating the change of parallel configurations from one to anther (e.g. tensor migration cost / checkpointing cost / program exchange cost / ..).

TUTEL is designed so that the "switching cost" of any configuration change is always zero (apart from warmup steps, which run only once), while the "model-required cost" stays whatever the model itself requires. For example, if you change the tensor parallel method or the overlap granularity, the "model-required cost" stays the same. If you double the top-k sparsity, the "model-required cost" doubles as well.
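The relationship between the two costs can be sketched as a toy accounting model. Everything here is hypothetical and only for illustration: `MoEConfig`, its fields, and the `switching_cost_per_change` parameter are not TUTEL names, and the cost is in arbitrary units.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MoEConfig:
    top_k: int
    capacity_factor: float
    parallel_method: str   # hypothetical label, does not affect model cost
    overlap_degree: int    # hypothetical knob, does not affect model cost


def model_cost(cfg, base=1.0):
    # Model-required cost depends only on the model-level settings.
    return base * cfg.capacity_factor * cfg.top_k


def total_cost(schedule, switching_cost_per_change=0.0):
    # Sum per-step model cost, plus a switching charge whenever the config changes.
    total, prev = 0.0, None
    for cfg in schedule:
        if prev is not None and cfg != prev:
            total += switching_cost_per_change
        total += model_cost(cfg)
        prev = cfg
    return total


a = MoEConfig(top_k=2, capacity_factor=1.0, parallel_method="data", overlap_degree=1)
b = replace(a, parallel_method="tensor", overlap_degree=2)  # parallelism change only
c = replace(a, top_k=4)                                     # model-level change

print(model_cost(a), model_cost(b), model_cost(c))
print(total_cost([a, b, a]))
```

With the switching charge at zero, alternating between `a` and `b` costs the same as staying on `a`, while moving to `c` doubles the per-step cost because `top_k` doubled.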
