microsoft / tutel

Tutel MoE: An Optimized Mixture-of-Experts Implementation
MIT License

Is simple_all_reduce also required for capacity_factor > 0 cases? #173

Closed Fragile-azalea closed 2 years ago

Fragile-azalea commented 2 years ago

My code seems to hang when the workload is unbalanced across two GPUs (i.e. scores.size(0) differs between GPUs, for example at the end of a dataset). This in turn makes the capacity computed at Line 178 differ between GPUs. Is simple_all_reduce also required for capacity_factor > 0 cases?

https://github.com/microsoft/tutel/blob/ceba363909a673203a356a71f0b1a6a9113a6845/tutel/impls/fast_dispatch.py#L177-L183
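For context, here is a minimal sketch of the failure mode; the capacity formula below is an assumption for illustration and is not copied from Tutel's implementation:

```python
# Illustrative sketch only -- the capacity formula here is assumed, not Tutel's actual code.
def local_capacity(num_local_tokens, top_k, capacity_factor, num_global_experts):
    # When capacity_factor > 0, capacity scales with the *local* token count.
    return top_k * int(capacity_factor * ((num_local_tokens + num_global_experts - 1) // num_global_experts))

# With top_k=2, capacity_factor=1.0 and 8 global experts:
#   rank 0: 1024 local tokens -> capacity 256
#   rank 1:  640 local tokens -> capacity 160   (shorter last batch)
# Different capacities produce different all_to_all buffer shapes, so the collective blocks.
```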

ghostplant commented 2 years ago

Nope, only capacity_factor <= 0 uses that, but your point is a good one. Currently, in all cases, the local batch size on each GPU is allowed to change between forward steps, but it must be identical across GPUs within each corresponding step. For your case, I think we need an enhancement to handle what you need. Thanks!

ghostplant commented 2 years ago

@Fragile-azalea BTW, for your requirement (unbalanced input tokens), a tiny all_reduce in each step is unavoidable. Local padding could also work, but it is usually slower because it enlarges the subsequent compute and all_to_all.

Since most typical training never has this requirement, we'll just add an extra flag for it, disabled by default. Is that okay for you?
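As a rough sketch of such a tiny per-step all_reduce (an assumption about how this could be done, not Tutel's actual code), a MAX reduction over a single integer is enough to keep every rank on the same token count, and therefore the same capacity:

```python
import torch
import torch.distributed as dist

def agreed_token_count(num_local_tokens: int) -> int:
    """Agree on one token count across ranks with a single tiny collective."""
    t = torch.tensor([num_local_tokens], dtype=torch.int64, device='cuda')
    dist.all_reduce(t, op=dist.ReduceOp.MAX)  # one int64 per step -- negligible cost
    return int(t.item())
```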

Fragile-azalea commented 2 years ago

> Nope, only capacity_factor <= 0 uses that, but your point is a good one. Currently, in all cases, the local batch size on each GPU is allowed to change between forward steps, but it must be identical across GPUs within each corresponding step. For your case, I think we need an enhancement to handle what you need. Thanks!

Thank you for your quick response. If I want the effect of capacity_factor = 2.0 computed against the largest input token count, could I set capacity_factor = -2.0 to achieve the expected result?

ghostplant commented 2 years ago

@Fragile-azalea Only capacity_factor = 0 guarantees that this problem cannot occur. All other capacity_factor values are always, or conditionally, derived from the local scores.size(0), so the resulting capacity may still differ across GPUs in your case.

ghostplant commented 2 years ago

Hi, the latest commit allows inequivalent input token counts to be fed to different GPUs: just explicitly pass inequivalent_tokens=True to the forward function (it is False by default). You can either always set it, or set it only when you cannot guarantee equal counts, e.g. for the last batch of an epoch.

```python
def forward(self, ..):
  ..
  y = self._moe_layer(x, inequivalent_tokens=True)
  ..
```
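For instance, a hypothetical way to enable the flag only when equal token counts cannot be guaranteed (the last_batch argument is made up for illustration):

```python
def forward(self, x, last_batch=False):
    # Only pay for the extra synchronization when the batch may be short,
    # e.g. the final (possibly truncated) batch of an epoch.
    y = self._moe_layer(x, inequivalent_tokens=last_batch)
    return y
```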
Fragile-azalea commented 2 years ago

It seems to work well now! Thank you!