Closed Fragile-azalea closed 2 years ago
Nope, only capacity_factor <= 0
would use that, but your point is a good one. At the moment, the local batch size on each GPU is allowed to change between forward steps, but within any given step it must be identical across all GPUs. For your case, I think we need an enhancement to handle what you need. Thanks!
@Fragile-azalea BTW, for your requirement (unbalanced input tokens), a tiny all_reduce in each step is unavoidable; local padding would also work, but it is usually slower for the compute and all_to_all that follow.
Since most usual training never has this requirement, we'll just add an extra flag for it, disabled by default. Is that okay for you?
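To illustrate the point about the per-step collective, here is a minimal sketch (not Tutel's actual code; the function name and numbers are hypothetical) of how one small all_reduce with a MAX op lets every rank derive the same capacity even when their local token counts differ. The collective is simulated here with a plain `max()` over a list standing in for the per-rank values:

```python
# Illustration only: when per-GPU token counts differ, a tiny per-step
# collective can make every rank agree on one shared capacity value.
# The all_reduce(MAX) across ranks is simulated with max() over a list
# of hypothetical local token counts.

def agree_on_capacity(local_token_counts, num_experts, capacity_factor):
    # Each rank would call all_reduce(op=MAX) on its own token count;
    # the result is identical everywhere, so every rank computes the
    # same capacity and the following all_to_all shapes match.
    global_max_tokens = max(local_token_counts)  # stands in for all_reduce(MAX)
    capacity = int(capacity_factor * global_max_tokens / num_experts)
    return capacity

# Two ranks see full batches of 512 tokens, one sees a short last batch of 130:
print(agree_on_capacity([512, 512, 130], num_experts=8, capacity_factor=2.0))  # → 128
```

With real distributed training, the `max()` would be a `torch.distributed.all_reduce` on a one-element tensor, which is why the maintainer calls it a "tiny" but unavoidable cost per step.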
Thank you for your quick response. If I want the effect of capacity_factor = 2.0 computed from the largest input tokens, could I set capacity_factor = -2.0 to achieve the expected result?
@Fragile-azalea Only capacity_factor = 0 can guarantee that this problem doesn't exist. All other capacity_factor values are always or conditionally related to the local scores.size(0), so the resulting capacity may differ across GPUs in your case.
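To make the failure mode concrete, here is a hypothetical sketch (not Tutel's actual formula; the function and values are illustrative assumptions) of why a capacity derived from the local token count diverges across ranks. If two GPUs receive different batch sizes, they compute different capacities, and the subsequent all_to_all exchanges mismatched buffer shapes and hangs:

```python
# Hypothetical illustration: a positive capacity_factor ties capacity to the
# *local* number of tokens (scores.size(0)), which is rank-dependent when
# batches are unbalanced.

def local_capacity(num_local_tokens, num_experts, capacity_factor):
    # Capacity grows with the local token count, so ranks with different
    # batch sizes disagree on it.
    return int(capacity_factor * num_local_tokens / num_experts)

cap_gpu0 = local_capacity(512, num_experts=8, capacity_factor=2.0)  # full batch
cap_gpu1 = local_capacity(130, num_experts=8, capacity_factor=2.0)  # short last batch
print(cap_gpu0, cap_gpu1)  # → 128 32, mismatched shapes for the all_to_all
```

This is why only capacity_factor = 0 sidesteps the issue entirely, and why any other setting needs the ranks to reconcile their counts first.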
Hi, the latest commit allows inequivalent input tokens to be fed to different GPUs, simply by explicitly passing inequivalent_tokens=True
to the forward function (it is False by default). You can either always pass it, or pass it only when you cannot guarantee equal sizes, e.g. for the last batch of an epoch:
```python
def forward(self, ..):
    ..
    y = self._moe_layer(x, inequivalent_tokens=True)
    ..
```
It seems to work well now! Thank you!
My code seems to hang when the workload is unbalanced across two different GPUs (i.e. scores.size(0) is unequal on different GPUs, e.g. at the end of a dataset). This further leads to unequal capacity at Line 178 on different GPUs. Is simple_all_reduce also required for the capacity_factor > 0 cases?
https://github.com/microsoft/tutel/blob/ceba363909a673203a356a71f0b1a6a9113a6845/tutel/impls/fast_dispatch.py#L177-L183