microsoft / tutel

Tutel MoE: An Optimized Mixture-of-Experts Implementation

How to use Megablocks in MoE training #236

Open CSCYQJ opened 2 months ago

CSCYQJ commented 2 months ago

I noticed the Tutel v0.3 release note "Add Megablocks solution to improve decoder inference on single-GPU with num_local_expert >= 2", but when I use Megablocks in MoE training (dropless-MoE), the following error occurred (error screenshot attached). The reason appears to be that torch.ops.tutel_ops.sparse_bmm_infer doesn't support the backward operation.

ghostplant commented 2 months ago

Megablocks is disabled in training mode because the optimization isn't useful for models with a single expert per GPU, which is typical of huge-scale training. So in training mode, please set megablocks_size=0 (i.e. whenever self.training is true).

Megablocks makes two assumptions: (1) there must be more than one local expert per GPU; (2) the load across local experts must be imbalanced. Unless you intentionally train an imbalanced model by disabling the balance loss, Megablocks won't help training performance.
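A minimal sketch of that suggestion, assuming megablocks_size is accepted as a keyword argument on the MoE layer's forward call (as the reply above implies; the exact parameter placement and layer options may differ across Tutel versions, so check your version's API):

```python
import torch
import torch.nn.functional as F
from tutel import moe as tutel_moe

class MoEBlock(torch.nn.Module):
    """Wraps a Tutel MoE layer and disables Megablocks during training."""

    def __init__(self, model_dim=1024, num_local_experts=2, megablocks_size=2):
        super().__init__()
        self.megablocks_size = megablocks_size
        self.moe = tutel_moe.moe_layer(
            gate_type={'type': 'top', 'k': 2},
            model_dim=model_dim,
            experts={'type': 'ffn',
                     'count_per_node': num_local_experts,
                     'hidden_size_per_expert': 4 * model_dim,
                     'activation_fn': lambda x: F.relu(x)},
        )

    def forward(self, x):
        # Megablocks only helps single-GPU decoding with >= 2 imbalanced
        # local experts, so pass megablocks_size=0 whenever self.training.
        # (Assumes this Tutel version takes megablocks_size in forward().)
        size = 0 if self.training else self.megablocks_size
        return self.moe(x, megablocks_size=size)
```

With this gating, training always runs the standard (backward-capable) path, and Megablocks is only used at inference time where its kernels apply.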