microsoft / tutel

Tutel MoE: An Optimized Mixture-of-Experts Implementation
MIT License

How to use Tutel with Megatron-DeepSpeed #207

Open wangyuxin87 opened 1 year ago

wangyuxin87 commented 1 year ago

Can Tutel be used with Megatron-DeepSpeed?

ghostplant commented 1 year ago

Do you mean Megatron and DeepSpeed separately, or both of them working together?

xcwanAndy commented 3 months ago

@ghostplant Can Tutel work with Megatron or DeepSpeed individually?

ghostplant commented 3 months ago

Yes. Tutel is just an MoE layer implementation that is pluggable into any distributed framework. For another framework to use the Tutel MoE layer, it only needs to pass the proper distributed process group, e.g.:

my_processing_group = deepspeed.new_group(..)

moe_layer = tutel_moe.moe_layer(
    ..,
    group=my_processing_group
)
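
For reference, a slightly fuller sketch of such an integration, assuming torch.distributed has already been initialized by the host framework; the gate/expert argument names follow Tutel's README example, model_dim, hidden_size, and num_local_experts are placeholder hyperparameters, and torch.distributed.new_group stands in for whichever group the host framework actually provides:

import torch
import torch.nn.functional as F
from tutel import moe as tutel_moe

# placeholder hyperparameters, for illustration only
model_dim, hidden_size, num_local_experts = 1024, 4096, 2

# stand-in for the process group created by the host framework (Megatron / DeepSpeed)
my_processing_group = torch.distributed.new_group(
    ranks=list(range(torch.distributed.get_world_size()))
)

moe_layer = tutel_moe.moe_layer(
    gate_type={'type': 'top', 'k': 2},      # top-2 gating
    model_dim=model_dim,                    # hidden size of the incoming tokens
    experts={
        'type': 'ffn',
        'count_per_node': num_local_experts,
        'hidden_size_per_expert': hidden_size,
        'activation_fn': lambda x: F.relu(x),
    },
    group=my_processing_group,              # expert-parallel group handed over by the host framework
)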

If no other framework is available, Tutel itself also provides a one-line initialization that generates the groups you need, and it works for both distributed GPU (i.e., the nccl backend) and distributed CPU (i.e., the gloo backend):

from tutel import system
parallel_env = system.init_data_model_parallel(backend='nccl' if args.device == 'cuda' else 'gloo')
# pick the group that matches your parallelism layout:
my_processing_group = parallel_env.data_group  # or parallel_env.model_group, or parallel_env.global_group
...
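
A rough end-to-end usage sketch of that path (the local_device and model_group attribute names and the moe_layer arguments follow Tutel's bundled helloworld example and may differ across versions; the tensor shapes are arbitrary):

import torch
import torch.nn.functional as F
from tutel import system, moe as tutel_moe

parallel_env = system.init_data_model_parallel(backend='nccl' if torch.cuda.is_available() else 'gloo')
device = parallel_env.local_device              # GPU (or CPU) assigned to this rank

moe_layer = tutel_moe.moe_layer(
    gate_type={'type': 'top', 'k': 2},
    model_dim=1024,
    experts={'type': 'ffn', 'count_per_node': 2, 'hidden_size_per_expert': 4096,
             'activation_fn': lambda x: F.relu(x)},
    group=parallel_env.model_group,             # or data_group / global_group, depending on the layout
).to(device)

x = torch.randn(4, 512, 1024, device=device)    # (batch, tokens, model_dim)
y = moe_layer(x)                                # tokens routed through the local experts
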
xcwanAndy commented 3 months ago

Thanks for your prompt response!