shawntan / scattermoe

Triton-based implementation of Sparse Mixture of Experts.
Apache License 2.0

Question: Multi-node training #11

Open casper-hansen opened 5 months ago

casper-hansen commented 5 months ago

Hi @shawntan, great work on Scatter MoE. As newer models scale up in parameter count, I wanted to ask about something you put in the README: "does not include any additional multi-node training infrastructure code." What does that mean for multi-node training?

shawntan commented 5 months ago

What I meant by that is that, unlike Megatron or Megablocks, we did not include any Expert Parallelism or related infrastructure code in this repo: it's a simple implementation of MoE. The intention was for it to be used with FSDP, which is how I have been using it myself, and it should work with other parallelisation frameworks as well.
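For concreteness, a minimal sketch of the FSDP setup described above, assuming a transformer whose blocks each contain a scattermoe MLP. `MoEBlock` and `build_model` are hypothetical placeholders for your own model code, not part of scattermoe's API.

```python
# Hypothetical sketch: sharding a scattermoe-based model with PyTorch FSDP.
# MoEBlock and build_model are placeholders for user model code.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import ModuleWrapPolicy

from my_model import MoEBlock, build_model  # placeholder module


def main():
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    model = build_model().cuda()  # each MoEBlock contains a scattermoe MLP

    # Shard parameters per transformer block; each SMoE layer should still
    # fit on a single GPU once its parameters are gathered for forward/backward.
    model = FSDP(
        model,
        auto_wrap_policy=ModuleWrapPolicy({MoEBlock}),
        use_orig_params=True,
    )

    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    # ... standard training loop: forward, loss.backward(), optimizer.step()


if __name__ == "__main__":
    main()
```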

We do intend to eventually add Tensor Parallelism, but I'm kinda tied up at the moment.

One thing @yikangshen found was that, at least in the use cases we are looking at, expert parallelism wasn't very effective because of the differing tensor sizes that need to be communicated, so it isn't on our roadmap.

As for the state of scattermoe right now, it seems to work best if each SMoE layer fits on your GPU. It's mainly the two of us working on this, so it would be great to hear about other people's experiences as well.

casper-hansen commented 5 months ago

That makes sense. You would need a training framework around the actual model, into which you would plug Scatter MoE. I think it would be cool to see Scatter MoE implemented in something like pytorch/torchtune or other frameworks that do the actual training.
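To illustrate what "plugging Scatter MoE in" might look like, here is a rough sketch of a top-k-routed MoE layer built around scattermoe's fused MLP. The `scattermoe.mlp.MLP` constructor and forward signature shown here follow my reading of the repo's README example and should be verified against the current code; the router and layer names are my own placeholders.

```python
# Hypothetical sketch of a top-k-routed MoE layer using scattermoe.
# Verify the scattermoe.mlp.MLP signature against the repo's README.
import torch
import torch.nn as nn
from scattermoe.mlp import MLP


class ScatterMoELayer(nn.Module):
    def __init__(self, d_model, d_ffn, num_experts, top_k):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = MLP(
            input_size=d_model,
            hidden_size=d_ffn,
            num_experts=num_experts,
            top_k=top_k,
            activation=nn.GELU(),
        )

    def forward(self, x):
        # x: (batch, seq_len, d_model) -> flatten tokens for routing
        B, T, D = x.shape
        x_flat = x.reshape(B * T, D)
        logits = self.router(x_flat)
        weights = torch.softmax(logits, dim=-1)
        k_weights, k_idxs = torch.topk(weights, self.top_k, dim=-1)
        k_weights = k_weights / k_weights.sum(dim=-1, keepdim=True)
        # scattermoe's fused kernels group tokens by expert internally,
        # replacing the usual per-expert Python loop.
        y = self.experts(x_flat, k_weights, k_idxs)
        return y.reshape(B, T, D)
```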

shawntan commented 5 months ago

I've submitted a pull request to huggingface/nanotron on their suggestion but I've heard nothing back since.