mosaicml / llm-foundry

LLM training code for Databricks foundation models
https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
Apache License 2.0

MoE with FSDP #1197

Closed · Muennighoff closed this 4 months ago

Muennighoff commented 4 months ago

From looking at the code, I can train dMoEs with FSDP only, without using expert parallelism, and have FSDP take care of sharding them, correct? IIUC, to make EP work with FSDP you use the moe_world_size kwarg and then set up a special device mesh and some other things. But it should be possible to scale to, say, an 8x7B model without EP, just using FSDP, no? Maybe @mvpatel2000 knows, thanks a lot!

mvpatel2000 commented 4 months ago

@Muennighoff Correct, you do not need to use EP and can rely on just FSDP. moe_world_size=1 would provide this behavior.
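
For concreteness, here is a minimal, hand-rolled sketch (not foundry code) of that FSDP-only setup: a toy MoE layer whose expert weights are sharded by FSDP like any other parameters, with no expert parallelism. The TinyMoE module, its sizes, and the dense top-1 routing are made up for illustration; it assumes a distributed launch (e.g. torchrun) with GPUs.

```python
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


class TinyMoE(nn.Module):
    """Toy top-1 MoE layer standing in for a dMoE block (illustration only)."""

    def __init__(self, d_model: int = 256, num_experts: int = 8):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(num_experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Top-1 routing, computed densely for clarity (every expert sees every token,
        # then the result is masked); real dMoE kernels are far more efficient.
        choice = self.router(x).argmax(dim=-1)              # (tokens,)
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = (choice == i).unsqueeze(-1).to(x.dtype)  # (tokens, 1)
            out = out + mask * expert(x)
        return out


if __name__ == "__main__":
    # Launch with torchrun so each rank joins the process group.
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    model = TinyMoE().cuda()
    # FSDP shards *all* parameters across the data-parallel group, expert weights
    # included: the FSDP-only (moe_world_size=1) behavior discussed above.
    model = FSDP(model)
    out = model(torch.randn(16, 256, device="cuda"))
    out.sum().backward()
    dist.destroy_process_group()
```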

Otherwise, foundry will take care of passing in the arguments and process groups if you pass in a larger moe_world_size, though you have to use a device mesh for FSDP.
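
To illustrate the device-mesh requirement, here is a rough sketch of the 2-D layout implied by a larger moe_world_size: one mesh dimension for expert parallelism and one for FSDP sharding. The dimension names, sizes, and wiring below are illustrative assumptions, not foundry's actual internals; foundry/Composer construct the mesh and process groups from the config for you.

```python
# Illustrative only: a 2-D device mesh over 8 ranks, with expert parallelism on one
# axis and FSDP parameter sharding on the other. Run under torchrun with 8 ranks;
# the names and sizes here are assumptions for the example.
from torch.distributed.device_mesh import init_device_mesh

world_size = 8
moe_world_size = 2                                  # expert-parallel degree (assumed)
fsdp_shard_degree = world_size // moe_world_size    # remaining ranks shard parameters

mesh = init_device_mesh(
    "cuda",
    (moe_world_size, fsdp_shard_degree),
    mesh_dim_names=("expert_parallel", "weight_parallel"),
)

ep_group = mesh["expert_parallel"].get_group()  # process group an MoE layer would use
fsdp_mesh = mesh["weight_parallel"]             # 1-D sub-mesh FSDP shards over
```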