Closed — Muennighoff closed this 4 months ago
@Muennighoff Correct, you do not need to use EP and can rely on just FSDP; setting `moe_world_size=1` gives you this behavior.
Otherwise, foundry will take care of passing in the arguments and process groups if you pass a larger `moe_world_size`, though you then have to use a device mesh for FSDP.
From looking at the code, I can train dMoEs with FSDP only, without using expert parallelism, and have FSDP take care of sharding them, correct? IIUC, to make EP work with FSDP you use the `moe_world_size` kwarg and then a special device mesh plus some other setup. But it should be possible to scale to, say, an 8x7B model without EP, using just FSDP, no? Maybe @mvpatel2000 knows, thanks a lot!
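To make the `moe_world_size` semantics discussed above concrete, here is a minimal sketch of how ranks could be partitioned into expert-parallel (EP) groups. This is illustrative only (the function name and logic are my own, not foundry's actual implementation): with `moe_world_size=1` every group is a single rank, so no EP is used and FSDP alone shards the dMoE parameters; with a larger value, experts are spread across each group.

```python
def expert_parallel_groups(world_size: int, moe_world_size: int):
    """Illustrative sketch: partition ranks into expert-parallel groups.

    moe_world_size=1 yields singleton groups, i.e. no expert parallelism;
    FSDP alone shards the dMoE. Larger values split experts across each
    group of moe_world_size consecutive ranks.
    """
    assert world_size % moe_world_size == 0, "world size must divide evenly"
    return [
        list(range(start, start + moe_world_size))
        for start in range(0, world_size, moe_world_size)
    ]

# moe_world_size=1 -> singleton groups: plain FSDP sharding, no EP.
print(expert_parallel_groups(4, 1))  # [[0], [1], [2], [3]]
# moe_world_size=2 -> experts sharded across pairs of ranks.
print(expert_parallel_groups(4, 2))  # [[0, 1], [2, 3]]
```

In the real training setup, this grouping would correspond to one dimension of the FSDP device mesh, with FSDP sharding along the other dimension.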