Open NouamaneTazi opened 7 months ago
Our current MoE implementation only works with tp_mode="ALL_REDUCE". We should fix the implementation so that it also works with tp_mode="REDUCE_SCATTER", which is needed to support sequence parallelism.
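
For context, here is a rough sketch of how the two collectives differ. It is a pure-Python simulation, not the library's code, and it assumes that tp_mode="ALL_REDUCE" leaves every rank with the full summed output while tp_mode="REDUCE_SCATTER" leaves each rank with only its shard along the sequence dimension. The function names and the `partials` data are hypothetical.

```python
# Pure-Python simulation of the two collectives; no real distributed backend is used.
# Each inner list stands for one tensor-parallel rank's partial output,
# indexed by sequence position (hypothetical data).

def all_reduce(per_rank):
    """Every rank receives the element-wise sum over all ranks."""
    total = [sum(vals) for vals in zip(*per_rank)]
    return [list(total) for _ in per_rank]

def reduce_scatter(per_rank):
    """Rank r receives only the r-th contiguous shard of the element-wise sum."""
    world_size = len(per_rank)
    total = [sum(vals) for vals in zip(*per_rank)]
    assert len(total) % world_size == 0, "sequence length must divide evenly"
    shard = len(total) // world_size
    return [total[r * shard:(r + 1) * shard] for r in range(world_size)]

# Two ranks, four sequence positions.
partials = [[1, 2, 3, 4], [10, 20, 30, 40]]

full = all_reduce(partials)        # each rank holds all 4 positions
shards = reduce_scatter(partials)  # each rank holds 2 of the 4 positions

# Gathering the shards back recovers the all-reduce result.
gathered = [x for s in shards for x in s]
```

Under this assumption, any code after the reduction that expects the full sequence on every rank (for example, token routing in an MoE layer) sees only a slice in the reduce-scatter case, which is the kind of mismatch a fix would have to handle.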