Open mars1248 opened 1 month ago
We are actively working on an MoE distributed training example. Maybe @alanwaketan can share more details.
Yea, will let you know once we have more information.
@alanwaketan Can you tell me a little about your thinking? I want to express expert parallelism in SPMD, and then add custom calls to solve the routing problem for variable-length token batches.
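For reference, one widely used way to avoid variable-length routing under a static-shape compiler is fixed-capacity dispatch: each expert gets a fixed number of slots, routing is expressed as one-hot masks, and tokens that overflow a full expert are dropped. The sketch below is a minimal NumPy illustration of that idea under stated assumptions (top-1 routing, one linear layer per expert). It is not the torch_xla API and not the approach the maintainers are working on; `moe_fixed_capacity` and all parameter names are hypothetical.

```python
import numpy as np

def moe_fixed_capacity(x, gate_w, expert_w, capacity):
    """Top-1 MoE layer using only static shapes (illustrative sketch).

    x:        (tokens, d_model) input tokens
    gate_w:   (d_model, n_experts) router weights
    expert_w: (n_experts, d_model, d_model) one linear layer per expert
    capacity: slots per expert; tokens beyond this are dropped (output zero)
    """
    n_experts = gate_w.shape[1]

    # Router: softmax over experts, then pick the best expert per token.
    logits = x @ gate_w
    logits = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    expert_idx = probs.argmax(axis=-1)            # (T,)
    gate = probs.max(axis=-1)                     # (T,)

    # One-hot expert choice and each token's 0-based position in its expert's queue.
    expert_mask = np.eye(n_experts)[expert_idx]   # (T, E)
    position = (np.cumsum(expert_mask, axis=0) - 1) * expert_mask
    keep = expert_mask * (position < capacity)    # zero for overflow tokens
    slot = position.sum(axis=-1).astype(int)      # (T,)

    # Dispatch tensor (T, E, C): token t goes to slot c of expert e.
    slot_onehot = np.eye(capacity)[np.minimum(slot, capacity - 1)]
    dispatch = keep[:, :, None] * slot_onehot[:, None, :]
    combine = dispatch * gate[:, None, None]

    # Every tensor below has a fixed shape regardless of how tokens are routed.
    expert_in = np.einsum("tec,td->ecd", dispatch, x)            # (E, C, D)
    expert_out = np.einsum("ecd,edf->ecf", expert_in, expert_w)  # (E, C, D)
    return np.einsum("tec,ecf->tf", combine, expert_out)         # (T, D)
```

In this formulation the leading expert axis of `expert_in` and `expert_w` is the natural candidate for sharding across devices, since each expert's computation is independent. The trade-off is padding and dropped tokens, which is what a custom call for truly variable-length routing would try to avoid.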
Does torch_xla SPMD support expert parallelism? For an MoE model, how should it be computed in XLA?
❓ Questions and Help