Open Heihaierr opened 8 months ago
That is a good point. I think you are right. Can you please open a pull request on this? Thanks.
BTW, I am also wondering if the capacity calculation in GShardGate is wrong. @zms1999
Hi, guys!
Thanks for your fantastic work.
I ran into a problem when using the SwitchGate class. Could you take a look at it for me?
Here is my code:
import torch
from fmoe.gates import *
device = torch.device("cuda:0")
sg = SwitchGate(d_model=64, num_expert=5, world_size=2)
sg = sg.to(device)
input = torch.rand(128, 64) # (batch_size, d_model)
input = input.to(device)
idx, val = sg(input)
print(idx, idx.shape)
print(val, val.shape)
The world_size parameter can only be set to 1; with any other value the script crashes with "Segmentation fault (core dumped)".
@Peg-Wu As you are not using torch distributed, world_size has to be 1.
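One way to avoid hard-coding the value is to derive it from the state of torch.distributed. The following is only a minimal sketch: the helper name resolve_world_size is hypothetical and not part of fmoe, and it only covers choosing the number, not the rest of a distributed setup.

```python
def resolve_world_size(default: int = 1) -> int:
    """Hypothetical helper (not part of fmoe): pick a safe world_size.

    Returns the size of the default process group when torch.distributed
    has been initialized, and `default` (1) in every other case.
    """
    try:
        import torch.distributed as dist
    except ImportError:
        # PyTorch is not installed, so nothing distributed can be running.
        return default
    if dist.is_available() and dist.is_initialized():
        return dist.get_world_size()
    # Plain single-process run: no process group exists.
    return default


world_size = resolve_world_size()
print(world_size)
```

In the script above, the result could then be passed as SwitchGate(d_model=64, num_expert=5, world_size=resolve_world_size()), so a plain single-process run gets 1.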
Thank you for your reply.
If I want to use DDP to speed things up, how should I modify my code? Can I use PyTorch's official DDP?
Many thanks!
Describe the bug
In fmoe/gates/switch_gate.py, line 45 reads:
capacity = math.ceil(cap_rate * inp.shape[0])
Should it instead be the following?
capacity = math.ceil(cap_rate * inp.shape[0] / self.num_expert)
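To make the difference between the two formulas concrete, here is a small worked comparison. It is a sketch under assumptions: cap_rate = 1.2 is an illustrative value, the token count and expert count are taken from the SwitchGate script in this thread, and the function names are made up for the example. It does not settle which formula the library intends.

```python
import math


def capacity_current(cap_rate: float, num_tokens: int) -> int:
    # Formula as quoted from switch_gate.py in this issue.
    return math.ceil(cap_rate * num_tokens)


def capacity_proposed(cap_rate: float, num_tokens: int, num_expert: int) -> int:
    # Formula proposed in this issue: divide the budget across experts.
    return math.ceil(cap_rate * num_tokens / num_expert)


cap_rate = 1.2    # assumed value, for illustration only
num_tokens = 128  # inp.shape[0] in the script from this thread
num_expert = 5

current = capacity_current(cap_rate, num_tokens)
proposed = capacity_proposed(cap_rate, num_tokens, num_expert)

print("current formula: ", current)
print("proposed formula:", proposed)
```

If the capacity is applied per expert, which is what the question presupposes, the quoted formula gives a limit larger than the whole batch, so it could never cause a token to be dropped, while the proposed formula caps each expert at roughly cap_rate times an even share of the batch.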