Luodian opened 2 years ago
@zeliu98
Hi, I think this happens because the default `gate_noise` value in `load_importance_loss` is `0.0`. We then end up constructing `normal = Normal(0, 0.0)`, which is odd (why would we want a normal distribution with zero variance?), and it raises:

```
*** ValueError: Expected parameter scale (Tensor of shape ()) of distribution Normal(loc: 0.0, scale: 0.0) to satisfy the constraint GreaterThan(lower_bound=0.0), but found invalid values: 0.0
```

If I preset `gate_noise` to `1.0`, the code runs without problems, but I am not sure whether that is numerically correct:

```python
gate_type = {'type': 'top', 'k': 2, 'fp32_gate': False, 'gate_noise': 1.0}
```
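For reference, the constraint failure is reproducible with plain PyTorch alone (a minimal sketch, assuming a recent PyTorch where distribution argument validation is on by default):

```python
from torch.distributions.normal import Normal

# scale must satisfy GreaterThan(lower_bound=0.0), so zero noise is rejected
Normal(0.0, 0.0)  # raises the ValueError above
Normal(0.0, 1.0)  # fine: this is what a positive gate_noise produces
```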
Hi @Luodian, yes, you need to set `gate_noise > 0` for `load_importance_loss`. You can find the reasoning in Appendix A (Load-Balancing Loss) of the original paper (https://arxiv.org/pdf/1701.06538.pdf).
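For intuition: the load term in that appendix smooths the discrete top-k routing by asking how likely each expert would still be selected if the gate noise were resampled. That probability is a standard normal CDF whose argument is divided by the noise stddev, which is why `gate_noise = 0` is undefined. A rough sketch of the idea, with illustrative names rather than Tutel's internals, and simplified to use the (k+1)-th largest noisy logit as the threshold:

```python
import torch
from torch.distributions.normal import Normal

def load_estimate(clean_logits, noisy_logits, noise_std, k):
    # threshold: smallest value still inside the top-(k+1), shape (tokens, 1)
    threshold = noisy_logits.topk(k + 1, dim=-1).values[..., -1:]
    # P(expert i stays in the top-k under fresh noise) = Phi((clean_i - threshold) / std);
    # this division is where gate_noise must be strictly positive.
    p = Normal(0.0, 1.0).cdf((clean_logits - threshold) / noise_std)
    # expected number of tokens routed to each expert
    return p.sum(dim=0)
```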
@zeliu98 We should add an assertion with an explanatory message to avoid opaque errors like this. And thanks for the information, @Luodian!
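Something along these lines would surface the root cause immediately (a sketch only; the variable names and check location are illustrative, not the actual commit):

```python
# hypothetical guard where the gate/loss is configured
if loss_name == 'load_importance_loss':
    assert gate_noise > 0, (
        "load_importance_loss requires gate_noise > 0, since the load estimate "
        "divides by the noise stddev (see Appendix A of arXiv:1701.06538)."
    )
```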
Yep, and I also found an issue when using the cosine projector. It seems that on line 31 of `cosine_top.py` there should be a `.cuda()` or `.to(device)` call to keep the tensors on the same device:

```python
logit_scale = torch.clamp(self.temperature, max=torch.log(torch.tensor(1. / 0.01)).cuda()).exp()
```
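A device-agnostic rewrite of that line avoids hard-coding `.cuda()` by creating the bound on the parameter's own device, or, since the bound is a constant, by passing a plain Python float (a sketch of either option, in the context of the same module):

```python
import math
import torch

# Option 1: put the constant on whatever device self.temperature lives on
max_scale = torch.log(torch.tensor(1.0 / 0.01, device=self.temperature.device))
logit_scale = torch.clamp(self.temperature, max=max_scale).exp()

# Option 2: a scalar bound sidesteps device placement entirely
logit_scale = torch.clamp(self.temperature, max=math.log(1.0 / 0.01)).exp()
```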
We have added the `gate_noise` assertion and the device cast in the latest commit. Thanks for pointing out these bugs.
Hi, I hit the same error when using `load_importance_loss` (the code works fine when using `gshard_loss`). Does anyone have an idea about it? The error log (from one rank/node) is below: