I think it is better to limit the range of cosine like:
cosine = F.linear(F.normalize(input), F.normalize(self.weight)).clamp(-1+eps,1-eps),
because in my experiment, when cosine(theta) == 1 the loss would become NaN.
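A minimal sketch of why the clamp helps, assuming the loss passes cosine through `acos` (as angular-margin losses commonly do; the thread does not show the loss itself). The value `EPS = 1e-6` and the helper `acos_grad` are made up for illustration. At cosine == 1 the derivative of `acos` is infinite, which can turn into NaN further down the graph:

```python
import torch

EPS = 1e-6  # assumed value; the comment above does not say which eps was used


def acos_grad(cosine_values, clamp):
    """Return d(acos(cosine))/d(cosine), optionally clamping cosine first."""
    cosine = torch.tensor(cosine_values, requires_grad=True)
    safe = cosine.clamp(-1 + EPS, 1 - EPS) if clamp else cosine
    torch.acos(safe).sum().backward()
    return cosine.grad


raw_grad = acos_grad([1.0, 0.5], clamp=False)      # contains a non-finite entry
clamped_grad = acos_grad([1.0, 0.5], clamp=True)   # every entry is finite

print(torch.isfinite(raw_grad).all().item())
print(torch.isfinite(clamped_grad).all().item())
```

Note that the clamp also zeroes the gradient for any entry it actually clips, so those samples stop contributing to the update for that step.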
I know this is very late, but the problem is that you have to normalize self.weight along a different dimension in order to obtain the real cosine:
cosine = F.linear(F.normalize(input), F.normalize(self.weight, dim=0))
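Which dimension is correct depends on how self.weight is laid out, so it is worth checking against `F.cosine_similarity` directly. The sketch below assumes the `(out_features, in_features)` layout that `F.linear` expects; the shapes and variable names are made up for illustration. Under that assumption, it is the default `dim=1` (one unit-length row per class) that reproduces the reference cosine, and `dim=0` would only be appropriate if the class vectors were stored as columns:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(4, 8)       # (batch, in_features)
weight = torch.randn(3, 8)  # (out_features, in_features), the layout F.linear expects

# Reference: pairwise cosine between every input row and every weight row.
reference = F.cosine_similarity(x.unsqueeze(1), weight.unsqueeze(0), dim=2)  # (4, 3)

# Normalize each weight row (default dim=1) versus each weight column (dim=0).
row_normalized = F.linear(F.normalize(x), F.normalize(weight))
col_normalized = F.linear(F.normalize(x), F.normalize(weight, dim=0))

print(torch.allclose(row_normalized, reference, atol=1e-6))
print(torch.allclose(col_normalized, reference, atol=1e-6))
```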