I've found the problem: the layernorm eps is too large. If you set it to something like 1e-10 instead of 1e-5, the scale-invariant property holds; otherwise, the results differ after scaling.
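Below is a minimal sketch of the effect being described; the shapes, the 1e-3 activation scale, and the tolerance are illustrative assumptions rather than the actual RetNet setup. Once the (scaled) variance becomes comparable to eps, LayerNorm is no longer scale-invariant, which is what breaks the parity check.

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 8)
scale = 1e-3  # shrink the activations so their variance approaches eps

for eps in (1e-5, 1e-10):
    y1 = F.layer_norm(x, (8,), eps=eps)          # LayerNorm of the original input
    y2 = F.layer_norm(x * scale, (8,), eps=eps)  # LayerNorm of the scaled input
    # Scale invariance would mean y1 == y2; with a large eps it does not hold.
    print(eps, torch.allclose(y1, y2, atol=1e-3))
# Expected: eps=1e-5 prints False, eps=1e-10 prints True.
```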
@YirunKCL Yes, it's caused by the layernorm's eps value. We can set eps to 0 for the parity check and set it to 1e-5 for training.
If we set it to 1e-5 for training, do we need to set it back to 0 during inference/evaluation, for consistency?
@YirunKCL The 1e-5 value is fine. It has a negligible effect on inference.
The recurrent and parallel forward results seem quite inconsistent for multiscale retention in the RetNet code. After debugging for a while, these three lines look suspicious:
A (line 64 in retnet.py):

```python
mask = mask / mask.sum(dim=-1, keepdim=True).sqrt()
```

B (line 108 in multiscale_retention.py):

```python
kv = prev_kv * (1 - 1 / scale).view(self.num_heads, 1, 1) + kv / scale.view(self.num_heads, 1, 1)
```

C (line 109 in multiscale_retention.py):

```python
# kv = prev_kv * decay.view(self.num_heads, 1, 1) + kv
```

If I remove A and B and uncomment C, the recurrent and parallel results become the same. Can you give me some explanation of why these are used? Thanks!
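For reference, here is a minimal single-head sketch of the parity observed with variant C: using an unnormalized decay mask in the parallel form (i.e., without A) and the plain update from C in the recurrent form, the two forwards agree. The sequence length, head dimension, and decay value below are made-up illustrations, not the actual RetNet configuration.

```python
import torch

torch.manual_seed(0)
T, d = 5, 4       # toy sequence length and head dimension (assumptions)
decay = 0.9       # per-head decay (gamma), illustrative value
q = torch.randn(T, d)
k = torch.randn(T, d)
v = torch.randn(T, d)

# Parallel form with an *unnormalized* decay mask: D[i, j] = decay**(i - j) for j <= i, else 0.
idx = torch.arange(T)
mask = (decay ** (idx[:, None] - idx[None, :])) * (idx[:, None] >= idx[None, :])
parallel_out = (q @ k.t() * mask) @ v

# Recurrent form with the plain update from C: kv = prev_kv * decay + k_t^T v_t.
S = torch.zeros(d, d)
recurrent_out = []
for t in range(T):
    S = S * decay + k[t].unsqueeze(1) @ v[t].unsqueeze(0)
    recurrent_out.append(q[t] @ S)
recurrent_out = torch.stack(recurrent_out)

print(torch.allclose(parallel_out, recurrent_out, atol=1e-5))  # expected: True
```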