microsoft / torchscale

Foundation Architecture for (M)LLMs
https://aka.ms/GeneralAI
MIT License

Inconsistent recurrent and parallel results for RetNet #40

Closed · YirunKCL closed this issue 1 year ago

YirunKCL commented 1 year ago

The recurrent and parallel forward results appear to be quite inconsistent for multiscale retention in the RetNet code. After debugging for a while, these three lines seem suspect:

A: `mask = mask / mask.sum(dim=-1, keepdim=True).sqrt()` (line 64 in retnet.py)
B: `kv = prev_kv * (1 - 1 / scale).view(self.num_heads, 1, 1) + kv / scale.view(self.num_heads, 1, 1)` (line 108 in multiscale_retention.py)
C: `# kv = prev_kv * decay.view(self.num_heads, 1, 1) + kv` (line 109 in multiscale_retention.py)

If I remove A and B and uncomment C, the recurrent and parallel results become identical. Could you explain why these lines are used? Thanks!
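For reference, a minimal, self-contained parity-check sketch of the two forms. It uses toy shapes and a single head rather than the actual multiscale_retention.py code, and it applies the plain line-C style update (kv = decay * kv + kᵀv) in the recurrent loop, matching the setup where A and B are removed:

```python
# Toy parity check (illustrative shapes/names, not the torchscale API):
# parallel retention vs. the recurrent form with the line-C style update.
import torch

torch.manual_seed(0)
T, d = 8, 16                          # sequence length, head dimension
decay = torch.tensor(0.9)             # per-head decay (gamma), single head here
q, k, v = (torch.randn(T, d) for _ in range(3))

# Parallel form: O = (Q K^T ⊙ D) V with D[n, m] = gamma^(n-m) for m <= n, else 0.
idx = torch.arange(T)
D = torch.tril(decay ** (idx[:, None] - idx[None, :]).float())
out_parallel = (q @ k.t() * D) @ v

# Recurrent form: per-step state update kv_n = gamma * kv_{n-1} + k_n^T v_n (line C).
kv = torch.zeros(d, d)
outs = []
for n in range(T):
    kv = decay * kv + k[n][:, None] * v[n][None, :]
    outs.append(q[n] @ kv)
out_recurrent = torch.stack(outs)

print(torch.allclose(out_parallel, out_recurrent, atol=1e-5))  # expected: True
```

With this plain update, the only difference between the two paths is floating-point summation order, so they agree to within tolerance.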

YirunKCL commented 1 year ago

I've found the problem: the LayerNorm eps is too large. If you set it to something like 1e-10 instead of 1e-5, the scale-invariance property holds; otherwise, the result changes under rescaling.

donglixp commented 1 year ago

> I've found the problem: the LayerNorm eps is too large. If you set it to something like 1e-10 instead of 1e-5, the scale-invariance property holds; otherwise, the result changes under rescaling.

@YirunKCL Yes, it's caused by the LayerNorm's eps value. We can set eps to 0 for the parity check and to 1e-5 for training.
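A small numerical illustration of that eps effect (a standalone sketch with assumed inputs, not torchscale code): LayerNorm is exactly scale-invariant only when eps is negligible relative to the variance of its input. As I read the thread, the rescaling in lines A/B changes the scale of the retention output, and the output norm only cancels that rescaling exactly when eps is effectively zero.

```python
# Standalone illustration (assumed inputs, not torchscale code): LayerNorm is
# scale-invariant only when eps is negligible relative to the input variance.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(4, 64)
scale = 1e-3                       # stand-in for a rescaled retention output

for eps in (1e-5, 0.0):
    y = F.layer_norm(x, (64,), eps=eps)
    y_scaled = F.layer_norm(x * scale, (64,), eps=eps)
    print(f"eps={eps}: max |diff| = {(y - y_scaled).abs().max().item():.3e}")
# eps=1e-5: large difference, because eps dominates the tiny variance of x*scale
# eps=0.0 : difference is ~0 (exact scale invariance, up to float rounding)
```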

YirunKCL commented 1 year ago

If we set it to 1e-5 for training, do we need to set it back to 0 at inference/evaluation time for consistency?

donglixp commented 1 year ago

> If we set it to 1e-5 for training, do we need to set it back to 0 at inference/evaluation time for consistency?

@YirunKCL The 1e-5 value is fine. It has a negligible effect on inference.
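A quick way to sanity-check that, under the assumption of roughly unit-variance activations (random data here, not a trained RetNet):

```python
# At ordinary activation scales, eps=1e-5 barely perturbs the LayerNorm output,
# so leaving it at 1e-5 for inference costs essentially nothing.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(4, 64)             # roughly unit-variance activations
y_eps = F.layer_norm(x, (64,), eps=1e-5)
y_zero = F.layer_norm(x, (64,), eps=0.0)
print((y_eps - y_zero).abs().max().item())   # on the order of 1e-5, i.e. negligible
```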