Open loki-r opened 3 months ago
Hi, I think this is a bug due to the HGRN API modifications. The sigmoid should be applied to $g_t$ for better performance, but it is currently applied to $h_t$, and our pre-trained models also still use $\sigma(h_t)$... We will fix this in the arXiv version soon, and we believe that applying it to $g_t$ would give better performance than our current version.
The equations in the paper and the code don't match for the last equation.
The paper's figure shows the last output equation as

$o_t' = \mathrm{RMSNorm}(h_t) * \sigma(g_t)$

but based on the current code, what is actually executed is

$o_t' = \mathrm{RMSNorm}(g_t) * \sigma(h_t)$
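To make the difference concrete, here is a minimal PyTorch sketch of the two variants. The `rms_norm` helper and the tensor shapes are illustrative only, and the sketch uses the plain $\sigma$ form written above rather than the swish gate ($o \cdot \sigma(o)$) that the fused kernel actually applies:

```python
import torch

def rms_norm(x, eps=1e-6):
    # Simplified RMSNorm without a learnable weight, for illustration only.
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

g = torch.randn(2, 16, 64)  # g_t: the output-gate branch (g_proj of the hidden states)
h = torch.randn(2, 16, 64)  # h_t: the recurrence output

# Equation as shown in the paper's figure: the gate sigma(.) is taken on g_t.
o_paper = rms_norm(h) * torch.sigmoid(g)

# What the current code path effectively does: sigma(.) lands on h_t.
o_code = rms_norm(g) * torch.sigmoid(h)
```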
It seems this has been fixed in a recent commit to the HGRN implementation in the flash-linear-attention repository.
Existing code path in the current repository:

1. `g_norm(self.g_proj(hidden_states), rearrange(o, 'b h l d -> b l (h d)'))`: `g_norm` is called with $g_t$ first and $h_t$ second.
2. `FusedRMSNormSwishGate.forward(self, x, o, ...)`: here `x` is $g_t$ and `o` is $h_t$.
3. `_layer_norm_fwd_1pass_kernel` computes `y = y * o * tl.sigmoid(o)`, so the sigmoid is applied to `o`, which is $h_t$, instead of $g_t$.
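Extending the sketch above with the swish gate the kernel actually uses, a plain-PyTorch reference of this path would look roughly as follows (the helper name `swish_gate_rmsnorm` is made up here, and the kernel's learnable weight is omitted). Swapping the two arguments at the call site would recover the paper's ordering, up to the extra swish factor:

```python
import torch

def swish_gate_rmsnorm(x, o, eps=1e-6):
    # Rough equivalent of the fused path: RMSNorm(x) gated by swish(o) = o * sigmoid(o).
    y = x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return y * o * torch.sigmoid(o)

g = torch.randn(2, 16, 64)  # g_t = self.g_proj(hidden_states)
h = torch.randn(2, 16, 64)  # h_t = the rearranged recurrence output o

# Current argument order: the sigmoid/swish is applied to h_t.
out_current = swish_gate_rmsnorm(g, h)

# Argument order matching the paper's equation: the gate is applied to g_t.
out_fixed = swish_gate_rmsnorm(h, g)
```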
Are the reported results obtained with the inverted equation or with the fixed equation?