ridgerchu / matmulfreellm

Implementation for MatMul-free LM.
Apache License 2.0

Discrepancy between code and paper related to HGRNBitAttention #37

Open loki-r opened 3 months ago

loki-r commented 3 months ago

The paper and the code don't match for the last output equation.

The figure in the paper gives the last output equation as $o_t^{'} = \mathrm{RMSNorm}(h_t) * \sigma(g_t)$.

But based on the current code, it looks like the execution is

$o_t^{'} = \mathrm{RMSNorm}(g_t) * \sigma(h_t)$

instead of

$o_t^{'} = \mathrm{RMSNorm}(h_t) * \sigma(g_t)$
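
To make the difference concrete, here is a minimal plain-PyTorch sketch of the two orderings (the `rms_norm` helper and the tensor shapes are illustrative, not the repository's `g_norm` module):

    import torch

    def rms_norm(x, eps=1e-6):
        # Plain RMSNorm without a learned scale, for illustration only.
        return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

    h = torch.randn(2, 8, 16)  # stand-in for the recurrent output h_t
    g = torch.randn(2, 8, 16)  # stand-in for the gate projection g_t

    # Ordering written in the paper:
    o_paper = rms_norm(h) * torch.sigmoid(g)

    # Ordering the current code executes (arguments swapped):
    o_code = rms_norm(g) * torch.sigmoid(h)

    print(torch.allclose(o_paper, o_code))  # False: the two orderings are not equivalent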

It seems this was fixed in a recent commit to HGRN in the flash-linear-attention repository:

        last_state = (recurrent_state,)
        past_key_values.update(last_state, self.layer_idx, i.shape[2])

-       o = self.g_norm(self.g_proj(hidden_states), rearrange(o, 'b h l d -> b l (h d)'))
+       o = self.g_norm(rearrange(o, 'b h l d -> b l (h d)'), self.g_proj(hidden_states))
        o = self.o_proj(o)

        return o, None, past_key_values

Existing code path in the current repository:

Are the reported results with the inverted equation or with the fixed equation?

ridgerchu commented 3 months ago

Hi, I think it is a bug due to the HGRN API modifications. The sigmoid should be applied to $g_t$ for better performance, but it is currently applied to $h_t$, and our pre-trained model is also still using $\sigma(h_t)$. We will fix this in the arXiv version soon, and we believe that applying it to $g_t$ would give better performance than our current version.
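
For anyone who wants to compare the two variants, here is a hypothetical sketch of an output gate with a flag selecting the ordering; the class name, flag, and plain RMSNorm are illustrative and not part of this repository's API:

    import torch
    import torch.nn as nn

    class OutputGate(nn.Module):
        """Illustrative output gate; `use_paper_order` is a hypothetical flag."""

        def __init__(self, hidden_size, use_paper_order=True, eps=1e-6):
            super().__init__()
            self.weight = nn.Parameter(torch.ones(hidden_size))
            self.use_paper_order = use_paper_order
            self.eps = eps

        def _rms_norm(self, x):
            return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

        def forward(self, h, g):
            if self.use_paper_order:
                # Paper: o_t' = RMSNorm(h_t) * sigmoid(g_t)
                return self._rms_norm(h) * torch.sigmoid(g)
            # Released checkpoints (per the reply above): o_t' = RMSNorm(g_t) * sigmoid(h_t)
            return self._rms_norm(g) * torch.sigmoid(h)

With `use_paper_order=False` the module reproduces the gating the released checkpoints were trained with; with `True` it matches the paper's equation.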