microsoft / TransformerCompression

For releasing code related to compression methods for transformers, accompanying our publications

Question about `RMSNorm`'s `forward` function #131

Closed · zhaoyang-star closed this 4 months ago

zhaoyang-star commented 4 months ago

I noticed that RMSNorm's hidden_size is not changed to the sliced dim. So in the forward of RMSNorm the input tensor's shape is [bs, seq_len, sliced_hidden_size], while the variance is calculated as variance = x.pow(2).sum(-1, keepdim=True) / self.mean_dim, where self.mean_dim is equal to the original hidden_size.

I think self.mean_dim should be equal to the sliced hidden_size. Please correct me if I have misunderstood anything. Thanks.

    replace_modules(
        model_adapter.model,
        model_adapter.original_layer_norm_type,
        lambda _: RMSN(model_adapter.hidden_size),
        replace_layers=False,
    )
    logging.info("Fusing layernorm modules done")
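
For reference, here is a minimal sketch of what the RMSN forward under discussion might look like, reconstructed from the formula quoted above. The eps term and the absence of a learnable weight are assumptions for illustration; the actual module in the repo may differ in detail.

    import torch


    class RMSN(torch.nn.Module):
        """Minimal sketch of an RMSN module, following the formula quoted in the question."""

        def __init__(self, mean_dim: int, eps: float = 1e-5):
            super().__init__()
            self.mean_dim = mean_dim  # original (unsliced) hidden size
            self.eps = eps            # assumed small constant for numerical stability

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x has shape [bs, seq_len, sliced_hidden_size], but the sum of squares
            # is divided by the original hidden size (self.mean_dim), not x.shape[-1].
            variance = x.pow(2).sum(-1, keepdim=True) / self.mean_dim
            return x * torch.rsqrt(variance + self.eps)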
nailimixaM commented 4 months ago

Very interesting question - I think it makes sense to keep this as is: by discarding eigenvalues from x's PCA we are effectively setting some of x's values to zero, which is what allows us to slice them off. From this perspective, if we imagine keeping the zeros in x, then the variance of x in RMSN is calculated by summing over both the non-zero and the zero elements and dividing by the total (original) dim. Curious to know if @myshkov @jameshensman agree.

Appreciating this diligence, @zhaoyang-star!
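
A small numeric check of this argument, assuming the idealised case where the discarded PCA components are exactly zero (in practice they are only approximately zero): summing the squares of the zero-padded vector and dividing by the original dim gives the same result as summing the squares of the sliced vector and dividing by the same original dim.

    import torch

    torch.manual_seed(0)
    hidden_size, sliced_hidden_size = 8, 6  # hypothetical sizes for illustration

    # A "rotated" activation whose last components are zero after discarding PCA directions.
    x_full = torch.randn(1, 1, hidden_size)
    x_full[..., sliced_hidden_size:] = 0.0          # the components we slice away
    x_sliced = x_full[..., :sliced_hidden_size]     # what the sliced model actually sees

    # Variance over the full (zero-padded) vector, divided by the original dim ...
    var_full = x_full.pow(2).sum(-1, keepdim=True) / hidden_size
    # ... equals the sliced sum of squares divided by the same original dim.
    var_sliced = x_sliced.pow(2).sum(-1, keepdim=True) / hidden_size

    print(torch.allclose(var_full, var_sliced))  # True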

zhaoyang-star commented 4 months ago

@nailimixaM Thanks for your explanation. You are right. I also verified that the perplexity (ppl) of the sliced model increases if self.mean_dim is replaced with the sliced hidden_size.