princeton-nlp / LLM-Shearing

[ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
https://arxiv.org/abs/2310.06694
MIT License

When should we apply hidden_z? #50

Closed sbwww closed 5 months ago

sbwww commented 7 months ago

I notice that hidden_z is applied in almost every module in every layer, and I'm curious whether this could cause issues like vanishing or exploding gradients. Also, will it have a large influence on the magnitude of the last hidden state, since the same scale is multiplied in repeatedly?
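To illustrate the concern, here is a minimal sketch (not the repo's actual modules, names are illustrative) of how a per-dimension mask like hidden_z would be re-applied to every sublayer's output in every layer:

```python
import torch
import torch.nn as nn

# Minimal sketch, assuming hidden_z is a (hidden_size,) vector that rescales
# each sublayer's contribution; the same mask touches the residual stream
# once per module in every layer.
class MaskedBlock(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        self.attn_proj = nn.Linear(hidden_size, hidden_size)
        self.mlp = nn.Linear(hidden_size, hidden_size)

    def forward(self, x: torch.Tensor, hidden_z: torch.Tensor) -> torch.Tensor:
        x = x + self.attn_proj(x) * hidden_z   # mask applied after attention
        x = x + self.mlp(x) * hidden_z         # mask applied again after the MLP
        return x

hidden_size, num_layers = 16, 8
hidden_z = torch.rand(hidden_size)             # stand-in for the learned mask
blocks = [MaskedBlock(hidden_size) for _ in range(num_layers)]
h = torch.randn(2, 4, hidden_size)
for block in blocks:
    h = block(h, hidden_z)                     # re-applied in every layer
print(h.norm())
```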

xiamengzhou commented 7 months ago

Given how hidden_z is modeled (with hard concrete distributions), it essentially learns a discrete binary value of 1 (retain the dimension) or 0 (prune the dimension). Since the values strictly fall inside [0, 1], I don't think it will cause gradient explosion issues. For the same reason, it should not affect the magnitude of the last hidden state. Let me know if this is clear!
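For reference, a rough sketch of a hard concrete sample (Louizos et al., 2018), which is how masks like hidden_z are typically parameterized; the constants and names below are illustrative, not taken from this repo:

```python
import torch

def hard_concrete_sample(log_alpha: torch.Tensor,
                         beta: float = 0.83,
                         gamma: float = -0.1,
                         zeta: float = 1.1) -> torch.Tensor:
    # reparameterized sample from the stretched binary concrete distribution
    u = torch.rand_like(log_alpha).clamp(1e-6, 1 - 1e-6)
    s = torch.sigmoid((u.log() - (1 - u).log() + log_alpha) / beta)
    s_bar = s * (zeta - gamma) + gamma        # stretch to (gamma, zeta)
    return s_bar.clamp(0.0, 1.0)              # clip back into [0, 1]

log_alpha = torch.randn(16)                    # learnable logits, one per hidden dim
z = hard_concrete_sample(log_alpha)
print(z.min().item(), z.max().item())          # every entry lies in [0, 1]
```

Because the stretch interval extends slightly past 0 and 1 before clipping, the sampled mask can reach exactly 0 or 1, which is what pushes it toward a binary keep/prune decision during training.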

sbwww commented 7 months ago

> As the values strictly fall inside [0, 1]

Yes, the hidden state won't become too large, but could it become too small? I just plan to check this on the model without re-training.
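A quick sketch of that check, assuming a Hugging Face causal LM and a stand-in random mask (the model name and the mask are placeholders; in the actual pruned model hidden_z is applied inside each layer, so this only crudely compares magnitudes):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "princeton-nlp/Sheared-LLaMA-1.3B"   # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tok("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

hidden_z = torch.rand(model.config.hidden_size)    # stand-in mask with values in [0, 1]
for i, h in enumerate(out.hidden_states):
    # compare the norm of each layer's hidden state with and without the rescaling
    print(f"layer {i}: |h| = {h.norm():.1f}, |h * z| = {(h * hidden_z).norm():.1f}")
```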