sidhantls / adaptive-rank-selection-svd

Implementation of Adaptive Rank Selections for Low-Rank Approximation of Language Models

Hypernet Linear Implementation #3

Open gaosh opened 3 weeks ago

gaosh commented 3 weeks ago

I found another issue compared to the original implementation. The hypernetwork at https://github.com/sidhantls/adaptive-rank-selection-svd/blob/main/utils/adaptive_rank_selection.py#L35 should have a separate linear layer for each low-rank layer; this yields a larger rank difference across layers after learning.

sidhantls commented 3 weeks ago

@gaosh Thanks for pointing this out. Can you clarify what you mean? Currently, within the Hypernetwork, I see that there is already a linear layer after the GRU - https://github.com/sidhantls/adaptive-rank-selection-svd/blob/0963edebd85be5f23adeb556292e957d054c448d/utils/adaptive_rank_selection.py#L56. This linear layer is different for each hypernetwork. Do you mean there has to be an additional linear layer after this one?

The hypernetwork architecture in the code is: GRU -> Layer Norm -> Activation -> Linear. This aligns with Table A.1 in Appendix B. Could you clarify what is missing? That would be very helpful.

gaosh commented 3 weeks ago

Hello, I checked the code again. It seems the current implementation assigns a hypernetwork to each low-rank linear layer, as in this line: https://github.com/sidhantls/adaptive-rank-selection-svd/blob/0963edebd85be5f23adeb556292e957d054c448d/utils/adaptive_rank_selection.py#L112. In the paper, we use a single hypernetwork for all low-rank linear layers. Suppose you have L linear weight matrices; the input self.z to the hypernetwork then has shape (1, L, input_size), where L acts as the sequence length for the GRU, and there are L linear layers applied to the GRU outputs. I hope this clarifies the implementation of the hypernetwork. You can find an example in another project of mine here: https://github.com/xidongwu/AutoTrainOnce/blob/main/imgnet_models/hypernet.py#L81
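
To make the shape convention concrete, here is a minimal sketch of a single shared hypernetwork with per-layer linear heads, following the description above. The class and argument names (`SharedHypernet`, `input_size`, `max_rank`) are illustrative assumptions, not taken from either repository.

```python
import torch
import torch.nn as nn

class SharedHypernet(nn.Module):
    """One hypernetwork for all L low-rank layers: Bi-GRU -> LayerNorm -> GELU,
    followed by a separate Linear head per layer (names and sizes are assumptions)."""

    def __init__(self, num_layers: int, input_size: int = 64,
                 hidden_size: int = 64, max_rank: int = 256):
        super().__init__()
        # Learnable input z of shape (1, L, input_size); L acts as the GRU sequence length
        self.z = nn.Parameter(torch.randn(1, num_layers, input_size))
        self.gru = nn.GRU(input_size, hidden_size, batch_first=True, bidirectional=True)
        self.norm = nn.LayerNorm(2 * hidden_size)
        self.act = nn.GELU()
        # One Linear head per low-rank layer, applied to that layer's GRU output
        self.heads = nn.ModuleList(
            nn.Linear(2 * hidden_size, max_rank) for _ in range(num_layers)
        )

    def forward(self):
        hidden, _ = self.gru(self.z)             # (1, L, 2 * hidden_size)
        hidden = self.act(self.norm(hidden))
        # Per-layer rank logits; a gate/mask (e.g. sigmoid) is applied downstream
        return [head(hidden[:, i]) for i, head in enumerate(self.heads)]
```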

sidhantls commented 3 weeks ago

Ah I see, I understand now. I had misinterpreted this.

"A single hyper network for all low-rank layers": The paper defines the hypernetwork as Bi-GRU → LayerNorm→ GeLU → Linear. However, what you mean is that only the Bi-GRU → LayerNorm→ GeLU portion of the hypernetwork is implemented once for all layers. And Linear is unique for each linear layer?

Thanks for sharing this; it helps ensure the results are reproduced accurately. I'll update the repo with this implementation.

sidhantls commented 3 weeks ago

Thanks for the feedback. I updated the implementation in this branch:

  1. One GRU hypernetwork that computes hidden states for all low-rank linear layers: https://github.com/sidhantls/adaptive-rank-selection-svd/blob/fd1bd2c19c1c1289f99215f31fadbe032169acce/train_adaptive.py#L244
  2. Masks are predicted with a separate linear layer per low-rank layer (see the sketch below): https://github.com/sidhantls/adaptive-rank-selection-svd/blob/fd1bd2c19c1c1289f99215f31fadbe032169acce/utils/adaptive_rank_selection.py#L163
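
For illustration, a short usage sketch continuing the hypothetical `SharedHypernet` example above: the shared Bi-GRU runs once per step, and each low-rank layer's mask comes from its own linear head. The sizes and the sigmoid gating here are assumptions for the sketch, not the exact mechanism in the branch.

```python
# Continuing the illustrative SharedHypernet sketch above (not the repo's exact API):
# one forward pass of the shared Bi-GRU yields per-layer rank logits,
# and each low-rank layer gets its mask from its own linear head.
hypernet = SharedHypernet(num_layers=4, input_size=64, hidden_size=64, max_rank=32)

rank_logits = hypernet()                                        # list of 4 tensors, each (1, 32)
rank_masks = [torch.sigmoid(logits) for logits in rank_logits]  # soft masks over singular values
print([tuple(m.shape) for m in rank_masks])                     # [(1, 32), (1, 32), (1, 32), (1, 32)]
```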