princeton-nlp / LLM-Shearing

[ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
https://arxiv.org/abs/2310.06694
MIT License

Default Initialization of Lambda Parameters to Zero #71

Open lpyhdzx opened 4 months ago

lpyhdzx commented 4 months ago

Hi! Great work! I have a question about the default value of the lambda params. I've noticed that they are initialized to zero by default:

lambda_1_layer = torch.nn.Parameter(torch.tensor(0.0, device=self.device))

The Lagrangian loss is calculated from these parameters as follows:

lagrangian_loss = lambda_1 * (expected_sparsity - target_sparsity) + lambda_2 * (expected_sparsity - target_sparsity) ** 2

Initializing lambda_1 and lambda_2 to zero seems to imply that the Lagrangian loss component starts at zero, so there is no penalty for deviating from the target sparsity.

So, is it intended for the lambda parameters to be initialized to zero, or is there another section of the code where they are set or adjusted after initialization? I appreciate any clarification or insight you can provide on this matter.

xiamengzhou commented 4 months ago

Hi @lpyhdzx, sorry for the late reply!

Even though lambda_1 and lambda_2 are initialized to 0, and the lagrangian_loss is initially 0, the lambdas will still receive gradients during backpropagation. lambda_1 will get a gradient of (expected_sparsity - target_sparsity), and lambda_2 will get a gradient of (expected_sparsity - target_sparsity) ** 2. Therefore, these variables are still learnable.
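
For example, a standalone sketch (with placeholder sparsity values, not the repo's exact code) shows the multipliers getting nonzero gradients even though the loss value starts at zero:

```python
import torch

# Standalone sketch: both multipliers start at zero, so the Lagrangian
# term contributes nothing to the loss value at step 0 ...
lambda_1 = torch.nn.Parameter(torch.tensor(0.0))
lambda_2 = torch.nn.Parameter(torch.tensor(0.0))

expected_sparsity = torch.tensor(0.3)  # placeholder; comes from the masks in practice
target_sparsity = 0.5                  # placeholder

gap = expected_sparsity - target_sparsity
lagrangian_loss = lambda_1 * gap + lambda_2 * gap ** 2
lagrangian_loss.backward()

# ... but the multipliers still receive nonzero gradients, so they are learnable:
print(lambda_1.grad)  # tensor(-0.2000) == gap
print(lambda_2.grad)  # tensor(0.0400)  == gap ** 2
```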

lpyhdzx commented 4 months ago

Thanks for the reply! I borrowed this method but found that the loss gets optimized to a negative value. I guess this is because there is no additional constraint on the Lagrangian loss, so the lambda parameters can reach negative values:

lagrangian_loss = lambda_1 * (expected_sparsity - target_sparsity) + lambda_2 * (expected_sparsity - target_sparsity) ** 2

I'm not sure if there is any way to avoid this.
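
For example, would updating the multipliers in the ascent direction, as in a min-max Lagrangian, make sense here? A rough sketch of what I mean, under my own assumptions and not taken from the LLM-Shearing code:

```python
import torch

# Sketch only (not the LLM-Shearing code): treat the Lagrangian term as a
# min-max objective. The pruning masks minimize it, while the multipliers are
# updated in the ASCENT direction (maximize=True), so the penalty term stays
# non-negative and keeps growing while the sparsity misses the target,
# instead of the multipliers being driven to values that make the loss negative.
lambda_1 = torch.nn.Parameter(torch.tensor(0.0))
lambda_2 = torch.nn.Parameter(torch.tensor(0.0))
lambda_opt = torch.optim.SGD([lambda_1, lambda_2], lr=0.1, maximize=True)

target_sparsity = 0.5

for step in range(5):
    # Placeholder: in practice expected_sparsity is computed from the masks
    # and also receives gradients through the mask parameters.
    expected_sparsity = torch.tensor(0.3)
    gap = expected_sparsity - target_sparsity

    lagrangian_loss = lambda_1 * gap + lambda_2 * gap ** 2

    lambda_opt.zero_grad()
    lagrangian_loss.backward()
    lambda_opt.step()  # ascent step on the multipliers only
    print(step, lagrangian_loss.item(), lambda_1.item(), lambda_2.item())
```

With the ascent update, the penalty term in this toy run stays non-negative and grows while the expected sparsity misses the target, rather than the multipliers drifting negative.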

Alloooshe commented 2 months ago

> Thanks for the reply! I borrowed this method but found that the loss gets optimized to a negative value. I guess this is because there is no additional constraint on the Lagrangian loss, so the lambda parameters can reach negative values: lagrangian_loss = lambda_1 * (expected_sparsity - target_sparsity) + lambda_2 * (expected_sparsity - target_sparsity) ** 2. I'm not sure if there is any way to avoid this.

Hi! I am facing a similar problem: lag_loss is negative, and I am not sure if this will improve with additional training. In my case the lambda_1 and lambda_2 parameters are taking negative values and keep decreasing. It would be great if you could share insights/advice on the matter.

thank you!