princeton-nlp / LLM-Shearing

[ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
https://arxiv.org/abs/2310.06694
MIT License

Scaling Law for predicted loss #13

Closed: AlpinDale closed this issue 8 months ago

AlpinDale commented 10 months ago

Hi! Thanks for finally releasing the code. I've been trying to shear Yi-34B (after modifying it to be identical in architecture and tokenizer to Llama2) down to 20B. In the pruning.sh script, there's a target_loss that needs to be specified. What scaling law is it based on?

xiamengzhou commented 10 months ago

Hi! It seems that Yi is only released in two sizes, which is not enough to estimate the three constants of the scaling-law function (assuming the data used for the Yi models is the same across scales). Instead, you can simply use the source model's validation loss, i.e., Yi-34B's, as your target loss. As shown in our paper, using the scaling law's predicted loss versus the source model's loss leads to only slightly different results.
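To make the constraint concrete, here is a minimal sketch (not the repository's code) of fitting a Chinchilla-style form L(N) = E + A / N^alpha, which has exactly three constants once the training data is held fixed across scales. The model sizes and losses below are placeholders, not measured Yi or Llama values; the point is that two (size, loss) pairs cannot identify three constants, which is why falling back to the source model's loss is the practical choice.

```python
# Hypothetical sketch: fitting the three constants (E, A, alpha) of a
# Chinchilla-style scaling law L(N) = E + A / N**alpha to (size, loss) pairs.
# All numbers below are placeholders, not measured Yi/Llama losses.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(n_billion, E, A, alpha):
    # Irreducible loss E plus a term that decays with parameter count.
    return E + A / n_billion**alpha

# At least three (size, loss) points are needed to identify three constants;
# two released model sizes are not enough.
sizes_b = np.array([7.0, 13.0, 70.0])    # parameters in billions (placeholder)
losses = np.array([1.90, 1.82, 1.70])    # validation losses (placeholder)

(E, A, alpha), _ = curve_fit(scaling_law, sizes_b, losses, p0=[1.5, 1.0, 0.5])

# Predicted loss at the target size (e.g. 20B), usable as a target loss.
print(f"predicted loss at 20B: {scaling_law(20.0, E, A, alpha):.3f}")
```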

AlpinDale commented 10 months ago

Thanks for the response @xiamengzhou. Is it explained in your paper how to use this scaling law? I'm unsure what it refers to in the first place. Also, how do I get the validation loss for Yi-34B? Do I have to train the model myself first and note the eval loss? I'm not sure I follow.

xiamengzhou commented 10 months ago

Please refer to Appendix A of the paper for more details on how to use the scaling law to predict the reference loss for different domains. To get the validation loss for Yi-34B, you can simply run Yi-34B over a validation dataset you have and compute the language modeling loss (cross entropy). There is no need to train your own models to get eval losses.
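For reference, a minimal sketch of that second route (not the repository's script), assuming a Hugging Face-style checkpoint and a plain-text validation file; the checkpoint id, file name, and sequence length are placeholders.

```python
# Hypothetical sketch: computing the source model's validation loss
# (token-level cross entropy) with Hugging Face transformers.
# The checkpoint id and validation file below are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "01-ai/Yi-34B"              # placeholder checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

total_loss, total_tokens = 0.0, 0
with torch.no_grad():
    for line in open("validation.txt"):          # placeholder validation data
        text = line.strip()
        if not text:
            continue
        enc = tokenizer(text, return_tensors="pt",
                        truncation=True, max_length=4096).to(model.device)
        # Passing labels=input_ids makes the model return the mean
        # cross-entropy over the (shifted) target tokens.
        out = model(**enc, labels=enc["input_ids"])
        n = enc["input_ids"].numel() - 1         # tokens actually scored
        total_loss += out.loss.item() * n
        total_tokens += n

print("validation loss:", total_loss / total_tokens)
```

If the pruning recipe expects per-domain reference losses, the same loop can be run separately on a validation split for each domain.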

TonyZhanghm commented 9 months ago

Any plans to release the code for computing the target loss of the source model?