Closed: AlpinDale closed this issue 8 months ago
Hi! It seems that Yi comes in only two sizes, which is not sufficient to estimate the three constants in the scaling-law function (assuming the data used for the Yi models is the same across scales). Instead, you can simply use the source model's validation loss, i.e., Yi-34B's, as your target loss. As shown in our paper, using the scaling law's predicted loss versus the source model's own loss leads to only slightly different results.
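For readers who do have three or more model sizes, here is a minimal sketch of fitting such a law. It assumes a Chinchilla-style form L(N) = E + A·N^(−α) with the training data held fixed; the function name, the sizes, and the loss values are all illustrative, not the paper's fitted constants.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical scaling-law form with data held fixed: three constants E, A, alpha.
def scaling_law(n_billion, E, A, alpha):
    # n_billion: model size in billions of parameters
    return E + A * n_billion ** (-alpha)

# Illustrative (made-up) per-size validation losses. With only two model
# sizes the three constants are under-determined -- you need >= 3 points,
# which is why Yi's two sizes are not enough to fit this law.
sizes = np.array([6.0, 34.0, 70.0])       # billions of parameters
losses = np.array([1.992, 1.874, 1.840])  # synthetic losses for the demo

popt, _ = curve_fit(scaling_law, sizes, losses, p0=[1.5, 1.0, 0.5], maxfev=10000)
E, A, alpha = popt
target_loss = scaling_law(20.0, E, A, alpha)  # predicted loss at a 20B target size
```

With only two observed sizes this fit would fail (or be degenerate), which is exactly why using Yi-34B's own validation loss as the target is the practical fallback.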
Thanks for the response @xiamengzhou. Is it mentioned in your paper how to use this scaling law? Because I'm unsure what it's referring to in the first place. How do I get the validation loss for Yi-34B? Do I have to train the model myself first and note the eval loss? I'm not sure I follow.
Please refer to Appendix A in the paper for more details on how to use the scaling law to predict the reference loss for different domains. As for getting the validation loss for Yi-34B: simply use Yi-34B itself to compute the language-modeling loss (cross-entropy) on a validation dataset you have. There is no need to train your own models to get eval losses.
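To make that quantity concrete: the language-modeling loss is just the mean per-token negative log-likelihood of the target tokens under the model's next-token distribution. A toy numpy sketch (not the repo's actual evaluation script, and with made-up logits in place of real model outputs):

```python
import numpy as np

def lm_loss(logits, targets):
    """Mean cross-entropy (nats/token) of next-token logits vs. target ids."""
    # log-softmax over the vocabulary dimension, numerically stabilized
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # pick out the log-probability assigned to each target token
    nll = -log_probs[np.arange(len(targets)), targets]
    return nll.mean()

# Toy example: 4-token vocabulary, 3 positions (real logits come from the model)
logits = np.array([[2.0, 0.1, 0.1, 0.1],
                   [0.1, 3.0, 0.1, 0.1],
                   [0.1, 0.1, 0.1, 2.5]])
targets = np.array([0, 1, 3])
loss = lm_loss(logits, targets)
```

In practice you would get this from Yi-34B directly: with Hugging Face transformers, a causal LM called with `labels=input_ids` returns this same (shifted) cross-entropy as `outputs.loss`, and averaging it over your validation set gives the target loss.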
Any plans to release the code to compute the target loss for the source model?
Hi! Thanks for finally releasing the code. I've been trying to shear Yi-34B (after modifying it to be identical in architecture and tokenizer to Llama2) down to 20B. In the `pruning.sh` script, there's a `target_loss` that needs to be specified. What scaling law is it based on?