sail-sg / regmix

🧬 RegMix: Data Mixture as Regression for Language Model Pre-training

Parameters and sizes of models mismatch #7

Closed · clarence-lee-sheng closed this issue 1 month ago

clarence-lee-sheng commented 1 month ago

I am calculating the number of parameters of your models using the formula from this article, together with the Hugging Face layer weights: https://adithyask.medium.com/from-7b-to-8b-parameters-understanding-weight-matrix-changes-in-llama-transformer-models-31ea7ed5fd88

To verify my code, I computed the parameter counts for the TinyLlama configuration (~1.1B), Llama 3.1 8B (~8.03B), and Llama 3.1 70B, and all of them match the expected sizes. However, when I use your 1M and 60M configurations, I get around 30 million parameters (for the 1M config) and 136 million parameters (for the 60M config). I have also run your configurations in Megatron-LM and get parameter counts similar to my calculations. May I verify that the stated sizes of the 1M and 60M models are correct?
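
For reference, here is a minimal sketch of the kind of calculation I am doing for a LLaMA-style decoder. The config values below are only illustrative (roughly TinyLlama-like), not the actual RegMix configs:

```python
# Rough parameter count for a LLaMA-style decoder, split into
# non-embedding parameters and embedding/output parameters.
def llama_param_count(vocab_size, hidden, intermediate, n_layers, n_heads, n_kv_heads=None):
    n_kv_heads = n_kv_heads or n_heads
    head_dim = hidden // n_heads

    # Attention: Q and O are hidden x hidden; K and V are hidden x (n_kv_heads * head_dim)
    attn = 2 * hidden * hidden + 2 * hidden * (n_kv_heads * head_dim)
    # SwiGLU MLP: gate and up projections (hidden x intermediate) plus down (intermediate x hidden)
    mlp = 3 * hidden * intermediate
    # Two RMSNorm weight vectors per layer
    norms = 2 * hidden

    per_layer = attn + mlp + norms
    non_embedding = n_layers * per_layer + hidden   # plus the final RMSNorm
    embedding = vocab_size * hidden                 # input embeddings
    lm_head = vocab_size * hidden                   # output projection (if untied)
    return non_embedding, embedding + lm_head

# Illustrative values only (approximately the TinyLlama 1.1B config)
non_emb, emb = llama_param_count(vocab_size=32000, hidden=2048,
                                 intermediate=5632, n_layers=22,
                                 n_heads=32, n_kv_heads=4)
print(f"non-embedding: {non_emb/1e9:.2f}B, embedding+head: {emb/1e9:.2f}B")
```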

SivilTaram commented 1 month ago

@clarence-lee-sheng Hello, thanks for your question! The parameter counts in our codebase / paper refer to the non-embedding parameters, as mentioned in footnote 3 on page 2:

Our model sizes mentioned in this paper refer to the number of non-embedding parameters, as embedding parameters account for a disproportionately large portion in smaller models.

If you do not include the embedding parameters, the model sizes should be 1M and 60M, respectively.
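
For example, one quick way to check this is to count parameters while excluding the embedding and output layers. A minimal sketch, assuming the checkpoint follows the standard Hugging Face LLaMA-style module names (`embed_tokens`, `lm_head`); the path is a placeholder:

```python
# Count non-embedding parameters of a Hugging Face causal LM.
# Assumes LLaMA-style module names ("embed_tokens", "lm_head");
# adjust the name filter if your model uses different names or ties the embeddings.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("path/to/config-or-checkpoint")

total = sum(p.numel() for p in model.parameters())
embedding = sum(p.numel() for name, p in model.named_parameters()
                if "embed_tokens" in name or "lm_head" in name)

print(f"total:         {total/1e6:.1f}M")
print(f"embedding:     {embedding/1e6:.1f}M")
print(f"non-embedding: {(total - embedding)/1e6:.1f}M")
```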

clarence-lee-sheng commented 1 month ago

Thank you, I have verified it. Indeed, after removing both the embedding and output layers, I get approximately 1 million and 60 million parameters, respectively.