Closed clarence-lee-sheng closed 1 month ago
@clarence-lee-sheng Hello, thanks for your question! The parameter counts in our codebase / paper refer to the non-embedding parameters, as mentioned in footnote 3 on page 2:
Our model sizes mentioned in this paper refer to the number of non-embedding parameters, as embedding parameters account for a disproportionately large portion in smaller models.
If you do not include the embedding parameters, the model sizes should be 1M and 60M, respectively.
Thank you, I have verified it. Indeed, after removing both the embedding and output layers, I get approximately 1 million and 60 million parameters, respectively.
I am using the formula from this article, together with the Hugging Face layer weights, to calculate the number of parameters in your models: https://adithyask.medium.com/from-7b-to-8b-parameters-understanding-weight-matrix-changes-in-llama-transformer-models-31ea7ed5fd88
I am able to get around ~1.1B parameters for the TinyLlama configuration, and 8.03B and 70B for the Llama 3.1 models, which verifies the validity of my code. However, when using your 1M and 60M configurations, I am getting around 30 million parameters (for your 1M config) and 136 million (for your 60M config). I have also run your configurations through Megatron-LM, and its parameter counts agree with my calculations. May I verify that the stated sizes of the 1M and 60M models are correct?
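For what it's worth, here is a minimal sketch of how a non-embedding parameter count can be computed from a LLaMA-style config (no biases, SwiGLU MLP, GQA-aware attention). The function name and the example numbers are my own for illustration — the TinyLlama-1.1B values below are used only as a sanity check, not the repo's actual 1M/60M configs:

```python
def non_embedding_params(hidden, layers, heads, kv_heads, intermediate):
    """Count non-embedding parameters of a LLaMA-style transformer."""
    head_dim = hidden // heads
    # attention projections: Q, O are hidden x hidden; K, V shrink with GQA
    attn = hidden * hidden                       # Q
    attn += 2 * hidden * (kv_heads * head_dim)   # K, V
    attn += hidden * hidden                      # O
    # SwiGLU MLP: gate, up, down projections
    mlp = 3 * hidden * intermediate
    # two RMSNorm weight vectors per layer
    norms = 2 * hidden
    per_layer = attn + mlp + norms
    return layers * per_layer + hidden           # + final RMSNorm

# Sanity check with TinyLlama-1.1B's published config:
# hidden=2048, layers=22, heads=32, kv_heads=4, intermediate=5632, vocab=32000
core = non_embedding_params(2048, 22, 32, 4, 5632)
embeds = 2 * 32000 * 2048  # input embedding + untied LM head
print(core)           # ~0.97B non-embedding parameters
print(core + embeds)  # ~1.1B total, matching the figure above
```

Under these assumptions the embedding and output layers contribute about 131M parameters at a 32k vocabulary, which is why they dominate at the 1M scale.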