sdmorrey opened this issue 4 months ago

I found this project being discussed on the LocalLLaMA subreddit. I read the paper, but I had some questions.

One of the questions that came up, and that is still gnawing at me: why Transformer++ as your basis of comparison? That model is basically from the Stone Age at this point.

Have you performed any comparisons with more recent SOTA models or against the frontier models?

Thanks!

We chose Transformer++ as the base architecture for our language model because it serves as the foundation for many modern state-of-the-art models, such as LLaMA 2/3, Mistral, Qwen, and Yi. These models build on the Transformer++ recipe with only minor modifications, which demonstrates its effectiveness and versatility. Moreover, recent research on linear transformers, such as Mamba and GLA, has used Transformer++ as the baseline for comparison, which further highlights its significance and relevance in natural language processing. (A minimal sketch of what that recipe typically looks like is included at the end of this comment.)

The perceived underperformance of our model comes down to limited training data compared to other models. For instance, Gemma was trained on 6 trillion tokens and LLaMA 3 on an impressive 15 trillion tokens, whereas our GPU budget only allows us to train on roughly 100B tokens. Those token counts are clearly reported in the respective papers, but the training data itself is not openly available. Although we have access to the FineWeb corpus, which contains 15 trillion tokens, training on a dataset of that size remains a challenging and resource-intensive task: we estimate that renting the necessary H100 GPUs to train at that scale would cost nearly 1 million dollars (a rough breakdown is sketched just below).

We are actively seeking support and contributions from the community to help us train our model on larger datasets and further improve its performance. If you are interested in contributing compute resources, we would be immensely grateful for your support. ^_^
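To show where that "nearly 1 million dollars" figure comes from, here is a rough back-of-the-envelope calculation. Every number in it is an assumption chosen for illustration (a 7B-parameter model, the common 6·N·D FLOPs-per-token rule of thumb, an assumed sustained H100 throughput, and an assumed rental price), not a measured figure from our runs:

```python
# Back-of-the-envelope for the "~$1M to train on 15T tokens" estimate.
# All constants below are illustrative assumptions, not measured numbers.
params = 7e9                    # assumed model size N (a LLaMA-7B-scale model)
tokens = 15e12                  # training tokens D (FineWeb-scale)
flops = 6 * params * tokens     # common ~6*N*D training-FLOPs rule of thumb
h100_flops_per_sec = 400e12     # assumed sustained BF16 throughput per H100 (~40% MFU)
price_per_gpu_hour = 2.50       # assumed H100 rental price in USD

gpu_hours = flops / h100_flops_per_sec / 3600
print(f"GPU-hours: {gpu_hours:,.0f}")                             # ~437,500
print(f"Estimated cost: ${gpu_hours * price_per_gpu_hour:,.0f}")  # ~$1,093,750
```

Under these particular assumptions the estimate lands around $1.1M, and it scales roughly linearly with model size, token count, and price per GPU-hour.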
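And for readers who haven't run into the term: "Transformer++" usually refers not to a new architecture but to the modern LLaMA/PaLM-style recipe (pre-norm RMSNorm, rotary position embeddings, a SwiGLU MLP, no linear biases). The PyTorch sketch below is a minimal illustration of one such block; the dimensions, module names, and RoPE pairing convention are chosen for the example and are not taken from the paper or our codebase:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """RMS normalization: scale by root-mean-square, no mean-centering, no bias."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)


def apply_rope(x, base=10000.0):
    """Rotary position embeddings on (batch, heads, seq, head_dim); interleaved-pair convention."""
    _, _, t, d = x.shape
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, device=x.device, dtype=torch.float32) / d))
    angles = torch.arange(t, device=x.device, dtype=torch.float32)[:, None] * inv_freq  # (t, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out


class TransformerPPBlock(nn.Module):
    """One pre-norm block: RMSNorm -> RoPE causal attention, then RMSNorm -> SwiGLU MLP."""
    def __init__(self, dim=512, n_heads=8):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.attn_norm, self.mlp_norm = RMSNorm(dim), RMSNorm(dim)
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)   # no linear biases anywhere
        self.proj = nn.Linear(dim, dim, bias=False)
        hidden = 4 * dim * 2 // 3                        # SwiGLU sized to keep MLP params comparable
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(self.attn_norm(x)).chunk(3, dim=-1)
        q, k, v = (z.reshape(b, t, self.n_heads, self.head_dim).transpose(1, 2) for z in (q, k, v))
        q, k = apply_rope(q), apply_rope(k)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.proj(attn.transpose(1, 2).reshape(b, t, d))
        h = self.mlp_norm(x)
        x = x + self.down(F.silu(self.gate(h)) * self.up(h))   # SwiGLU
        return x


if __name__ == "__main__":
    block = TransformerPPBlock()
    print(block(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 16, 512])
```

The demo at the bottom only checks shapes; a full model stacks many such blocks with token embeddings and an LM head on top.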