openlm-research / open_llama

OpenLLaMA, a permissively licensed open source reproduction of Meta AI’s LLaMA 7B trained on the RedPajama dataset
Apache License 2.0

The v1 and v2 models: performance doubt #70

Closed: EthanChen1234 closed this issue 1 year ago

EthanChen1234 commented 1 year ago

Dataset

The v1 models are trained on the RedPajama dataset. The v2 models are trained on a mixture of the Falcon refined-web dataset, the StarCoder dataset, and the wikipedia, arxiv, book, and stackexchange parts of the RedPajama dataset. We follow exactly the same preprocessing steps and training hyperparameters as the original LLaMA paper, including model architecture, context length, training steps, learning rate schedule, and optimizer. The only difference between our setting and the original one is the dataset used: OpenLLaMA employs open datasets rather than the one utilized by the original LLaMA.
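To make the mixture concrete, here is a minimal sketch of weighted source sampling in the spirit of the v2 mixture described above. The weights are illustrative assumptions only; the thread establishes the source list, and the maintainers state later that StarCoder puts code at roughly 30% of the mix.

```python
import random

# Sketch of the v2 data mixture described above. Only the source list comes
# from this thread; the weights below are illustrative assumptions, except
# StarCoder at ~30%, which the maintainers state later in the thread.
V2_MIXTURE = {
    "falcon-refinedweb": 0.55,          # assumed share
    "starcoder": 0.30,                  # ~30% code, per the maintainers
    "redpajama/wikipedia": 0.05,        # assumed share
    "redpajama/arxiv": 0.04,            # assumed share
    "redpajama/book": 0.03,             # assumed share
    "redpajama/stackexchange": 0.03,    # assumed share
}

def sample_source(rng: random.Random) -> str:
    """Pick the source of the next training document, weighted by mixture share."""
    names, weights = zip(*V2_MIXTURE.items())
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
print([sample_source(rng) for _ in range(5)])
```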

Evaluation

The OpenLLaMA 7Bv2 average score is 0.56, while the OpenLLaMA 7B average score is 0.55.
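For reference, such averages are typically computed with EleutherAI's lm-evaluation-harness. Below is a minimal sketch assuming the v0.4+ API; the exact task list behind the 0.55/0.56 averages is not given in this thread, so the tasks here are placeholders.

```python
# Minimal sketch, assuming lm-evaluation-harness >= 0.4 and a CUDA GPU.
# The task list is an assumption; the thread does not state which tasks
# the 0.55/0.56 averages cover.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=openlm-research/open_llama_7b_v2,dtype=float16",
    tasks=["arc_challenge", "hellaswag", "piqa", "winogrande"],
)

# Each of these tasks reports an "acc,none" metric under this API version.
accs = [task_res["acc,none"] for task_res in results["results"].values()]
print("average acc:", sum(accs) / len(accs))
```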

Doubts

The two models' performance is similar. Do you have a deeper analysis of why?

young-geng commented 1 year ago

We haven't comprehensively compared the two models. We are planning to do more comparisons soon.

gjmulder commented 1 year ago

Great work guys! I appreciate how complex and difficult an undertaking this is when you don't have the unlimited resources of a major tech company.

I also appreciate how frustrating it is when people log problems that are obviously out of the scope of the project, or that are so poorly defined that their own process, rather than the OpenLLaMA model, is more likely the root cause of their issue.

EthanChen1234 commented 1 year ago

@gjmulder It's great work, and I appreciate the effort that went into training the LLM.

As you mentioned, resources are limited, which makes it necessary to design experiments very carefully. Comparing the dataset categories, v2 looks almost the same as v1.

If it is convenient for you, could you explain the experimental purpose of v2?

young-geng commented 1 year ago

@EthanChen1234 The dataset of v2 is quite different, as it includes the entire StarCoder dataset, which makes code about 30% of the whole composition. OpenLLaMA is not so much a research project as an effort to build a good, permissively licensed open source replacement for LLaMA. In this sense, we are not planning to investigate a particular research question or write a paper with this project.
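A quick way to poke at the coding ability that the StarCoder data should buy is the transformers usage pattern from the OpenLLaMA README; the prompt below is just an example:

```python
import torch
from transformers import LlamaTokenizer, LlamaForCausalLM

# Load the v2 checkpoint. The slow LlamaTokenizer is used here, since the
# OpenLLaMA README advises against the auto-converted fast tokenizer.
model_path = "openlm-research/open_llama_7b_v2"
tokenizer = LlamaTokenizer.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)

# Example code-completion prompt to sanity-check the effect of the code mix.
prompt = "def fibonacci(n):"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
output = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```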

EthanChen1234 commented 1 year ago

@young-geng @gjmulder thanks.