[Closed] rom1504 closed this issue 1 year ago
GPT-4 is predisposed to prefer models trained on data bootstrapped using InstructGPT/GPT-4/ChatGPT over more factual and useful content.
To the best of my knowledge, @imoneoi mentioned that he was working on alternative benchmarks, including MMLU. I believe he reached just shy of 50% on MMLU, whereas GPT-3.5 reaches 70%. (https://arxiv.org/abs/2303.08774)
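For reference, MMLU is a four-way multiple-choice benchmark spanning 57 subjects, and the headline number is just the accuracy of the predicted answer letter, typically macro-averaged over subjects. A minimal sketch of that computation, assuming per-question predictions have already been collected (the data structure below is illustrative, not the actual evaluation harness):

```python
from collections import defaultdict

def mmlu_accuracy(predictions):
    """Compute per-subject and overall MMLU accuracy.

    `predictions` is an illustrative list of dicts such as
    {"subject": "college_physics", "predicted": "B", "answer": "B"}.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for p in predictions:
        total[p["subject"]] += 1
        correct[p["subject"]] += int(p["predicted"] == p["answer"])

    per_subject = {s: correct[s] / total[s] for s in total}
    # Macro-average over subjects, as commonly reported for MMLU.
    overall = sum(per_subject.values()) / len(per_subject)
    return per_subject, overall
```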
Thanks for your comments. Exactly, we are in the process of verifying the real reason behind this interesting phenomenon: is the model genuinely powerful, or is the evaluation biased? It is worth noting that other SFT models have also been fine-tuned on GPT-4-generated data, yet our performance is much better. It would be helpful if you have any insights!
[2023/07] We released the OpenLLMs model series. Among them, OpenChat obtains an 80.9% win rate on AlpacaEval and 105% of ChatGPT's performance on the Vicuna GPT-4 evaluation.
Are you saying your model is generally better than a 10x bigger model?
If not, what is the plan to fix metrics so they show the expected ranking?
Note that AlpacaEval is a highly recognized benchmark in the community; many well-known SFT-trained models (Vicuna, Alpaca, UltraLM, etc.) have been evaluated on it and achieve the expected rankings. To a certain extent, this benchmark can provide valuable insights into the performance of data-centric methods.
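For readers unfamiliar with the metric, AlpacaEval's headline number is essentially the fraction of prompts on which a GPT-4 judge prefers the candidate model's response over a reference model's response. A minimal sketch of that computation, assuming the pairwise judgments have already been collected (the field names are illustrative, not AlpacaEval's actual schema):

```python
def win_rate(judgments):
    """Fraction of prompts where the judge preferred the candidate model.

    `judgments` is an illustrative list of dicts such as
    {"prompt_id": 17, "preferred": "candidate"}   # or "reference" / "tie"
    Ties, if present, are counted as half a win here.
    """
    score = 0.0
    for j in judgments:
        if j["preferred"] == "candidate":
            score += 1.0
        elif j["preferred"] == "tie":
            score += 0.5
    return 100.0 * score / len(judgments)
```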
However, we acknowledge that the metrics are not perfect, and recent studies have shown that the win rate under GPT-4 evaluation is correlated with response length (the number of unique tokens in a response). Auto-evaluating a chatbot is a daunting task, and the community has made a long-term effort to improve the evaluation of language generation. We will continue to consider this issue, refrain from focusing solely on rankings, and explore other evaluation methods, including human evaluation. Thank you for your valuable suggestions.
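One concrete way to probe this length-bias concern is to correlate the judge's preferences with response length. A minimal sketch using a point-biserial correlation, assuming win/loss flags and token counts have already been collected (the function and argument names are illustrative):

```python
from scipy.stats import pointbiserialr

def length_bias(wins, token_counts):
    """Correlate binary judge preference with response length.

    `wins`         - illustrative list of 0/1 flags (1 = judge preferred the model)
    `token_counts` - number of tokens in the corresponding responses
    A clearly positive correlation suggests the judge rewards longer answers.
    """
    r, p_value = pointbiserialr(wins, token_counts)
    return r, p_value

# Example: if longer responses tend to win, r comes out positive.
# r, p = length_bias([1, 1, 0, 1, 0], [450, 510, 120, 600, 90])
```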
Thank you for your interest in our OpenLLMs model series, and for bringing up important questions about our performance metrics.
To clarify, we believe that OpenChat is a powerful alignment method that demonstrates the importance of data quality and diversity in achieving high performance. While we acknowledge that there is still a gap between OpenChat and ChatGPT, as shown by the 80.7% vs. 86.1% scores on AlpacaEval, we are proud that, despite being trained on only 6K examples, OpenChat still outperforms other open-source models that use 0.2~1M examples.
We understand that the evaluation of language generation models is a complex and ongoing challenge, and we are committed to exploring different evaluation methods, including human evaluation, to gain a deeper understanding of our model's strengths and limitations.
We are also continuously working to improve our alignment methods and to train larger models to achieve even better performance. We are excited to release OpenChat V2 and V3 in the future and will keep the community updated on our progress.
Thank you again for your valuable suggestions and for your interest in our work.
Awesome. I'm glad to see that you're aware of the perils of automated evaluation of generative models. I trust that your work here, as well as future versions, is indeed as good as reported. I'm quite interested in this project personally, and I'd love to find some time to contribute in the future.
Notice there's this new MT-bench that is supposed to replace the Vicuna GPT-4 eval and shows a bigger gap between open models and GPT-3.5: https://twitter.com/lmsysorg/status/1675612625273761793
Are you saying your model is generally better than a 10x bigger model?
If not, what is the plan to fix metrics so they show the expected ranking?