uclaml / SPIN

The official implementation of Self-Play Fine-Tuning (SPIN)
https://uclaml.github.io/SPIN/
Apache License 2.0

Evaluation results on MT Bench and BBH #5

Closed · ftmtk closed this issue 9 months ago

ftmtk commented 9 months ago

Hello,

Thanks for the great work! It seems that SPIN is effective at improving model performance on reasoning and math tasks (TruthfulQA, GSM8K, BB-causal) but less so on knowledge tasks (MMLU).

I have a couple of questions for the authors:

  1. Which MT Bench tasks do you observe contributing the most to the performance gain? My guess would be math, reasoning, and extraction.
     1.1. Do you plan to release the responses from your MT Bench evaluation?
  2. Have you tested on the full suite of BBH? Did you observe performance gains across all tasks, or only on the subsets that you reported in the paper?
  3. The results in Figure 5 are intriguing: do you have results for the other HF Open LLM Leaderboard benchmarks, e.g., MMLU and GSM8K? Do you observe the same phenomenon on those benchmarks as well?

I would really appreciate it if the authors could kindly respond to some of my questions.

Kind thanks, FengTing

yihedeng9 commented 9 months ago

Hi, thanks for your interest! Regarding your questions:

  1. The detailed MT-Bench performance is illustrated in Figure 6 of our paper. We observed notable improvements in writing, STEM, and roleplay tasks. It's important to note that these performance gains vary depending on the specific SFT dataset that SPIN starts from. Our use of the UltraChat dataset, which consists of user-GPT dialogues and may be limited in certain fields, may explain the smaller improvement on other tasks.
     1.1. Yes, we will release the MT-Bench responses in the near future.
  2. Our study primarily concentrated on the Open LLM Leaderboard, and we extended our analysis to additional tasks such as MT-Bench and selected BBH tasks. The choice of BBH tasks was random. Given the specific focus of our SFT dataset, it's possible that our model does not show improvements on tasks in fields not related to or covered by the SFT dataset. (A sketch for scoring the full BBH suite is included after this list.)
  3. Yes, we observed the same phenomenon on the other benchmarks. These additional figures were not included in the current version of the paper but will be incorporated in our next revision.
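
For reference, below is a minimal sketch of how the full BBH suite could be scored with EleutherAI's lm-evaluation-harness (`pip install lm-eval`). The checkpoint ID is a placeholder and the `bbh_cot_fewshot` task-group name assumes a recent (v0.4+) harness, so adjust both to your setup:

```python
# Minimal sketch: evaluate a SPIN checkpoint on the full BBH suite with
# lm-evaluation-harness. The model ID below is a placeholder, and the task
# group name may differ across harness versions (e.g. bbh vs. bbh_cot_fewshot).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=<path-or-hub-id-of-SPIN-checkpoint>,dtype=bfloat16",
    tasks=["bbh_cot_fewshot"],  # full BBH group; pass individual subtask names to compare subsets
    batch_size=8,
)

# Print per-subtask metrics so gains can be compared across all BBH subsets,
# not just the ones reported in the paper.
for task, metrics in results["results"].items():
    print(task, metrics)
```

Running this once on the SFT baseline and once on a SPIN iteration would show whether the gains hold on the subsets not reported in the paper.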

We appreciate your interest and hope this answers your questions :)