princeton-nlp / LESS

[ICML 2024] LESS: Selecting Influential Data for Targeted Instruction Tuning

Question about accuracy #12

Closed: fyf3 closed this issue 2 months ago

fyf3 commented 3 months ago

Hello, I have some questions about the accuracy of llama2-7b.

In Table 5, the accuracy of llama2-7b-base on MMLU/TYDIQA/BBH is reported as 46.7/52.1/39.8, but when we evaluate llama2-7b from "https://huggingface.co/meta-llama/Llama-2-7b-hf/tree/main" we get 46.0/42.5/40.4. Why is it so different from the table?

Also, we trained with the provided selected data and got 50.0/54.8/41.1 on MMLU/TYDIQA/BBH, and our own reproduction gives 49.3/54.0/42.3; both are lower than the 50.2/56.2/41.5 reported in the table.

Can you kindly explain it to me? Thanks!

xiamengzhou commented 3 months ago

Hi -- thanks for your interest in our work and for trying out the experiments!

**Inconsistency between Table 5 and your evaluation results**

Firstly, I would like to point out that Table 5 does not present the base model performance of llama-2-7b-hf. Instead, it shows the performance obtained when the top 5% of the data is selected using gradients from the llama-2-7b-hf model. The actual performance of the llama-2-7b-hf base model can be found in the first column of Table 10. To further investigate the reported results, I reran the experiments on an H100 GPU and obtained the following:

|        | Run on H100 | Reported in paper (Table 10) | Your result |
|--------|-------------|------------------------------|-------------|
| BBH    | 38.5        | 38.3                         | 40.4        |
| TydiQA | 47.5        | 46.4                         | 42.5        |
| MMLU   | 45.7        | 45.6                         | 46.0        |

You can see that this new run still does not fully reproduce the results reported in the paper. The paper numbers were obtained on different hardware and in an environment I no longer have access to, so I am not sure what exactly causes the gap. It is also worth noting that subtle variations in batch size can affect evaluation results, as discussed in this known issue.
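
To make the batch-size point concrete, here is a minimal, hypothetical sketch (not the LESS evaluation code) comparing the logits a causal LM produces for the same inputs when run one example at a time versus in a padded batch. `gpt2` is used only as a small stand-in model:

```python
# Hypothetical sketch: padding and batched kernels change the floating-point
# accumulation order, so logits can drift slightly with batch size.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

texts = ["The capital of France is", "Answer the following question:"]

with torch.no_grad():
    # Batch size 1: no padding is involved.
    solo = []
    for t in texts:
        out = model(**tok(t, return_tensors="pt"))
        solo.append(out.logits[0, -1])  # logits at the last real token

    # Batch size 2: sequences are right-padded to a common length.
    batch = tok(texts, return_tensors="pt", padding=True)
    out = model(**batch)
    lengths = batch["attention_mask"].sum(dim=1)
    for i in range(len(texts)):
        batched = out.logits[i, lengths[i] - 1]  # same last real token
        diff = (batched - solo[i]).abs().max().item()
        print(f"example {i}: max |logit diff| = {diff:.2e}")
```

The per-example differences are tiny, but near-tied predictions can occasionally flip, which is enough to move a benchmark score by a few tenths of a point.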

Despite this, there remains a discrepancy between my latest run and your results. I have uploaded the code I used to obtain these results here. I hope this will help you in reproducing my results.
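
As a side note on what Table 5 measures: the selection step it refers to boils down to keeping the top 5% of training examples ranked by their gradient-based influence scores. Below is a hypothetical sketch of just that ranking step, with placeholder scores and a made-up pool size (the real scores come from the gradient pipeline in this repo):

```python
# Hypothetical sketch of a "top 5% by score" selection step. The scores here
# are random placeholders; in LESS they would be gradient-based influence
# scores computed with the llama-2-7b-hf warmup model.
import torch

num_examples = 270_000                  # made-up training-pool size
scores = torch.randn(num_examples)      # placeholder influence scores
k = int(0.05 * num_examples)            # keep the top 5%

top = torch.topk(scores, k)
selected = top.indices.sort().values    # indices of the selected examples
print(f"selected {selected.numel()} / {num_examples} examples")
```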

> we trained using the provided selected data and the results on MMLU/TYDIQA/BBH are 50.0/54.8/41.1, and you get 50.2/56.2/41.5 in the table

The main issue seems to be TydiQA. In retrospect, we realized that for TydiQA we selected the third checkpoint instead of the last one, because we observed a significant performance degradation after the fourth epoch. This decision is consistent with our approach for random selection, so we reported the epoch=3 results in the paper. The full results are presented below.

Ideally, a validation set should be used to determine the optimal checkpoint. To make this process more rigorous, one could form a validation set from the training data of this dataset. However, we found that using just 9 examples as a validation set was too noisy for TydiQA.

|         | Epoch=1 | Epoch=2 | Epoch=3 | Epoch=4 |
|---------|---------|---------|---------|---------|
| seed=3  | 56.4    | 56.8    | 56.4    | 53.9    |
| seed=6  | 54.5    | 54.0    | 55.5    | 54.6    |
| seed=9  | 54.7    | 54.6    | 56.8    | 54.9    |
| Average | 55.2    | 55.2    | 56.2    | 54.5    |
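
For reference, here is a tiny sketch of the checkpoint-selection logic described above, hard-coding the TydiQA numbers from the table (small rounding differences from the table's averages are possible):

```python
# Average the per-seed TydiQA scores from the table above and pick the epoch
# with the best mean; this reproduces the epoch=3 choice.
scores = {  # epoch -> [seed=3, seed=6, seed=9]
    1: [56.4, 54.5, 54.7],
    2: [56.8, 54.0, 54.6],
    3: [56.4, 55.5, 56.8],
    4: [53.9, 54.6, 54.9],
}

means = {epoch: sum(v) / len(v) for epoch, v in scores.items()}
best_epoch = max(means, key=means.get)
for epoch in sorted(means):
    print(f"epoch {epoch}: {means[epoch]:.1f}")
print("selected checkpoint: epoch", best_epoch)  # -> epoch 3
```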

> the results of our reproduction are 49.3/54.0/42.3, lower than the reported 50.2/56.2/41.5

For TydiQA, the issue might be addressed by selecting the appropriate checkpoint. As for MMLU, I am not sure why it scores about 1 point lower; it could be due to variance across runs 😕

Hope this is helpful, and let me know if you have further questions!