princeton-nlp / LESS

[ICML 2024] LESS: Selecting Influential Data for Targeted Instruction Tuning

Question about accuracy #12

Closed: fyf3 closed this issue 2 months ago

fyf3 commented 3 months ago

Hello, I have some questions about the accuracy of llama2-7b.

In Table 5, the accuracy of llama2-7b-base on MMLU/TYDIQA/BBH is reported as 46.7/52.1/39.8, but when we evaluate llama2-7b from "https://huggingface.co/meta-llama/Llama-2-7b-hf/tree/main" we get 46.0/42.5/40.4. Why is it so different from the table?

Also, we trained with the provided selected data and got 50.0/54.8/41.1 on MMLU/TYDIQA/BBH, and our own reproduction gives 49.3/54.0/42.3; both are lower than the 50.2/56.2/41.5 reported in the table.

Can you kindly explain it to me? Thanks!

xiamengzhou commented 3 months ago

Hi -- thanks for your interest in our work and for trying out the experiments!

**Inconsistency between Table 5 and your evaluation results**

Firstly, I would like to point out that Table 5 does not present the base model performance of llama-2-7b-hf. Instead, it shows the performance obtained when the top 5% of the data is selected using gradients from the llama-2-7b-hf model. The actual performance of the llama-2-7b-hf base model can be found in the first column of Table 10. To further investigate the reported results, I reran the experiments on an H100 GPU and obtained the following:

|        | Run on H100 | Reported in paper (Table 10) | Your result |
|--------|-------------|------------------------------|-------------|
| BBH    | 38.5        | 38.3                         | 40.4        |
| TydiQA | 47.5        | 46.4                         | 42.5        |
| MMLU   | 45.7        | 45.6                         | 46.0        |

You can see that this new run still does not fully reproduce the results reported in the paper. The paper numbers were obtained on different hardware and in an environment I no longer have access to, so I am not sure what exactly causes the gap. It is also worth noting that subtle variations in batch size can affect evaluation results, as discussed in this known issue.
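
To make the batch-size point concrete, here is a minimal, hypothetical sketch (not the LESS evaluation code) comparing the logits a causal LM produces for the same inputs when run one example at a time versus in a padded batch. `gpt2` is used only as a small stand-in model:

```python
# Hypothetical sketch: padding and batched kernels change the floating-point
# accumulation order, so logits can drift slightly with batch size.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

texts = ["The capital of France is", "Answer the following question:"]

with torch.no_grad():
    # Batch size 1: no padding is involved.
    solo = []
    for t in texts:
        out = model(**tok(t, return_tensors="pt"))
        solo.append(out.logits[0, -1])  # logits at the last real token

    # Batch size 2: sequences are right-padded to a common length.
    batch = tok(texts, return_tensors="pt", padding=True)
    out = model(**batch)
    lengths = batch["attention_mask"].sum(dim=1)
    for i in range(len(texts)):
        batched = out.logits[i, lengths[i] - 1]  # same last real token
        diff = (batched - solo[i]).abs().max().item()
        print(f"example {i}: max |logit diff| = {diff:.2e}")
```

The per-example differences are tiny, but near-tied predictions can occasionally flip, which is enough to move a benchmark score by a few tenths of a point.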

Despite this, there remains a discrepancy between my latest run and your results. I have uploaded the code I used to obtain these results here. I hope this will help you in reproducing my results.
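
As a side note on what Table 5 measures: the selection step it refers to boils down to keeping the top 5% of training examples ranked by their gradient-based influence scores. Below is a hypothetical sketch of just that ranking step, with placeholder scores and a made-up pool size (the real scores come from the gradient pipeline in this repo):

```python
# Hypothetical sketch of a "top 5% by score" selection step. The scores here
# are random placeholders; in LESS they would be gradient-based influence
# scores computed with the llama-2-7b-hf warmup model.
import torch

num_examples = 270_000                  # made-up training-pool size
scores = torch.randn(num_examples)      # placeholder influence scores
k = int(0.05 * num_examples)            # keep the top 5%

top = torch.topk(scores, k)
selected = top.indices.sort().values    # indices of the selected examples
print(f"selected {selected.numel()} / {num_examples} examples")
```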

> we trained using the provided selected data and the results on MMLU/TYDIQA/BBH are 50.0/54.8/41.1, and you get 50.2/56.2/41.5 in the table

The main issue seems to be TydiQA. In retrospect, we realized that for TydiQA we selected the third checkpoint instead of the last one, because we observed a significant performance degradation after the fourth epoch. This decision is consistent with our approach for random selection, so we reported the epoch=3 results in the paper. The full results are presented below.

Ideally, a validation set should be used to determine the optimal checkpoint. To make this process more rigorous, one could form a validation set from the training data of this dataset. However, we found that using just 9 examples as a validation set was too noisy for TydiQA.

|         | Epoch=1 | Epoch=2 | Epoch=3 | Epoch=4 |
|---------|---------|---------|---------|---------|
| seed=3  | 56.4    | 56.8    | 56.4    | 53.9    |
| seed=6  | 54.5    | 54.0    | 55.5    | 54.6    |
| seed=9  | 54.7    | 54.6    | 56.8    | 54.9    |
| Average | 55.2    | 55.2    | 56.2    | 54.5    |
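
For reference, here is a tiny sketch of the checkpoint-selection logic described above, hard-coding the TydiQA numbers from the table (small rounding differences from the table's averages are possible):

```python
# Average the per-seed TydiQA scores from the table above and pick the epoch
# with the best mean; this reproduces the epoch=3 choice.
scores = {  # epoch -> [seed=3, seed=6, seed=9]
    1: [56.4, 54.5, 54.7],
    2: [56.8, 54.0, 54.6],
    3: [56.4, 55.5, 56.8],
    4: [53.9, 54.6, 54.9],
}

means = {epoch: sum(v) / len(v) for epoch, v in scores.items()}
best_epoch = max(means, key=means.get)
for epoch in sorted(means):
    print(f"epoch {epoch}: {means[epoch]:.1f}")
print("selected checkpoint: epoch", best_epoch)  # -> epoch 3
```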

> the results of our reproduction are 49.3/54.0/42.3, lower than the reported 50.2/56.2/41.5

For TydiQA, the issue might be addressed by selecting the appropriate checkpoint. As for MMLU, I am not sure why it scores about 1 point lower; it could be due to variance across runs 😕

Hope this is helpful, and let me know if you have further questions!