Hi! In section 3.1.3 of your paper, it reads:

We do not include any training sets from commonly used benchmarks in our annealing data. This enables us to assess the true few-shot learning capabilities and out-of-domain generalization of Llama 3.

Then, in the following paragraph:

Following OpenAI (2023a), we evaluate the efficacy of annealing on the GSM8k (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021b) training sets in annealing. We find that annealing improved the performance of a pre-trained Llama 3 8B model on the GSM8k and MATH validation sets by 24.0% and 6.4%,

I am confused, since GSM8k is also a commonly used public benchmark. Could you clarify which SFT training sets your models were pre-trained or fine-tuned on? This is quite important for research on SFT data. Maybe you could describe the data used for the two versions separately: pre-trained and instruct.

Thanks for any help! @wukaixingxp