I'm trying to reproduce the summaries generated by the HF models, namely Phi-2 and Llama 3.2-1B Instruct, because the results I get by following the described prompt / pipeline are not close to those on the leaderboard. Comparing the summaries I generate with the ones in the HF dataset, I found a large difference. For example, one thing these models struggle with in my runs, which I don't see in the HF dataset, is sentence repetition.
So my question is: what generation config is used for the Hugging Face models? I'm currently using the text-generation pipeline with do_sample=False (I found it mentioned in another issue that a temperature of 0 was used). If the code can be provided, it would also be helpful for seeing what gives rise to this variation in results.
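For reference, here is a minimal sketch of the setup I mean, assuming the standard transformers text-generation pipeline. The helper names (greedy_generation_kwargs, summarize), the max_new_tokens value, and the optional repetition_penalty knob are my own assumptions for illustration, not the leaderboard's actual settings.

```python
def greedy_generation_kwargs(max_new_tokens=256, repetition_penalty=None):
    """Build keyword arguments for deterministic (greedy) decoding.

    max_new_tokens=256 is an assumed value, not a confirmed leaderboard setting.
    """
    kwargs = {
        "do_sample": False,         # greedy decoding, the "temperature 0" equivalent
        "max_new_tokens": max_new_tokens,
        "return_full_text": False,  # return only the continuation, not the prompt
    }
    # Optional knob for checking whether it explains the sentence repetition.
    if repetition_penalty is not None:
        kwargs["repetition_penalty"] = repetition_penalty
    return kwargs


def summarize(model_id, prompt, **overrides):
    """Generate one summary with the text-generation pipeline (hypothetical helper)."""
    from transformers import pipeline  # imported lazily so the helper above has no dependencies

    generator = pipeline("text-generation", model=model_id)
    kwargs = {**greedy_generation_kwargs(), **overrides}
    outputs = generator(prompt, **kwargs)
    return outputs[0]["generated_text"]
```

Calling summarize(model_id, prompt) and then summarize(model_id, prompt, repetition_penalty=1.1) on the same document would show whether the repetition comes from the decoding settings or from something else, such as the prompt format.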
Edit: I also can't reproduce the leaderboard score for Llama 3.2-1B using the generated summaries in the linked HF dataset, because:
1 - I don't know what threshold was used to determine whether a response is hallucinated or consistent. Edit: I will use top_k = 1.
2 - The dataset includes the omitted samples (its length is 1006, not 850ish).
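To make the two points above concrete, this is roughly how I would compute the score once both unknowns are settled. It is only a sketch: the 0.5 threshold, the "summary" and "score" field names, and the rule that an empty summary marks an omitted sample are all assumptions on my side, and scores are assumed to be consistency scores in [0, 1] where higher means more consistent.

```python
def drop_omitted(rows, is_omitted):
    """Keep only the rows that the caller-supplied predicate does not flag as omitted."""
    return [row for row in rows if not is_omitted(row)]


def hallucination_rate(scores, threshold=0.5):
    """Fraction of summaries whose consistency score falls below the threshold.

    threshold=0.5 is an assumed cutoff; the value actually used is the open question.
    """
    if not scores:
        raise ValueError("no scores to aggregate")
    hallucinated = sum(1 for score in scores if score < threshold)
    return hallucinated / len(scores)


# Toy rows with hypothetical field names, standing in for the HF dataset.
rows = [
    {"summary": "A short summary.", "score": 0.9},
    {"summary": "", "score": 0.1},  # treated as omitted under the assumed rule
    {"summary": "Another summary.", "score": 0.3},
]
kept = drop_omitted(rows, lambda row: not row["summary"].strip())
rate = hallucination_rate([row["score"] for row in kept])
print(len(kept), rate)
```

With the real dataset, the predicate passed to drop_omitted would need to match whatever rule reduces the 1006 rows to the 850ish that were scored.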